A notation in biology is any standardized system of symbols, letters, or numbers used to represent biological information in a compact, widely understood way. Biology uses dozens of notation systems, from the uppercase and lowercase letters that represent dominant and recessive alleles to the single-letter codes for amino acids and DNA bases. These systems let scientists communicate complex information precisely, whether they’re naming a species, describing a genetic cross, or sharing a DNA sequence across a database.
Genetic Notation: Alleles and Crosses
The most familiar notation system in biology comes from genetics, and it dates back to Gregor Mendel’s pea plant experiments. Mendel used a capital letter like “A” to represent one form of a trait (such as round seeds) and a lowercase “a” for the alternative form (wrinkled seeds). This convention captured a core genetic concept: uppercase means dominant, lowercase means recessive. A plant with the combination “Aa” carries one of each and will display the dominant trait.
Modern genetics still follows this basic framework. A homozygous dominant organism is written as AA, a heterozygous one as Aa, and a homozygous recessive as aa. In some systems, a superscript “+” marks the wild-type allele, meaning the version most common in nature. When geneticists write out a cross between two organisms, they use a multiplication sign (Aa × Aa) and show the predicted offspring ratios, often with a Punnett square grid.
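The arithmetic behind a Punnett square can be sketched in a few lines of Python. This is a minimal illustration, assuming a single gene with two alleles written as one uppercase and one lowercase letter; the function names `cross` and `phenotype_ratio` are hypothetical, chosen only for this example.

```python
from collections import Counter
from itertools import product


def cross(parent1: str, parent2: str) -> Counter:
    """Count offspring genotypes for a single-gene cross such as Aa x Aa."""
    counts = Counter()
    # Each parent passes on one of its two alleles; pair every possibility,
    # which is exactly what the cells of a Punnett square enumerate.
    for allele1, allele2 in product(parent1, parent2):
        # Sorting puts the uppercase (dominant) allele first: "aA" -> "Aa".
        genotype = "".join(sorted(allele1 + allele2))
        counts[genotype] += 1
    return counts


def phenotype_ratio(counts: Counter) -> tuple:
    """Return (dominant, recessive) offspring counts from genotype counts."""
    dominant = sum(n for genotype, n in counts.items()
                   if any(letter.isupper() for letter in genotype))
    recessive = sum(counts.values()) - dominant
    return dominant, recessive


offspring = cross("Aa", "Aa")
print(dict(offspring))
print(phenotype_ratio(offspring))  # (3, 1)
```

For the Aa × Aa cross, the four cells of the square give one AA, two Aa, and one aa, which is the 3:1 ratio of dominant to recessive phenotypes.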
Gene and Protein Symbols
One notation rule trips up students constantly: genes and proteins are written differently even when they share the same name. Gene symbols are italicized, while the proteins they encode are written in regular (non-italicized) text. The HUGO Gene Nomenclature Committee (HGNC) endorses this convention specifically so readers can tell at a glance whether a sentence is discussing a stretch of DNA or the molecule it produces.
For example, the gene BRCA1 (italicized) contains instructions for making the BRCA1 protein (not italicized). In many organisms, gene symbols also follow capitalization rules. Human gene symbols are written in all capitals (BRCA1), while the corresponding mouse gene uses only an initial capital (Brca1). These seemingly small formatting details prevent real confusion when researchers work across species.
DNA and RNA Base Codes
DNA sequences are written using the four familiar letters: A (adenine), C (cytosine), G (guanine), and T (thymine). RNA replaces T with U (uracil). But real-world sequencing isn’t always clean, and sometimes a position in a sequence could be one of several bases. For these ambiguous spots, biologists use an expanded set of single-letter codes established by the International Union of Pure and Applied Chemistry (IUPAC).
The most common ambiguity code is N, which means “any base.” Others are more specific: R means either of the two purines (A or G), Y means either pyrimidine (C or T), S means a strong-bonding pair (C or G), and W means a weak-bonding pair (A or T). There are 11 ambiguity codes in total, each with a mnemonic to help remember it. These codes appear constantly in genomics, especially when comparing sequences across individuals or species where variation exists at certain positions.
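One way to see how these codes work is to treat each one as a set of allowed bases. The sketch below lists the standard IUPAC DNA codes (the text above names only N, R, Y, S, and W; the remaining six come from the same standard) and checks whether a concrete sequence is consistent with an ambiguous one. The names `IUPAC_DNA` and `matches` are hypothetical, and the comparison assumes DNA rather than RNA.

```python
# Each IUPAC code maps to the set of bases it can stand for.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",   # puRine
    "Y": "CT",   # pYrimidine
    "S": "CG",   # Strong pairing
    "W": "AT",   # Weak pairing
    "K": "GT",   # Keto
    "M": "AC",   # aMino
    "B": "CGT",  # not A
    "D": "AGT",  # not C
    "H": "ACT",  # not G
    "V": "ACG",  # not T
    "N": "ACGT", # aNy base
}


def matches(pattern: str, sequence: str) -> bool:
    """Return True if a concrete sequence fits an ambiguous IUPAC pattern."""
    if len(pattern) != len(sequence):
        return False
    return all(base in IUPAC_DNA[code]
               for code, base in zip(pattern.upper(), sequence.upper()))


print(matches("GRN", "GAT"))  # True: R allows A, N allows T
print(matches("GRN", "GCT"))  # False: R does not allow C
```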
Amino Acid Abbreviations
Proteins are chains of amino acids, and biologists represent the 20 standard amino acids using either a three-letter or one-letter code. The three-letter versions are intuitive: Ala for alanine, Gly for glycine, Ser for serine. The one-letter codes are more compact but less obvious. Some make sense (G for glycine, L for leucine), while others seem arbitrary (W for tryptophan, K for lysine) because the most intuitive letters were already taken by other amino acids.
Here are a few common examples:
- Ala / A for alanine
- Cys / C for cysteine
- Glu / E for glutamic acid
- Phe / F for phenylalanine
- Trp / W for tryptophan
- Val / V for valine
The one-letter codes are essential for writing out long protein sequences efficiently. A protein hundreds of amino acids long would be unwieldy in three-letter notation but fits neatly in a single line of single-letter code. When scientists describe a mutation, they combine these codes with position numbers. “V600E” means that valine (V) at position 600 in the protein chain has been replaced by glutamic acid (E).
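The mutation shorthand is regular enough to be read by a short script. The following is a minimal sketch, assuming only simple substitutions of the form letter, position, letter; the table `ONE_TO_THREE` and the function `parse_substitution` are hypothetical names used for illustration.

```python
import re

# One-letter to three-letter codes for the 20 standard amino acids.
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}

# Original residue, 1-based position, replacement residue (e.g. "V600E").
SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_substitution(label: str) -> tuple:
    """Split a label like 'V600E' into (original, position, replacement)."""
    match = SUBSTITUTION.match(label)
    if match is None:
        raise ValueError(f"not a simple substitution: {label!r}")
    original, position, replacement = match.groups()
    if original not in ONE_TO_THREE or replacement not in ONE_TO_THREE:
        raise ValueError(f"unknown amino acid code in {label!r}")
    return ONE_TO_THREE[original], int(position), ONE_TO_THREE[replacement]


print(parse_substitution("V600E"))  # ('Val', 600, 'Glu')
```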
Binomial Nomenclature for Species
The system for naming species is one of the oldest and most recognizable notations in all of biology. Every species gets a two-part Latin or Latinized name: the genus (always capitalized) followed by the specific epithet (always lowercase). Both parts are italicized. Examples include Homo sapiens, Escherichia coli, and Tyrannosaurus rex.
After the first use in a document, the genus can be abbreviated to its initial: E. coli, T. rex. But the species name never stands alone. Writing just “coli” or “sapiens” without the genus or its abbreviation is incorrect, because many different genera can share the same species name. If two genera in the same text share the same first letter, the genus must be spelled out in full to avoid ambiguity.
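The abbreviation rule, including the exception for genera that share an initial, can be expressed as a small function. This is only a sketch: `short_forms` is a hypothetical helper, and it assumes every name is a plain two-word binomial.

```python
from collections import defaultdict


def short_forms(binomials: list) -> dict:
    """Map each full binomial to the form to use after its first mention."""
    # Group the genera appearing in the text by their first letter.
    genera_by_initial = defaultdict(set)
    for name in binomials:
        genus, _epithet = name.split()
        genera_by_initial[genus[0]].add(genus)

    result = {}
    for name in binomials:
        genus, epithet = name.split()
        if len(genera_by_initial[genus[0]]) > 1:
            # Two genera share this initial, so keep the genus spelled out.
            result[name] = name
        else:
            result[name] = f"{genus[0]}. {epithet}"
    return result


print(short_forms(["Escherichia coli", "Homo sapiens"]))
print(short_forms(["Escherichia coli", "Enterococcus faecalis"]))
```

In the second call both genera begin with E, so neither name is shortened.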
Enzyme Commission Numbers
Enzymes, the proteins that drive chemical reactions in cells, have their own numerical classification system. Each enzyme receives an EC (Enzyme Commission) number made of four numbers separated by periods, like EC 1.1.1.1 (alcohol dehydrogenase) or EC 3.4.21.4 (trypsin). The first number places the enzyme into one of seven broad categories based on what type of reaction it catalyzes:
- EC 1 for oxidoreductases (transfer electrons)
- EC 2 for transferases (move chemical groups between molecules)
- EC 3 for hydrolases (break bonds using water)
- EC 4 for lyases (break bonds without water)
- EC 5 for isomerases (rearrange atoms within a molecule)
- EC 6 for ligases (join two molecules together)
- EC 7 for translocases (move ions or molecules across membranes)
The second and third numbers narrow down the subclass and sub-subclass, specifying the type of bond or chemical group involved. The fourth number identifies the specific enzyme. This system means any biologist in the world can look at an EC number and immediately understand what kind of reaction that enzyme performs, regardless of what common name it goes by in different organisms.
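Because the format is strictly hierarchical, an EC number can be unpacked mechanically. The sketch below is illustrative only: `parse_ec` and `EC_CLASSES` are hypothetical names, the table includes class 7 (translocases), which was added to the scheme in 2018, and the example numbers EC 1.1.1.1 (alcohol dehydrogenase) and EC 3.4.21.4 (trypsin) are chosen for illustration.

```python
# Top-level EC classes; class 7 (translocases) was added in 2018.
EC_CLASSES = {
    1: "oxidoreductases",
    2: "transferases",
    3: "hydrolases",
    4: "lyases",
    5: "isomerases",
    6: "ligases",
    7: "translocases",
}


def parse_ec(ec_number: str) -> dict:
    """Split a complete EC number such as 'EC 3.4.21.4' into its four levels."""
    text = ec_number.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    parts = text.split(".")
    if len(parts) != 4 or not all(part.isdigit() for part in parts):
        raise ValueError(f"expected four numbers separated by periods: {ec_number!r}")
    top, subclass, sub_subclass, serial = (int(part) for part in parts)
    if top not in EC_CLASSES:
        raise ValueError(f"unknown EC class: {top}")
    return {
        "class_name": EC_CLASSES[top],
        "subclass": subclass,
        "sub_subclass": sub_subclass,
        "serial": serial,
    }


print(parse_ec("EC 1.1.1.1")["class_name"])   # oxidoreductases
print(parse_ec("EC 3.4.21.4")["class_name"])  # hydrolases
```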
Sequence File Formats
When biologists share DNA or protein sequences digitally, they use standardized file formats. The most common is FASTA format, where each sequence starts with a header line beginning with a “>” symbol, followed by a unique identifier and information about the organism. The actual sequence of letters (A, T, G, C for DNA, or amino acid codes for proteins) begins on the next line, typically with no more than 80 characters per line.
The header line must stay on a single line with no line breaks. Databases that accept sequence submissions, such as NCBI's GenBank, add stricter rules of their own, for example limiting the sequence identifier to 25 characters using only letters, digits, hyphens, underscores, and a few other basic symbols. Any ambiguous positions in the sequence use the IUPAC codes described above, with N as the default for completely unknown bases. These formatting rules ensure that sequences can be read correctly by databases and analysis software worldwide.
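The structure described above is simple enough to read with a few lines of code. The following is a minimal sketch rather than a full parser: `parse_fasta` is a hypothetical function, the sample records are made up for illustration, and the code does not enforce any database-specific limits on identifiers.

```python
def parse_fasta(text: str) -> list:
    """Parse FASTA text into (identifier, description, sequence) tuples."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is the first word after ">"; the rest is free text.
            parts = line[1:].split(maxsplit=1)
            if not parts:
                raise ValueError("header line has no identifier")
            description = parts[1] if len(parts) > 1 else ""
            records.append([parts[0], description, []])
        elif records:
            # Sequence lines are wrapped, so collect them and join later.
            records[-1][2].append(line)
        else:
            raise ValueError("sequence data found before the first header")
    return [(identifier, description, "".join(chunks))
            for identifier, description, chunks in records]


# Made-up example records, including ambiguity codes N, R, and Y.
example = """>seq1 Homo sapiens example fragment
ACGTNN
ACGT
>seq2
GGRYC
"""
for record in parse_fasta(example):
    print(record)
```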
Pedigree Chart Symbols
In medical genetics, family trees called pedigree charts use a standardized set of shapes. Males are represented by squares, females by circles, and individuals whose sex is unspecified by diamonds. A shape that is filled in (shaded) indicates that the person is clinically affected by the condition being studied. Carriers, people who carry one copy of a recessive allele without showing symptoms, are sometimes shown with a dot in the center of their symbol or with the shape half-filled.
Horizontal lines between two symbols indicate a mating pair, and vertical lines connect parents to their offspring. Generations are arranged in rows, with the oldest at the top. These symbols are standardized by professional genetic counseling organizations so that any clinician or researcher can read a pedigree chart and immediately understand the inheritance pattern being described.
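The shape conventions amount to a lookup from sex and affected status to a symbol. As a rough sketch, the table below uses Unicode shapes as stand-ins for the drawn symbols; the names `SYMBOLS` and `pedigree_symbol` are hypothetical, and carrier markings are left out for brevity.

```python
# (sex, affected) -> text stand-in for the drawn pedigree symbol.
SYMBOLS = {
    ("male", False): "□",
    ("male", True): "■",
    ("female", False): "○",
    ("female", True): "●",
    ("unspecified", False): "◇",
    ("unspecified", True): "◆",
}


def pedigree_symbol(sex: str, affected: bool = False) -> str:
    """Return the symbol for one individual in a pedigree chart."""
    try:
        return SYMBOLS[(sex, affected)]
    except KeyError:
        raise ValueError(f"unknown sex category: {sex!r}") from None


# An unaffected father, an affected mother, and a child of unspecified sex.
print(pedigree_symbol("male"), pedigree_symbol("female", affected=True))
print(pedigree_symbol("unspecified"))
```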

