What Is a DNA Sequence? From Bases to Proteins

A DNA sequence is the specific order of chemical “letters” along a strand of DNA. These letters represent four molecules called bases: adenine (A), cytosine (C), guanine (G), and thymine (T). Every instruction your body needs to build and maintain itself is encoded in the arrangement of these four bases, and a single copy of the human genome contains roughly 3 billion of them, distributed across 23 chromosomes.

The Four Bases and the Backbone

A DNA molecule consists of two long chains twisted into the familiar double helix. Each chain is built from repeating units called nucleotides. Every nucleotide has three parts: a sugar (deoxyribose), a phosphate group, and one of the four bases. The sugars and phosphates link together to form a sturdy “backbone,” while the bases point inward like rungs on a twisted ladder.

The two strands hold together because bases on opposite strands pair in a predictable way: A always pairs with T, and C always pairs with G. This means the two strands are complementary rather than identical: each base on one strand dictates the base opposite it. If you know the sequence of one strand, you automatically know the sequence of its partner.

Each strand also has a built-in direction, labeled 5′ (five-prime) at one end and 3′ (three-prime) at the other, and the two strands of a helix run in opposite directions. This matters because the enzymes that copy DNA or transcribe it into RNA can build a new strand only in the 5′-to-3′ direction. When scientists write out a DNA sequence, they follow this same convention, listing bases from 5′ to 3′.
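
The pairing and direction rules can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the example sequence are made up, and the code assumes the input contains only the letters A, C, G, and T.

```python
# Pairing rules from the text: A pairs with T, C pairs with G.
PAIRS = str.maketrans("ACGT", "TGCA")

def complement(seq):
    """Return the partner base at each position, in the same left-to-right order."""
    return seq.upper().translate(PAIRS)

def reverse_complement(seq):
    """Return the partner strand written in its own 5'-to-3' direction."""
    return complement(seq)[::-1]

strand = "ATGCGT"                     # hypothetical sequence, written 5' to 3'
print(complement(strand))             # TACGCA (partner strand, read 3' to 5')
print(reverse_complement(strand))     # ACGCAT (partner strand, read 5' to 3')
```

Because the partner strand runs in the opposite direction, writing it in the usual 5′-to-3′ convention means reversing it as well as swapping each base.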

How a Sequence Becomes a Protein

The practical purpose of most well-studied DNA sequences is to provide blueprints for proteins. The cell first copies a stretch of DNA into a related molecule called messenger RNA. That RNA is then “read” in groups of three bases at a time. Each three-letter group is called a codon, and nearly every codon specifies one of the 20 amino acids that make up proteins; the few exceptions signal where the protein ends.
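
As a rough sketch of these two steps, the Python below copies a DNA sequence into RNA letters and then splits it into codons. It relies on one fact not spelled out above: RNA uses the base uracil (U) where DNA uses thymine (T). It also assumes the input is the gene's coding strand, whose sequence matches the messenger RNA apart from that swap. The function names and the example sequence are hypothetical.

```python
def transcribe(coding_strand):
    """Write a DNA coding strand as messenger RNA by swapping T for U."""
    return coding_strand.upper().replace("T", "U")

def split_codons(mrna):
    """Split an RNA string into three-base codons, ignoring leftover bases at the end."""
    usable = len(mrna) - len(mrna) % 3
    return [mrna[i:i + 3] for i in range(0, usable, 3)]

mrna = transcribe("ATGGCCTGGTAA")     # hypothetical 12-base sequence
print(mrna)                           # AUGGCCUGGUAA
print(split_codons(mrna))             # ['AUG', 'GCC', 'UGG', 'UAA']
```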

Because four bases can be arranged into 64 possible three-letter combinations, and there are only 20 amino acids to code for (plus a “stop” signal), the code is redundant. Several different codons can specify the same amino acid. Small adapter molecules called transfer RNAs serve as matchmakers: one end recognizes a specific codon, and the other end carries the corresponding amino acid. A molecular machine called a ribosome moves along the RNA, reads each codon, accepts the matching adapter, and links the amino acids into a growing protein chain.
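
The redundancy can be checked directly. The sketch below builds the standard genetic code as a lookup table and translates a short, made-up RNA sequence. Two simplifications are assumed: translation starts at the first base rather than searching for a start codon, and the letters are RNA bases (U in place of T). Amino acids appear as their one-letter abbreviations, and "*" marks a stop codon.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, with codons listed in UCAG order
# at each of the three positions. "*" marks a stop codon.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(mrna):
    """Read codons from the first base and stop at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(len(CODON_TABLE))                              # 64 codons (4 * 4 * 4)
print(len(set(CODON_TABLE.values()) - {"*"}))        # 20 amino acids
print(sum(aa == "L" for aa in CODON_TABLE.values())) # 6 codons for leucine
print(translate("AUGGCCUGGUAA"))                     # MAW
```

In this table, 61 codons are shared among the 20 amino acids and 3 are stop signals, which is why several codons can point to the same amino acid.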

Coding vs. Non-Coding DNA

Only about 1 percent of human DNA actually codes for proteins. The other 99 percent is non-coding. That doesn’t mean it’s useless. Non-coding regions include regulatory sequences that control when and where genes are turned on, structural segments that help chromosomes stay organized, and stretches whose functions scientists are still working out.

Even within a gene, not every base ends up in the final protein blueprint. Genes contain alternating segments called exons and introns. Exons carry the protein-coding information. Introns are removed from the RNA copy before it reaches the ribosome, and the remaining exons are spliced together to form the mature message. The pattern of which exons get included can vary, allowing a single gene to produce slightly different proteins in different tissues or at different times.
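
Splicing can be pictured as cutting and joining slices of a string. The example below is a toy sketch: the sequence, the exon coordinates, and the function name are all invented for illustration, and real genes are far longer.

```python
def splice(pre_mrna, exons):
    """Join the exon slices; each exon is (start, end), 0-based with end excluded."""
    return "".join(pre_mrna[start:end] for start, end in exons)

# Toy RNA copy of a gene: three 6-base exons separated by two 6-base introns.
pre_mrna = "AUGGCC" + "GUAAAG" + "UGGAAA" + "GUCCAG" + "CCCUAA"
#           exon 1     intron 1   exon 2     intron 2   exon 3
exons = [(0, 6), (12, 18), (24, 30)]

print(splice(pre_mrna, exons))                   # AUGGCCUGGAAACCCUAA
print(splice(pre_mrna, [exons[0], exons[2]]))    # AUGGCCCCCUAA (exon 2 skipped)
```

The second call mimics alternative splicing: leaving out one exon yields a shorter message from the same gene.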

How DNA Sequences Are Read

Determining the exact order of bases in a DNA sample is called sequencing. The original method, known as Sanger sequencing, reads one short stretch of DNA at a time. It’s accurate and relatively inexpensive per individual read (typically under £1), but slow when applied to an entire genome. The Human Genome Project relied on Sanger technology and took over a decade to produce its finished sequence.

Modern approaches, collectively called next-generation sequencing, break DNA into millions of tiny fragments and read them all simultaneously. This parallelism is transformative: an entire human genome can now be sequenced in a single day. Next-generation methods also detect a broader range of variations, including large insertions, deletions, and mutations present in only a small fraction of cells. A state-of-the-art platform can generate roughly 150 million reads for around £1,000.
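
One rough way to judge a read count is mean coverage, the average number of times each base in the genome is read: number of reads times read length, divided by genome size. The calculation below is only a sketch. The read length of 150 bases is an assumption, not a figure from this article, and the genome size is the approximate 3 billion bases quoted earlier.

```python
def mean_coverage(num_reads, read_length, genome_size):
    """Average number of reads covering each base, assuming reads spread evenly."""
    return num_reads * read_length / genome_size

# 150 million reads (from the text), 150 bases per read (assumed), 3 billion bases.
print(mean_coverage(150_000_000, 150, 3_000_000_000))   # 7.5
```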

Once sequencing is complete, the raw data is stored in standardized digital formats. FASTA files contain plain sequence data, while FASTQ files pair each base with a quality score indicating how confident the machine is in that particular reading. These formats have become the common language for sharing genomic data between research groups and software tools worldwide.
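
To make the FASTQ idea concrete, here is a minimal reader for the common four-line record layout (name, bases, a "+" separator, quality characters). It assumes the widely used Phred+33 encoding, in which a character's ASCII code minus 33 gives the quality score Q, and an error probability of 10^(-Q/10). The record shown is made up, and a sketch this short does no validation.

```python
import io

def read_fastq(handle):
    """Yield (name, bases, scores) for each four-line FASTQ record."""
    while True:
        header = handle.readline().strip()
        if not header:
            break                                 # end of file
        bases = handle.readline().strip()
        handle.readline()                         # "+" separator line, ignored
        quality = handle.readline().strip()
        scores = [ord(ch) - 33 for ch in quality] # assumes Phred+33 encoding
        yield header[1:], bases, scores           # drop the leading "@"

def error_probability(score):
    """Chance that a base call with this Phred score is wrong."""
    return 10 ** (-score / 10)

example = io.StringIO("@read1\nACGT\n+\nII5!\n")  # hypothetical record
for name, bases, scores in read_fastq(example):
    print(name, bases, scores)                    # read1 ACGT [40, 40, 20, 0]
```

A score of 40 corresponds to roughly a 1 in 10,000 chance of error, while a score of 20 corresponds to about 1 in 100.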

Why Small Sequence Differences Matter

The genomes of any two humans are about 99.9 percent identical. The remaining 0.1 percent, roughly 3 million bases out of 3 billion, accounts for much of the physical variation between people. The most common type of variation is a single nucleotide polymorphism, or SNP (pronounced “snip”), where just one base differs at a particular spot in the genome.
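
Spotting single-base differences is simple once two sequences are lined up position by position. The sketch below assumes the sequences are already aligned and equal in length, which real genomes are not without extra work; the two sequences and the function name are invented for illustration.

```python
def find_snps(reference, sample):
    """Return (position, reference_base, sample_base) for each mismatch, 0-based."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be the same length")
    return [(i, ref, alt)
            for i, (ref, alt) in enumerate(zip(reference, sample))
            if ref != alt]

reference = "ACGTTAGC"    # hypothetical reference sequence
sample = "ACGTCAGC"       # same stretch from another person, one base changed
print(find_snps(reference, sample))    # [(4, 'T', 'C')]
```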

SNPs serve as biological markers that help researchers pinpoint genes associated with disease. When a SNP falls inside a gene or near a regulatory region, it can directly affect how that gene works. Clinically, SNPs are used to estimate a person’s risk for certain diseases, predict how they will respond to specific medications, and trace disease-associated variants through families. Consumer ancestry tests also rely heavily on SNP patterns to estimate geographic heritage.

Comparing Sequences Across Species

DNA sequences are also a powerful tool for understanding evolutionary relationships. The first comprehensive comparison of human and chimpanzee genomes found that, when looking at directly alignable DNA, the two species are almost 99 percent identical. When insertions and deletions (places where one species has extra or missing stretches of DNA) are factored in, the overall similarity is about 96 percent. These numbers illustrate how relatively small sequence changes, accumulated over millions of years, can produce dramatically different organisms.
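
The gap between the two figures comes down to which positions are counted. The sketch below computes percent identity for a pair of aligned sequences in both ways, first skipping positions where one sequence has a gap (written "-") and then counting those positions as differences. The sequences are toy examples, not real human or chimpanzee data.

```python
def percent_identity(seq_a, seq_b, count_gaps):
    """Percent of compared positions that match in two aligned sequences."""
    matches = 0
    compared = 0
    for a, b in zip(seq_a, seq_b):
        has_gap = a == "-" or b == "-"
        if has_gap and not count_gaps:
            continue                      # skip insertions and deletions
        compared += 1
        if a == b and not has_gap:
            matches += 1
    return 100 * matches / compared

species_1 = "ACGTA--CGTAC"    # toy alignment, "-" marks a missing stretch
species_2 = "ACGTATTCGAAC"
print(percent_identity(species_1, species_2, count_gaps=False))   # 90.0
print(percent_identity(species_1, species_2, count_gaps=True))    # 75.0
```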

Comparing sequences across species helps researchers identify which stretches of DNA are functionally critical. Regions that have stayed nearly unchanged across distantly related animals are likely performing essential jobs, because mutations there would have been weeded out by natural selection. This comparative approach has been instrumental in discovering regulatory elements that don’t code for proteins but play key roles in development and health.
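
A very simplified version of this search looks for columns that are identical in every species of an alignment, then keeps only the longer unbroken runs. Real methods use statistical models of how fast DNA changes; the code below is only a sketch, and the three aligned sequences, the function names, and the minimum run length are all hypothetical.

```python
def conserved_columns(alignment):
    """Return 0-based columns where every sequence has the same base and no gap."""
    return [i for i, column in enumerate(zip(*alignment))
            if len(set(column)) == 1 and "-" not in column]

def conserved_runs(alignment, min_length):
    """Group conserved columns into runs (start, end), end excluded."""
    conserved = set(conserved_columns(alignment))
    width = len(alignment[0])
    runs = []
    start = None
    for i in range(width + 1):            # one extra step closes a run at the end
        if i < width and i in conserved:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_length:
                runs.append((start, i))
            start = None
    return runs

alignment = [
    "ACGTTGCAAT",    # hypothetical species 1
    "ACGTTGCTAT",    # hypothetical species 2
    "TCGTTGCAAC",    # hypothetical species 3
]
print(conserved_columns(alignment))       # [1, 2, 3, 4, 5, 6, 8]
print(conserved_runs(alignment, 4))       # [(1, 7)]
```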