The human genome was sequenced by breaking all 3 billion base pairs of DNA into millions of small fragments, reading each fragment individually using a chemistry-based method called Sanger sequencing, then using computers to stitch those fragments back together in the correct order. The process took 13 years, cost roughly $3 billion, and involved 20 research centers across six countries. Even then, about 8% of the genome remained unreadable until 2022, when newer technology finally completed the job.
The Core Technology: Sanger Sequencing
Every base pair in the original Human Genome Project was read using a chain-termination technique developed by Frederick Sanger and colleagues in 1977. The method works by copying a strand of DNA in a test tube, but spiking the reaction with special modified bases that stop the copying process at random points. Each time a modified base gets incorporated, the growing DNA chain terminates. Run the reaction enough times and you get fragments of every possible length, each one ending at a different position along the original strand.
The trick is that each of the four terminating bases (A, T, C, and G) carries a different fluorescent color. A laser reads the color at the end of each fragment, shortest to longest, and the result is a readout of the DNA sequence one letter at a time. A single run could read about 500 to 800 base pairs. Reading 3 billion base pairs therefore required millions of these reactions, performed by hundreds of automated sequencing machines running around the clock.
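The readout logic above can be sketched in a few lines of Python. This is a toy model, not real base-calling software: it assumes the reaction produced exactly one fragment terminating at every position and that the fragments can be ordered perfectly by length.

```python
# Toy model of dideoxy chain termination (an illustration, not real
# base-calling). Assumes one fragment terminated at every position.
def sanger_read(template: str) -> str:
    # Each fragment is a partial copy ending in a dye-labeled terminator;
    # its last base identifies the letter at that position.
    fragments = [template[:i] for i in range(1, len(template) + 1)]
    # Read fragments shortest to longest, one terminal base at a time.
    fragments.sort(key=len)
    return "".join(frag[-1] for frag in fragments)

print(sanger_read("ATCGGATC"))  # recovers the template: ATCGGATC
```

Reading the terminal base of each fragment in length order reconstructs the original sequence, which is exactly what the laser-and-electrophoresis setup did physically.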
Two Competing Strategies
The publicly funded Human Genome Project and the private company Celera Genomics took fundamentally different approaches to organizing all that sequencing work. The disagreement between them shaped how the project unfolded.
The Public Project: Hierarchical Shotgun
The international consortium used a methodical, map-first approach. Researchers began by breaking the genome into large, overlapping chunks roughly 150,000 base pairs long, stored inside bacteria as “bacterial artificial chromosomes,” or BACs. Each BAC was mapped to a known location on a specific chromosome using physical landmarks called sequence tagged sites. This created an ordered library of the genome, like tearing a book into chapters and numbering them before reading.
Once a BAC’s position was known, its DNA was shattered into tiny overlapping fragments, each fragment was sequenced individually, and software pieced them back together. Because researchers already knew where each BAC belonged on the chromosome, they could assemble the full genome region by region. The process was slow and expensive, but it produced highly accurate, well-organized results.
The Private Approach: Whole-Genome Shotgun
Celera Genomics, led by Craig Venter, proposed skipping the mapping step entirely. Instead of organizing the genome into mapped chunks first, the whole-genome shotgun method shreds the entire genome at once into small fragments, sequences everything, and relies on powerful computers to find overlaps and reconstruct the original order. This is faster and cheaper in principle, but far more computationally demanding, especially in repetitive regions of DNA where many fragments look nearly identical.
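The overlap-and-merge idea at the heart of shotgun assembly can be illustrated with a greedy sketch. Real assemblers, including Celera's, used far more sophisticated graph-based algorithms and handled sequencing errors; the fragment strings below are invented for illustration.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a matching a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(fragments: list[str]) -> str:
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(fragments)
    while len(frags) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j and (n := overlap(a, b)) > best_n:
                    best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no overlaps left: disjoint or too-repetitive data
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags)
                 if k not in (best_i, best_j)] + [merged]
    return frags[0]

# Overlapping "reads" from an invented 15-base sequence, out of order:
reads = ["GCGTGCAA", "ATGGCGTG", "GCAATGCC"]
print(greedy_assemble(reads))  # ATGGCGTGCAATGCC
```

The computational difficulty the text mentions shows up directly here: this pairwise search is quadratic in the number of fragments, and with tens of millions of real reads, plus repeats that create false overlaps, the problem becomes enormous.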
The two camps announced their draft sequences together at the White House on June 26, 2000. Later analysis, however, revealed that Celera’s assembly relied heavily on the public project’s data. Celera had taken the consortium’s assembled sequence, broken it back into simulated fragments, and combined those with their own data before running their assembly software. The result was not a pure test of the whole-genome shotgun method on a genome this large.
The Software That Made It Possible
Sequencing machines generated raw data, but computers did the real heavy lifting. A program called Phred read the output from sequencing machines and assigned a quality score to each base, essentially grading how confident the machine was in each letter. Another program, Phrap, took those scored fragments and assembled them into longer continuous stretches by finding where fragments overlapped. A graphical editor called Consed let researchers visually inspect the assemblies and identify problem areas that needed additional sequencing.
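Phred's quality scale is logarithmic: Q = −10 · log₁₀(P), where P is the estimated probability that a base call is wrong. The conversion formula is the standard one; the example values below are ours.

```python
import math

def phred_quality(p_error: float) -> float:
    """Convert an estimated per-base error probability to a Phred score."""
    return -10 * math.log10(p_error)

def error_probability(q: float) -> float:
    """Invert the scale: the error probability implied by a Phred score."""
    return 10 ** (-q / 10)

print(round(phred_quality(0.001)))  # Q30: a 1-in-1,000 error chance
print(error_probability(40))        # 0.0001: one error in 10,000 bases
```

Each 10-point jump in quality means a tenfold drop in the chance of a wrong base, which is why a Q40 base (1 error in 10,000) was so much more trustworthy than a Q20 base (1 in 100).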
These tools were critical because the genome is full of repetitive sequences, regions where the same or very similar patterns appear thousands of times. Repetitive DNA can trick assembly software into collapsing distinct regions into one or placing fragments in the wrong location. Human review, guided by quality scores, caught many of these errors.
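A toy example shows why repeats are dangerous. When reads are shorter than a repeat region, reads taken from many different positions are byte-for-byte identical, so an assembler has no way to tell how many copies of the repeat exist. The sequence below is invented.

```python
# An invented tandem repeat: the same 4-base unit, six copies in a row.
genome = "ACGT" * 6
read_len = 6

# One read per possible start position inside the repeat.
reads = [genome[i:i + read_len] for i in range(len(genome) - read_len + 1)]

# 19 reads were taken, but only 4 distinct sequences exist among them:
# the assembler cannot tell whether the unit repeats six times or sixty.
print(len(reads), len(set(reads)))  # 19 4
```

This is the collapse failure mode described above: overlap-based software sees the identical reads as one region and shortens the repeat, which is exactly the kind of error that quality scores and human review in Consed helped catch.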
From Draft to Finished Sequence
The project hit its milestones in stages. It officially launched on October 1, 1990. The working draft, covering about 90% of the genome, was announced in June 2000 and published in February 2001. The finished sequence followed on April 14, 2003, two years ahead of the original 15-year schedule.
“Finished” had a precise definition. The sequence could contain no more than one error per 10,000 base pairs, with no gaps. Every region needed to be confirmed by sequencing both strands of DNA or verified with an alternative chemistry. Problem areas like compressed sequences and repeats had to be actively investigated. No more than 5% of any given clone could rely on single-strand coverage, and even that was only acceptable if the quality scores were high enough. This finishing work was painstaking, often requiring targeted re-sequencing of stubborn regions one at a time.
The finished sequence covered 99% of the gene-containing portion of the genome to 99.99% accuracy. The final scientific description was published in October 2004.
What the Sequence Revealed
Before sequencing began, estimates of how many protein-coding genes humans carry ranged wildly, from 30,000 to 150,000, with some early guesses reaching into the millions. The initial analysis in 2001 put the number at 30,000 to 40,000. Even that turned out to be too high. The current best estimate is around 20,000 protein-coding genes, fewer than many scientists expected for an organism as complex as a human. The finding reshaped biology’s understanding of how complexity arises, shifting focus from gene count to how genes are regulated, spliced, and expressed.
The 8% That Was Left Behind
The 2003 “complete” genome was not truly complete. About 8% of the genome, roughly 200 million base pairs, remained unsequenced. The missing regions were concentrated in highly repetitive areas: the centers of chromosomes (centromeres), the tips (telomeres and subtelomeres), ribosomal DNA arrays, and large duplicated segments. These regions resisted the BAC cloning process because bacteria struggle to maintain highly repetitive DNA, and short Sanger reads could not span the repeats well enough for software to assemble them correctly.
Some parts of the reference weren’t just incomplete but actively wrong. The centromere sequences were filled in with computer-generated placeholder models. Parts of chromosome 21’s short arm were falsely duplicated. Across the genome, a deletion bias hinted at regions where the assembly had collapsed repetitive sequences together.
In March 2022, the Telomere-to-Telomere (T2T) Consortium published the first truly complete human genome sequence: 3.055 billion base pairs with no gaps. The effort used long-read sequencing technologies that can read tens of thousands of base pairs in a single pass, finally spanning the repetitive regions that defeated Sanger sequencing. The complete sequence added nearly 200 million base pairs of new information and identified 1,956 previously unrecognized genes, 99 of which are predicted to code for proteins. When researchers reanalyzed genetic data from over 3,200 people using the new reference, it simultaneously reduced both missed variants and false alarms compared to the old reference.
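The advantage of long reads over repeats can be seen in a toy setting: once a single read spans an entire repeat array plus its unique flanking sequence, every read is distinct and the repeat's copy number is pinned down unambiguously. The sequences below are invented.

```python
def distinct_read_fraction(genome: str, read_len: int) -> float:
    """Fraction of reads (one per start position) that are unique."""
    reads = [genome[i:i + read_len]
             for i in range(len(genome) - read_len + 1)]
    return len(set(reads)) / len(reads)

# An invented repeat array ("ACGT" x 6) flanked by unique sequence.
genome = "TTGA" + "ACGT" * 6 + "GGCA"

print(distinct_read_fraction(genome, 6))   # short reads: many collapse
print(distinct_read_fraction(genome, 30))  # reads span the array: 1.0
```

Real long-read technologies do this at scale: a read tens of thousands of bases long can cross a repeat array and land in unique sequence on both sides, which is what finally let the T2T Consortium assemble the centromeres and other regions that defeated 500-to-800-base Sanger reads.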
From 13 Years to One Day
The original project took 13 years of coordinated international effort. Today, long-read sequencing platforms have reportedly generated a complete, telomere-to-telomere human genome sequence in a single day. The cost has dropped even more dramatically: sequencing a human genome now costs well under $1,000, compared to the $3 billion price tag of the original project. That collapse in time and cost is what made modern applications possible, from diagnosing rare genetic diseases to screening tumors for targeted cancer therapy.

