Long-Read Sequencing: What It Is and How It Works

Long-read sequencing is a DNA and RNA sequencing technology that reads stretches of genetic code thousands to tens of thousands of bases long in a single pass. Short-read sequencing methods read only about 100 to 300 bases at a time, then rely on software to piece those tiny fragments together, like assembling a jigsaw puzzle with millions of nearly identical pieces. Long-read sequencing skips much of that guesswork by reading fragments that are typically 10,000 bases or longer, with some reads exceeding 100,000 bases.

How It Differs From Short-Read Sequencing

Short-read sequencing, dominated by Illumina technology, has been the workhorse of genomics for over a decade. It produces reads around 100 to 300 base pairs long with very high accuracy, and it’s relatively cheap per base. The trade-off is that short reads struggle with repetitive regions of the genome, areas where the same or nearly the same sequence appears over and over. When you try to reconstruct the full picture from tiny overlapping snippets, repetitive sections look identical, and the assembly software can’t tell where each piece belongs.
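
The ambiguity is easy to reproduce on a toy scale. The sketch below uses a made-up 40-base "genome" containing two copies of an 8-base repeat, and a hypothetical helper named placements that lists every position a read could have come from. It is only an illustration: real aligners tolerate mismatches and work on sequences millions of times larger.

```python
def placements(genome, read):
    """Return every start position where `read` matches `genome` exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions


# Made-up sequence: the block ATATGCGC appears twice, separated by unique sequence.
genome = (
    "CCGTAAGC"    # unique
    "ATATGCGC"    # repeat, copy 1 (starts at position 8)
    "TTGACCAA"    # unique
    "ATATGCGC"    # repeat, copy 2 (starts at position 24)
    "GGCTTAAC"    # unique
)

short_read = "ATATGCGC"       # lies entirely inside the repeat
longer_read = "GCATATGCGCTT"  # repeat plus two unique bases on each side

print(placements(genome, short_read))   # [8, 24] -> two equally good placements
print(placements(genome, longer_read))  # [6]     -> one unambiguous placement
```

The read that stays inside the repeat fits in two places, while the read that reaches into unique flanking sequence fits in only one.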

Long-read sequencing solves this by reading through those repetitive stretches in one continuous pass. A single read of 10,000 or 20,000 bases can span an entire repetitive region and anchor itself in unique sequence on either side, making assembly far more straightforward. This difference matters enormously for studying complex parts of the genome that short reads simply cannot resolve.
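
Whether a read "spans" a repeat comes down to simple coordinate arithmetic. The following is a minimal sketch, assuming half-open coordinates on a reference and an arbitrary minimum anchor of 50 unique bases on each side. The function name spans_repeat, the anchor threshold, and the example positions are all illustrative assumptions, not values from any particular tool.

```python
def spans_repeat(read_start, read_end, repeat_start, repeat_end, min_anchor=50):
    """Return True if the read covers the whole repeat and extends at least
    `min_anchor` bases into flanking sequence on both sides.
    Coordinates are half-open: [start, end)."""
    left_anchor = repeat_start - read_start
    right_anchor = read_end - repeat_end
    return left_anchor >= min_anchor and right_anchor >= min_anchor


# Hypothetical 6,000-base repeat occupying positions 10,000 to 16,000.
repeat = (10_000, 16_000)
short_read = (12_000, 12_150)  # 150-base read that sits inside the repeat
long_read = (5_000, 20_000)    # 15,000-base read

print(spans_repeat(*short_read, *repeat))  # False
print(spans_repeat(*long_read, *repeat))   # True
```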

The Two Major Platforms

Two companies produce the vast majority of long-read sequencing instruments: PacBio and Oxford Nanopore Technologies. They use fundamentally different approaches to reading DNA.

PacBio HiFi Sequencing

PacBio’s method is called Single Molecule Real-Time (SMRT) sequencing. A single DNA molecule is placed in a tiny well, and an enzyme begins copying it. As the enzyme adds each new base, it releases a small flash of fluorescent light, and the color of that flash identifies which base was added. The clever part is that PacBio circularizes the DNA molecule so the enzyme loops around it multiple times. Each pass generates an independent reading of the same sequence, and combining those readings produces what’s called a HiFi read. The result is a long read, typically around 15,000 to 20,000 bases, with over 99.9% accuracy. More than 90% of bases reach a quality score of Q30 or higher, meaning the estimated chance that any one of those bases is wrong is 1 in 1,000 or less.
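
The reason repeated passes raise accuracy can be shown with a deliberately simplified model. The sketch below takes several passes over the same molecule, assumes they are already aligned and equal in length, and keeps the most common base in each column. This is not PacBio's actual consensus algorithm, which works from raw signal and a statistical model; the function name majority_consensus and the example sequences are invented for illustration.

```python
from collections import Counter


def majority_consensus(passes):
    """Column-wise majority vote over equal-length, pre-aligned passes."""
    if not passes:
        raise ValueError("need at least one pass")
    length = len(passes[0])
    if any(len(p) != length for p in passes):
        raise ValueError("toy model assumes equal-length, pre-aligned passes")
    consensus = []
    for column in zip(*passes):
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Five hypothetical passes over the same 8-base stretch; two contain one error each.
passes = [
    "ACGTTAGC",
    "ACGTAAGC",  # error at position 4 (T read as A)
    "ACGTTAGC",
    "ACCTTAGC",  # error at position 2 (G read as C)
    "ACGTTAGC",
]
print(majority_consensus(passes))  # ACGTTAGC
```

Because the errors in individual passes fall at different positions, each one is outvoted by the other passes.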

Oxford Nanopore Sequencing

Oxford Nanopore takes a completely different approach. A protein pore, just a few nanometers wide, sits in an electrically charged membrane. A strand of DNA or RNA is threaded directly through this pore. As each base passes through the narrowest point, it disrupts the electrical current flowing through the pore in a characteristic way. By measuring those current changes, the system identifies each base in real time. There’s no copying step and no fluorescent labeling. The technology sequences native DNA or RNA molecules directly.

Nanopore reads can be extraordinarily long. While typical reads average several thousand bases, ultralong reads exceeding 100,000 bases are possible. Earlier generations of nanopore sequencing had error rates above 5%, but the latest chemistry and improved computational analysis have pushed accuracy above 99%, with recent studies reporting quality scores near Q28, or roughly 99.8% accuracy. The percentage of reads that get filtered out due to errors remains higher than PacBio’s (around 17% versus less than 1% in one direct comparison), but the gap has narrowed significantly.
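
Quality scores such as Q28 and Q30 are on the Phred scale, where Q = -10 × log10(P) and P is the estimated probability that a base call is wrong. A short sketch of the conversion follows; the function names are chosen here for illustration.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p)


for q in (20, 28, 30):
    p = phred_to_error_prob(q)
    print(f"Q{q}: error probability {p:.5f}, accuracy {100 * (1 - p):.2f}%")
# Q20: error probability 0.01000, accuracy 99.00%
# Q28: error probability 0.00158, accuracy 99.84%
# Q30: error probability 0.00100, accuracy 99.90%
```

Every 10-point step on the scale is a tenfold drop in the error rate, so the distance between Q28 and Q30 is smaller than the numbers might suggest.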

Detecting Structural Variants

One of the biggest practical advantages of long-read sequencing is finding structural variants: large-scale rearrangements in DNA that include deletions, insertions, duplications, and inversions. These changes involve chunks of DNA that are hundreds or thousands of bases long, and they play a major role in genetic disease and cancer.

Short-read sequencing misses many of these variants, especially when they fall within repetitive regions. Long-read sequencing can detect 20,000 to 30,000 structural variants per human genome, which is three to six times more than short-read methods find. For insertions and complex rearrangements in highly repetitive areas, the sensitivity improvement can be tenfold.

This capability has revealed pathogenic changes that conventional testing missed entirely. Researchers have identified tandem duplications disrupting genes critical for brain development, deep intronic insertions that create abnormal splicing sites, and mobile genetic element insertions responsible for conditions like epileptic encephalopathy and hemophilia A. These findings were invisible to short-read methods.

Completing the Human Genome

The original Human Genome Project, completed in 2003, left about 8% of the genome unsequenced. Those gaps sat in the most repetitive, structurally complex regions, including centromeres (the constricted regions that join a chromosome’s two arms) and the short arms of certain chromosomes. Neither the sequencing methods of that era nor the short reads that followed could resolve them.

Long-read sequencing made it possible to finally close those gaps. The Telomere-to-Telomere (T2T) Consortium used a combination of PacBio HiFi reads (around 20,000 bases long, with an error rate of about 0.1%) and Oxford Nanopore ultralong reads (over 100,000 bases) to produce the first truly complete, gapless sequence of a human genome, published in 2022 in Science. Nanopore’s ultralong reads could span vast repetitive arrays, while PacBio’s high-accuracy reads resolved the fine details within them.

Haplotype Phasing

Humans carry two copies of each chromosome, one from each parent. Knowing which genetic variants sit on the same chromosome, called haplotype phasing, matters for understanding how variants interact and whether someone carries one or two copies of a disease-causing change. Short reads usually cover only a single variant at a time, so determining which variants are on the same chromosome requires statistical inference based on population-level data.

Long reads spanning 10,000 bases or more routinely cover multiple variants in a single read, allowing direct phasing without any statistical guesswork. If two variants appear on the same long read, they’re definitively on the same chromosome. This is particularly valuable in clinical genetics, where knowing whether two harmful variants are on the same or different copies of a gene changes the diagnosis entirely.
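
Read-based phasing of two variants can be sketched in a few lines. In the toy model below, each read is a dictionary recording whether it carries the reference ("ref") or alternate ("alt") allele at the variant positions it covers. Only reads covering both positions can vote, and a majority vote guards against occasional sequencing errors. The function name phase_pair, the data layout, and the positions are assumptions made for this example; real phasing tools work from aligned reads and handle many variants at once.

```python
from collections import Counter


def phase_pair(reads, pos_a, pos_b):
    """Classify two heterozygous variants as 'cis' (alternate alleles on the
    same chromosome copy), 'trans' (on opposite copies), or 'unknown'.
    Each read maps variant position -> 'ref' or 'alt'."""
    votes = Counter()
    for read in reads:
        if pos_a in read and pos_b in read:
            votes["cis" if read[pos_a] == read[pos_b] else "trans"] += 1
    if not votes or votes["cis"] == votes["trans"]:
        return "unknown"
    return "cis" if votes["cis"] > votes["trans"] else "trans"


# Short reads: each covers only one of the two variant sites.
short_reads = [{1_000: "alt"}, {9_000: "alt"}, {1_000: "ref"}, {9_000: "ref"}]

# Long reads: each covers both sites, 8,000 bases apart.
long_reads = [
    {1_000: "alt", 9_000: "ref"},
    {1_000: "ref", 9_000: "alt"},
    {1_000: "alt", 9_000: "ref"},
]

print(phase_pair(short_reads, 1_000, 9_000))  # unknown
print(phase_pair(long_reads, 1_000, 9_000))   # trans
```

In this example the two alternate alleles never share a read, so they sit on different copies of the chromosome. For a recessive disease gene, that would mean both copies are affected.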

Full-Length RNA Sequencing

Most human genes can produce multiple versions of their protein through a process called alternative splicing, where different segments of the gene are included or excluded in the final messenger RNA. Short-read RNA sequencing chops these transcripts into small pieces and then tries to computationally reconstruct which pieces came from the same version. This reconstruction is error-prone, especially for genes with many similar versions.

Long-read sequencing can capture a full-length transcript in a single read, typically around 10,000 bases, eliminating the need for computational assembly. This reveals the complete structure of each RNA version directly. Studies using PacBio for full-length transcript sequencing have captured four to seven times more distinct RNA versions compared to standard approaches, and those reads also reveal more splice junctions with higher accuracy. Oxford Nanopore adds the ability to sequence RNA directly, without first converting it to DNA, which preserves chemical modifications on the RNA that are lost in other methods.
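
When every read already covers a whole transcript, identifying RNA versions reduces to grouping reads by the ordered list of exons they contain. The sketch below assumes each read has already been converted to a list of exon labels, which in practice requires aligning the read to the genome first. The function name count_isoforms and the exon labels E1 to E4 are hypothetical.

```python
from collections import Counter


def count_isoforms(full_length_reads):
    """Count how many reads support each exon chain (one chain = one RNA version)."""
    return Counter(tuple(read) for read in full_length_reads)


reads = [
    ["E1", "E2", "E3", "E4"],
    ["E1", "E3", "E4"],        # exon 2 skipped
    ["E1", "E2", "E3", "E4"],
    ["E1", "E2", "E4"],        # exon 3 skipped
]

for chain, n in count_isoforms(reads).most_common():
    print("-".join(chain), n)
```

No reconstruction step is needed: each read directly reports one complete exon chain.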

Diagnosing Rare Genetic Diseases

Roughly half of patients with suspected rare genetic diseases remain undiagnosed after standard genetic testing. Long-read sequencing is beginning to close that gap. A study published in Nature Communications applied long-read whole-genome sequencing to 51 patients who had already tested negative with conventional short-read methods and uncovered additional diagnoses in 10% of cases. Among those newly diagnosed patients was a case of spinal muscular atrophy identified through a methylation pattern at the disease locus, a finding that matters because the condition is treatable when caught.

Long-read sequencing is especially effective for disorders caused by repeat expansions, where a short DNA sequence is copied dozens or hundreds of extra times. These expansions drive a range of neurological and neuromuscular diseases, including Huntington’s disease, fragile X syndrome, and several forms of ataxia. Short-read sequencing usually cannot measure the full length of an expanded repeat or characterize its internal structure. Long reads can sequence straight through the expansion, providing an exact count and revealing structural details that affect disease severity.
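
Counting repeat units in a read that runs straight through an expansion can be sketched with a regular expression. The example below looks for the longest uninterrupted run of a repeat unit, using CAG (the unit expanded in Huntington’s disease) and an invented read with made-up flanking sequence. The function name longest_repeat_run is an assumption for this sketch, and dedicated repeat-genotyping tools also have to cope with sequencing errors and interrupted repeats.

```python
import re


def longest_repeat_run(sequence, unit):
    """Return the largest number of back-to-back copies of `unit` in `sequence`."""
    pattern = re.compile(f"(?:{re.escape(unit)})+")
    runs = [len(m.group(0)) // len(unit) for m in pattern.finditer(sequence)]
    return max(runs, default=0)


# Invented read: unique flank, 42 uninterrupted CAG units, then more flank.
read = "GGCTTC" + "CAG" * 42 + "CAACAGCCGCCA"
print(longest_repeat_run(read, "CAG"))  # 42
```

A 150-base short read could hold at most 50 CAG units and would have no flanking sequence left to show where the repeat begins and ends.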

Current Limitations

Cost remains higher than short-read sequencing for equivalent coverage, though prices have dropped substantially in recent years. Throughput is also lower: generating the same depth of coverage across a human genome takes more time and money with long reads. For applications where short reads work perfectly well, such as counting gene expression levels across thousands of genes or identifying single-letter DNA changes in well-characterized regions, short-read sequencing is still more practical.
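
"Coverage" here means the average number of times each position in the genome is read, which is approximately the number of reads times the read length, divided by the genome size. The sketch below turns that relationship around to estimate how many reads a given coverage requires. The genome size of about 3.1 billion bases, the 30-fold target, and the two read lengths are illustrative assumptions, and the calculation says nothing about cost per read.

```python
def reads_needed(genome_size, coverage, read_length):
    """Number of reads required so that reads * read_length / genome_size >= coverage."""
    total_bases = genome_size * coverage
    return -(-total_bases // read_length)  # ceiling division on integers


GENOME_SIZE = 3_100_000_000  # approximate human genome size in bases (assumed)
TARGET_COVERAGE = 30         # assumed target depth

print(reads_needed(GENOME_SIZE, TARGET_COVERAGE, 150))     # 620000000
print(reads_needed(GENOME_SIZE, TARGET_COVERAGE, 15_000))  # 6200000
```

The total number of bases sequenced is the same in both cases; only the number of pieces it arrives in differs.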

Computational demands are another consideration. The large file sizes and specialized analysis pipelines required for long-read data mean that processing and interpreting results requires more computing power and bioinformatics expertise than short-read workflows. The tools are maturing rapidly, but they’re not yet as standardized or user-friendly as the short-read ecosystem that has had a decade-long head start.

For many genomics projects, the ideal approach combines both technologies: short reads for cost-effective, high-accuracy coverage of straightforward regions, and long reads to resolve the complex parts of the genome that short reads cannot reach.