Whole Genome Sequencing: A Comprehensive Workflow Guide

Whole Genome Sequencing (WGS) is a laboratory and computational process used to determine the complete DNA sequence of an organism’s genome. Rather than examining small sections of the genome, WGS reads all approximately three billion base pairs of the human genome in a single experiment. Reading the entire genetic code offers unparalleled insight into genetic variation, disease association, and evolutionary history. The workflow begins with physical sample preparation and ends with the biological interpretation of the resulting data.

Preparing the Sample and Library Construction

The first phase involves careful preparation of the biological sample to create a sequencing “library.” This begins with the extraction of high-quality genomic DNA from sources like blood, saliva, or tissue. The purity of the extracted DNA is critical, as contaminants such as proteins or salts can interfere with later enzymatic reactions.

Following extraction, the long strands of genomic DNA must be broken into smaller, manageable pieces, a process called fragmentation. For short-read sequencing, DNA is typically fragmented to an average size between 300 and 600 base pairs (bp). This size range is necessary because sequencing instruments can only accurately read a limited length of DNA in a single pass.

The fragmented pieces then undergo library preparation, where specialized sequences known as adapters are ligated to both ends. These short, known DNA sequences serve multiple purposes. They act as binding sites for the flow cell surface and as primers for the sequencing chemistry. Adapters often contain barcode sequences that allow multiple individual samples to be mixed and sequenced simultaneously.
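The barcode-based pooling described above is commonly called demultiplexing: after sequencing, each read is assigned back to its sample using its barcode. The sketch below shows the idea with an exact-match lookup; the barcode sequences and sample names are invented for illustration, and real demultiplexers typically tolerate one or two barcode mismatches.

```python
def demultiplex(reads: list[tuple[str, str]],
                barcode_to_sample: dict[str, str]) -> dict[str, list[str]]:
    """Assign each (barcode, sequence) read to its sample by exact barcode match.

    Reads whose barcode is not recognized go into an "undetermined" bin,
    mirroring the convention used by common demultiplexing tools.
    """
    bins: dict[str, list[str]] = {s: [] for s in barcode_to_sample.values()}
    bins["undetermined"] = []
    for barcode, seq in reads:
        sample = barcode_to_sample.get(barcode, "undetermined")
        bins[sample].append(seq)
    return bins

# Hypothetical barcodes and samples, pooled on one flow cell.
samples = {"ACGTAC": "patient_1", "TGCATG": "patient_2"}
reads = [("ACGTAC", "TTTTGGCC"), ("TGCATG", "GGGGAACC"), ("NNNNNN", "CCCCAATT")]
binned = demultiplex(reads, samples)
print(binned["patient_1"])     # ['TTTTGGCC']
print(binned["undetermined"])  # ['CCCCAATT']
```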

Generating the Raw Data

The sequencing library is loaded onto a specialized glass slide called a flow cell. DNA fragments bind to complementary oligonucleotides anchored to the surface, where they are clonally amplified into millions of identical copies to form dense clusters. This amplification ensures the fluorescent signal is strong enough for accurate detection.

Data generation relies on sequencing by synthesis (SBS), which uses fluorescently labeled nucleotides. In each cycle, only a single nucleotide is incorporated into the growing strand and temporarily blocked by a reversible terminator. A high-resolution camera images the flow cell to record the unique fluorescent signal associated with the incorporated base (A, T, C, or G) at every cluster.

After imaging, a chemical step removes the fluorescent tag and the terminator, allowing the next cycle of synthesis to begin. This cycle of incorporation, imaging, and cleavage is repeated hundreds of times to determine the sequence of bases for each DNA fragment. Software converts the raw light intensity data from the images into a sequence of A’s, T’s, C’s, and G’s.

Initial Computational Cleanup

The sequencing instrument produces a massive volume of raw data requiring processing before biological interpretation. The first computational step transforms raw signal intensities into the standardized FASTQ file format. Each base in a sequence read carries a Phred quality score (Q-score), a logarithmic measure of the probability that the base call is incorrect. A score of Q30, for example, indicates 99.9% accuracy, while a Q20 score indicates 99% accuracy.
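The relationship between a Q-score and its error probability is Q = -10 · log10(P), and FASTQ files store each score as a single ASCII character (the widely used Phred+33 encoding). A minimal sketch of both conversions:

```python
def phred_to_error_prob(q: int) -> float:
    """Convert a Phred quality score to the probability of an incorrect call.

    Q = -10 * log10(P), so P = 10^(-Q/10). Q30 -> 0.001 (99.9% accuracy),
    Q20 -> 0.01 (99% accuracy).
    """
    return 10 ** (-q / 10)

def ascii_to_phred(char: str, offset: int = 33) -> int:
    """Decode a FASTQ quality character to a Q-score (Phred+33 encoding)."""
    return ord(char) - offset

print(phred_to_error_prob(30))  # 0.001
print(phred_to_error_prob(20))  # 0.01
print(ascii_to_phred('?'))      # '?' is ASCII 63, so Q30
```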

Quality Control (QC) assesses the overall reliability of the sequencing run and identifies potential issues like low-quality bases or adapter contamination. This quality assessment is necessary because errors introduced during the physical sequencing process can lead to inaccurate variant calling later in the pipeline.

Following QC, the raw reads undergo trimming and filtering steps. Low-quality bases, typically found at the ends of the reads where the sequencing chemistry degrades, are removed to improve data accuracy. Additionally, any remaining adapter sequences must be computationally trimmed from the reads. This cleanup ensures that only high-confidence sequence information proceeds to the alignment and variant discovery stages.
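The trimming step above can be sketched with two small functions: one that cuts low-quality bases from the 3' end of a read, and one that removes a known adapter sequence. This is a simplified hard-cutoff sketch; production tools (e.g. Trimmomatic or fastp) use sliding-window averages and mismatch-tolerant adapter matching. The adapter string in the example is illustrative.

```python
def trim_low_quality_tail(seq: str, quals: list[int],
                          min_q: int = 20) -> tuple[str, list[int]]:
    """Trim bases from the 3' end while their quality is below min_q.

    Read ends are trimmed because sequencing chemistry degrades over cycles,
    concentrating errors at the tail of each read.
    """
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]

def trim_adapter(seq: str, adapter: str) -> str:
    """Remove an adapter sequence and everything after it, if present."""
    idx = seq.find(adapter)
    return seq[:idx] if idx != -1 else seq

# The last three bases fall below Q20 and are removed.
seq, quals = trim_low_quality_tail("ACGTACGT", [30, 30, 30, 30, 30, 15, 10, 5])
print(seq)  # ACGTA
```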

Mapping, Assembly, and Variant Identification

Once the data is cleaned, the next phase is to align the millions of short reads back to a known reference genome. The reference genome provides a coordinate system for the human sequence. Software tools like Burrows-Wheeler Aligner (BWA) efficiently map the short sequences to their most likely location. The alignment result is typically stored in a Sequence Alignment/Map (SAM) or Binary Alignment/Map (BAM) file, which lists every read, its position, and its alignment quality.
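To make the SAM layout concrete, each alignment record is one tab-separated line whose first eleven fields are mandatory. The sketch below parses a few of them; the read name, position, and sequence in the example are made up for illustration.

```python
def parse_sam_record(line: str) -> dict:
    """Parse selected mandatory fields of a tab-separated SAM alignment record."""
    fields = line.rstrip("\n").split("\t")
    return {
        "qname": fields[0],       # read name
        "flag":  int(fields[1]),  # bitwise flags (e.g. 16 = reverse strand)
        "rname": fields[2],       # reference sequence name (chromosome)
        "pos":   int(fields[3]),  # 1-based leftmost mapping position
        "mapq":  int(fields[4]),  # mapping quality (Phred-scaled)
        "cigar": fields[5],       # alignment description (e.g. 5M = 5 matches)
        "seq":   fields[9],       # the read sequence itself
    }

record = parse_sam_record("read1\t0\tchr1\t10468\t60\t5M\t*\t0\t0\tACGTA\tIIIII")
print(record["rname"], record["pos"])  # chr1 10468
```

BAM is simply a compressed binary encoding of the same records, which is why the two formats are interchangeable in most pipelines.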

Reference-based mapping is the standard approach for human WGS projects, aiming to identify differences from the established reference. This differs from de novo assembly, which is used for organisms without a reference genome, where reads are pieced together to build the genome from scratch. The mapping process includes marking and removing duplicate reads that arose from the clonal amplification step.

The central computational task is variant calling, which uses sophisticated algorithms to compare the aligned reads against the reference genome and decide whether each observed difference is a true biological variant or a sequencing error. This process identifies Single Nucleotide Variants (SNVs), where a single base pair has changed, and small insertions or deletions (Indels).

The final step is annotation, which labels the identified variants with information about their location and potential biological consequence. Annotation software determines if a variant falls within a protein-coding region (exon), a non-coding region (intron), or a regulatory element. If a variant is in an exon, the annotation predicts its effect on the resulting protein.
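A toy version of positional annotation checks a variant's coordinate against a gene model: outside the gene it is intergenic, inside an exon it is exonic, otherwise intronic. The coordinates and exon intervals below are invented for illustration; real annotators (e.g. VEP or ANNOVAR) consult curated transcript databases.

```python
def annotate_position(pos: int,
                      exons: list[tuple[int, int]],
                      gene_span: tuple[int, int]) -> str:
    """Classify a variant position against a simplified single-gene model."""
    start, end = gene_span
    if not (start <= pos <= end):
        return "intergenic"  # outside the gene entirely
    for exon_start, exon_end in exons:
        if exon_start <= pos <= exon_end:
            return "exonic"  # may alter the encoded protein
    return "intronic"        # within the gene but between exons

# Hypothetical gene spanning positions 1000-5000 with two exons.
exons = [(1000, 1200), (4800, 5000)]
print(annotate_position(1100, exons, (1000, 5000)))  # exonic
print(annotate_position(3000, exons, (1000, 5000)))  # intronic
print(annotate_position(9000, exons, (1000, 5000)))  # intergenic
```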

Assigning Biological and Clinical Meaning

The final set of annotated variants represents all differences from the reference sequence and requires extensive filtering for interpretation. The first step involves filtering out common, benign variants present in the general population. This is accomplished by consulting large population databases, such as the Genome Aggregation Database (gnomAD), which aggregates sequencing data from hundreds of thousands of generally healthy individuals.

Variants present at a high frequency in gnomAD are considered common polymorphisms and are unlikely to cause a rare disease. Analysts focus on rare or novel variants, which are then cross-referenced against specialized clinical databases, such as ClinVar or the Human Gene Mutation Database (HGMD). These databases indicate whether a variant has previously been associated with a known disease or trait, providing evidence for its clinical significance.
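The population-frequency filter can be sketched as a simple allele-frequency cutoff. A 1% threshold is a common convention for rare-disease analysis, though the exact cutoff is study-dependent; the variant identifiers and frequencies below are fabricated for illustration, and variants absent from the database are kept, since absence from gnomAD is itself evidence of rarity.

```python
def filter_rare_variants(variants: list[dict], max_af: float = 0.01) -> list[dict]:
    """Keep variants whose population allele frequency is below max_af.

    Variants with no recorded frequency (absent from the database) default
    to 0.0 and are therefore retained for review.
    """
    return [v for v in variants if v.get("gnomad_af", 0.0) < max_af]

variants = [
    {"id": "chr1:12345:A>G", "gnomad_af": 0.35},    # common polymorphism: removed
    {"id": "chr7:67890:C>T", "gnomad_af": 0.0002},  # rare: kept for review
    {"id": "chr9:11111:G>A"},                       # absent from database: kept
]
print([v["id"] for v in filter_rare_variants(variants)])
# ['chr7:67890:C>T', 'chr9:11111:G>A']
```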

The ultimate goal of this interpretive phase is to link the technical genetic data back to a meaningful biological or clinical context. For research, this may involve identifying variants statistically enriched in a study cohort compared to controls. In a clinical setting, interpretation involves generating a focused report that highlights only the variants relevant to the patient’s condition.