What Is a CADD Score and How Is It Calculated?

High-throughput sequencing identifies millions of genetic variants, including single nucleotide variants (SNVs) and short insertions or deletions (indels). Geneticists must distinguish the harmless variants from the few capable of causing disease, and doing so at this scale requires computational tools that can filter and rank variants efficiently. The Combined Annotation Dependent Depletion (CADD) score was developed as a unified framework that provides a standardized measure of a variant’s potential to be damaging.

Defining the CADD Score

The CADD score is a computational measure that predicts the deleteriousness of genetic variants across the entire human genome. It condenses many lines of evidence into a single number that estimates a variant’s functional impact or likelihood of being pathogenic. Unlike many older prediction tools, CADD is not limited to changes in protein-coding regions.

The framework assigns a score to virtually every possible single nucleotide variant (SNV) and small insertion or deletion (indel) anywhere in the genome, including variants in non-coding regions such as introns, promoters, and regulatory elements. A higher CADD score means the variant is more strongly predicted to be deleterious, that is, more likely to disrupt normal biological function.

The Logic Behind the Prediction

CADD is calculated with a machine learning model trained to tell apart two contrasting sets of genetic variants, one expected to be largely benign and one enriched for functionally disruptive changes. The “proxy-neutral” set consists of variants that arose in the human lineage after the split from chimpanzees and are now fixed or nearly fixed in human populations. Because these variants have persisted through many generations of natural selection, they are presumed to be tolerated.

The contrasting “proxy-deleterious” set consists of computer-simulated variants that represent the spectrum of all possible mutations. Because simulated variants have never been exposed to natural selection, this set still contains the damaging mutations that selection would have removed, so it is enriched for deleterious changes relative to the proxy-neutral set. The machine learning model, which has evolved from a Support Vector Machine to logistic regression, learns the patterns that distinguish the two sets. CADD incorporates over 60 diverse biological features, forming the “Combined Annotation” aspect of its name. These features include measures of evolutionary conservation, proximity to gene boundaries or splice sites, and data on functional genomic elements like transcription factor binding sites. Integrating these sources allows the model to produce a single score that summarizes how likely a variant is to be deleterious.
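The training idea described above can be illustrated with a minimal sketch. This is not the CADD implementation: it assumes only two made-up features (a conservation value and a splice-site flag) instead of CADD’s 60-plus annotations, generates synthetic toy data, and fits a plain logistic regression by gradient descent. All function names, feature names, and numbers here are hypothetical.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def raw_score(features, weights, bias):
    """Linear predictor (log-odds of the simulated class); higher = more suspicious."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def train_logistic(rows, labels, lr=0.5, epochs=300):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = sigmoid(raw_score(x, weights, bias)) - y
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Synthetic toy data (an assumption for illustration, not real annotations).
rng = random.Random(0)

def toy_variant(mean_conservation, splice_rate):
    conservation = min(1.0, max(0.0, rng.gauss(mean_conservation, 0.15)))
    near_splice = 1.0 if rng.random() < splice_rate else 0.0
    return [conservation, near_splice]

proxy_neutral = [toy_variant(0.3, 0.05) for _ in range(200)]      # label 0
proxy_deleterious = [toy_variant(0.6, 0.15) for _ in range(200)]  # label 1

rows = proxy_neutral + proxy_deleterious
labels = [0] * len(proxy_neutral) + [1] * len(proxy_deleterious)
weights, bias = train_logistic(rows, labels)

# Compare a highly conserved position with a poorly conserved one.
conserved = raw_score([0.9, 0.0], weights, bias)
unconserved = raw_score([0.1, 0.0], weights, bias)
print(conserved > unconserved)
```

In this toy setting the model learns a positive weight for conservation, so a variant at a highly conserved position receives a higher raw score than one at a poorly conserved position. The real model combines far more annotations, but the principle of scoring a variant by how much it resembles the simulated set is the same.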

Interpreting the Scores

The final CADD output includes both a raw score and a Phred-scaled score, with the latter being the more commonly used metric. The raw score is the direct output of the model and has no absolute meaning on its own. The Phred-scaled score is a logarithmic transformation of the variant’s rank, by raw score, among all possible single nucleotide variants in the human reference genome.

The interpretation is straightforward: a score of 10 or greater places the variant among the 10% of possible variants predicted to be most deleterious, a score of 20 or greater places it in the top 1%, and a score of 30 or greater in the top 0.1%. Variants known to cause Mendelian diseases often have Phred-scaled scores well above 20, which makes that value a useful reference point when judging likely pathogenicity. Because the score is a relative rank, researchers can directly compare the predicted effects of coding and non-coding variants on a single, uniform scale.
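The relationship between rank and Phred-scaled score can be written as -10 × log10(rank / total), where rank 1 is the most deleterious raw score. The short sketch below applies that formula and its inverse; the function names are hypothetical and the totals used in the test values are illustrative, not the real number of scored variants.

```python
import math

def phred_scaled(rank, total):
    """Phred-like score from a variant's rank (1 = most deleterious) among `total` variants."""
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return -10.0 * math.log10(rank / total)

def top_fraction(phred):
    """Fraction of all scored variants ranked at or above a given Phred-scaled score."""
    return 10.0 ** (-phred / 10.0)

# A variant ranked 1,000th out of 10,000 sits at the top-10% boundary.
boundary_score = phred_scaled(1_000, 10_000)

# Fractions implied by the commonly quoted scores of 10, 20, and 30.
fractions = {score: top_fraction(score) for score in (10, 20, 30)}
```

Here `boundary_score` evaluates to about 10, and the fractions for scores of 10, 20, and 30 come out at roughly 0.1, 0.01, and 0.001, matching the percentages quoted above.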

Real-World Applications

The main utility of the CADD score lies in prioritizing candidate variants, especially when whole-exome or whole-genome sequencing identifies thousands of genetic differences.

In clinical diagnostics, CADD scores are applied to filter and rank potential causative variants in patients with suspected genetic disorders. When sequencing a patient with an undiagnosed condition, the CADD score helps narrow a list of hundreds of rare variants down to a handful of high-scoring candidates warranting further investigation.
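A simple version of this filtering and ranking step might look like the sketch below. The variant identifiers, scores, and allele frequencies are invented, and the cutoffs (`min_phred`, `max_af`) are illustrative, adjustable parameters rather than recommended values; the allele-frequency filter is an assumption added here to reflect the focus on rare variants.

```python
from typing import NamedTuple

class Variant(NamedTuple):
    variant_id: str          # hypothetical identifier
    cadd_phred: float        # Phred-scaled CADD score
    allele_frequency: float  # population allele frequency

def prioritize(variants, min_phred=20.0, max_af=0.01):
    """Keep rare, high-scoring variants and sort them by descending CADD score."""
    kept = [
        v for v in variants
        if v.cadd_phred >= min_phred and v.allele_frequency <= max_af
    ]
    return sorted(kept, key=lambda v: v.cadd_phred, reverse=True)

# Invented example data, not real patient variants.
candidates = [
    Variant("var_A", 27.4, 0.0001),
    Variant("var_B", 8.2, 0.0002),    # dropped: low CADD score
    Variant("var_C", 33.0, 0.00005),
    Variant("var_D", 24.1, 0.12),     # dropped: too common
]

ranked = prioritize(candidates)
for v in ranked:
    print(v.variant_id, v.cadd_phred)
```

The output is a shortlist for follow-up, not a verdict. In practice the thresholds would be tuned to the study design and the shortlist reviewed alongside other evidence.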

CADD also plays a significant role in large-scale genomic research and population studies. Researchers utilize the scores to filter novel variants identified in massive sequencing projects, such as those focused on cancer genomics or complex trait analysis. Variants associated with complex diseases identified through genome-wide association studies (GWAS) often show higher CADD scores than background variants, allowing scientists to pinpoint promising functional candidates.

Understanding Score Limitations

The CADD score is a computational prediction tool, not a definitive diagnostic result, and must be used with caution. The score reflects predicted functional deleteriousness, which does not always equate directly to clinical pathogenicity. A variant predicted to be functionally disruptive might have subtle, tissue-specific effects or be mitigated by other genetic factors.

The model faces inherent difficulties in capturing the complexity of the genome, such as intricate regulatory interactions. Variants affecting regulatory regions can have highly tissue-specific effects that a general genome-wide score may not fully capture. CADD scores are not meant to be used with a single, universal cutoff value to declare a variant pathogenic or benign. Instead, the score functions as supporting evidence that must be considered alongside other clinical, biochemical, and family-based data.