How the BLOSUM62 Matrix Is Created and Used

Comparing protein sequences is fundamental in modern biology, allowing researchers to trace common ancestry and explore the deep history of life. Amino acid sequences change over time due to genetic mutations, but functional proteins retain similarities reflecting their shared evolutionary past. To measure this similarity, bioinformatics relies on substitution matrices, of which BLOSUM62 is the most widely used. This 20×20 table provides a score for every possible pairing of the 20 standard amino acids. Because each score reflects how often one amino acid is found aligned with another in related proteins, compared with how often that would happen by chance, the matrix turns a simple character comparison into a biologically meaningful estimate of evolutionary relatedness.

The Purpose of Substitution Matrices

Using a basic scoring system, such as awarding +1 for a match and -1 for a mismatch, fails to capture the realities of protein evolution. Not all amino acid substitutions have an equal effect on function. For example, replacing the hydrophobic Leucine (L) with the equally hydrophobic Isoleucine (I) is a conservative substitution often tolerated by the protein structure. This type of change is observed frequently in functional proteins.

Conversely, replacing a hydrophobic amino acid with a hydrophilic, charged one (e.g., Tryptophan (W) with Lysine (K)) is a non-conservative substitution. This change is much more likely to disrupt the protein’s shape and function. A simple match/mismatch system would treat both substitutions identically, which is biologically inaccurate. BLOSUM62 addresses this by assigning scores that reflect the observed frequency of these changes across many protein families. In BLOSUM62, the L/I pair scores +2 while the W/K pair scores -3. Substitutions seen more often than chance predicts receive positive scores and rarely seen ones receive negative scores, giving a more realistic measure of sequence similarity from which homology can be inferred.
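The contrast between the two scoring schemes can be sketched in a few lines of Python. This is only an illustration: the dictionary holds a handful of BLOSUM62 entries rather than the full matrix, and the names `BLOSUM62_SUBSET`, `identity_score`, and `blosum62_score` are invented for this example.

```python
# A few entries from BLOSUM62; the full matrix has one score for each
# pairing of the 20 standard amino acids. Names here are illustrative.
BLOSUM62_SUBSET = {
    ("L", "L"): 4,
    ("L", "I"): 2,    # conservative: hydrophobic for hydrophobic
    ("W", "W"): 11,
    ("W", "K"): -3,   # non-conservative: hydrophobic for charged
}


def identity_score(a: str, b: str) -> int:
    """Naive scheme: +1 for a match, -1 for any mismatch."""
    return 1 if a == b else -1


def blosum62_score(a: str, b: str) -> int:
    """Look up a pair in the subset; the matrix is symmetric."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


for pair in [("L", "I"), ("W", "K")]:
    print(pair, identity_score(*pair), blosum62_score(*pair))
```

The naive scheme gives both substitutions the same -1, while the matrix lookup separates them (+2 versus -3).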

Derivation and Meaning of the “62”

The name BLOSUM is an acronym for BLOcks SUbstitution Matrix, indicating the data source. Scientists compiled thousands of conserved, ungapped segments, known as “blocks,” from multiple alignments of related protein families. These blocks represent regions of proteins that have remained highly stable and functionally constrained throughout evolution. The amino acid pairs found in each aligned column of these blocks were counted to build the substitution statistics.

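The number “62” in BLOSUM62 refers to a clustering threshold of 62% identity. Within each block, sequences that are 62% or more identical are grouped together, or “clustered,” and each cluster is treated as a single sequence when substitutions are counted, so its members share one vote instead of each contributing in full. This clustering step reduces the statistical influence of very closely related proteins, which would otherwise skew the data toward recent evolutionary events. Because the counted substitutions come from comparisons between clusters, that is, between sequences less than 62% identical, BLOSUM62 is suited to detecting moderately diverged relatives. The number is a clustering cutoff, not a target level of identity: higher-numbered matrices such as BLOSUM80 suit closer relatives, and lower-numbered ones such as BLOSUM45 suit more distant ones.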
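The clustering step can be sketched as follows. This is a minimal illustration, not the original BLOSUM code: it assumes single-linkage clustering (a segment joins a cluster if it reaches the threshold against any member) and gives each member of a cluster of size n a weight of 1/n. The function names and the four toy segments are invented for the example.

```python
def percent_identity(a: str, b: str) -> float:
    """Fraction of identical positions in two equal-length ungapped segments."""
    assert len(a) == len(b)
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_block(segments, threshold=0.62):
    """Single-linkage clustering of segment indices at the given identity."""
    clusters = []
    for i, seg in enumerate(segments):
        # Existing clusters containing at least one sufficiently similar member.
        linked = [c for c in clusters
                  if any(percent_identity(seg, segments[j]) >= threshold for j in c)]
        merged = [i]
        for c in linked:
            merged.extend(c)
            clusters.remove(c)
        clusters.append(sorted(merged))
    return clusters


def sequence_weights(segments, threshold=0.62):
    """Each cluster counts as one sequence, so members share a weight of 1."""
    weights = [0.0] * len(segments)
    for cluster in cluster_block(segments, threshold):
        for j in cluster:
            weights[j] = 1.0 / len(cluster)
    return weights


# Toy block: the first two segments are 7/8 identical and are clustered.
block = ["ACDEFGHI", "ACDEFGHV", "ACDEWWWW", "KLMNPQRS"]
print(cluster_block(block))     # [[0, 1], [2], [3]]
print(sequence_weights(block))  # [0.5, 0.5, 1.0, 1.0]
```

Raising the threshold leaves more sequences unclustered and so retains more closely related pairs in the counts, which is why higher-numbered matrices reflect shorter evolutionary distances.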

How the Scores Are Calculated

Every score within the BLOSUM62 matrix is calculated as a log-odds ratio, a quantity that compares two probabilities. For any pair of amino acids (A and B), the ratio compares the observed frequency with which A and B are aligned in the conserved protein blocks to the frequency with which they would be expected to align purely by chance. The observed frequency reflects evolutionary history and is derived from the clustered block data. The expected frequency is the product of the individual background frequencies of A and B, doubled when A and B are different amino acids because the pair can occur in either order.

The final score is the base-2 logarithm of this ratio, multiplied by 2 and rounded to the nearest integer, so BLOSUM62 scores are expressed in half-bit units. Working with logarithms means that the scores of individual aligned positions can be added together instead of multiplying probabilities. A positive score means the pairing was observed more frequently than expected by random chance, indicating either a conserved residue or a substitution that evolution tolerates well. For example, the self-match score for Tryptophan (W:W) is +11, reflecting its rarity and importance. A negative score, like the -4 for Aspartic acid (D) paired with Leucine (L), means the substitution was observed less often than expected, suggesting a non-conservative, unfavorable event. Scores near zero suggest the substitution occurs at a frequency close to that expected by chance.
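The calculation can be sketched in Python under stated assumptions. The two-symbol alphabet ("a", "b") and the pair counts below are invented purely to keep the arithmetic visible; they are not real block data. The function name `log_odds_scores` and the `scale` parameter are likewise illustrative, with `scale=2.0` giving half-bit units.

```python
import math
from collections import defaultdict


def log_odds_scores(pair_counts, scale=2.0):
    """Turn unordered pair counts into rounded log-odds scores.

    pair_counts maps (i, j) to the number of times i and j were seen
    aligned; each unordered pair appears once. Zero counts are not
    handled in this sketch.
    """
    total = sum(pair_counts.values())
    observed = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequency of each symbol, derived from the pair frequencies.
    background = defaultdict(float)
    for (i, j), q in observed.items():
        if i == j:
            background[i] += q
        else:
            background[i] += q / 2
            background[j] += q / 2

    scores = {}
    for (i, j), q in observed.items():
        # A mixed pair can arise in two orders, hence the factor of 2.
        expected = background[i] * background[j] * (1 if i == j else 2)
        scores[(i, j)] = round(scale * math.log2(q / expected))
    return scores


toy_counts = {("a", "a"): 60, ("a", "b"): 20, ("b", "b"): 20}
print(log_odds_scores(toy_counts))
# {('a', 'a'): 1, ('a', 'b'): -2, ('b', 'b'): 2}
```

In this toy data the rarer symbol "b" earns the higher self-match score, the same effect that gives Tryptophan its large W:W value.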

Application in Sequence Alignment

The completed BLOSUM62 matrix serves as the scoring engine for widely used sequence comparison tools. Algorithms such as the Basic Local Alignment Search Tool (BLAST) and the Smith-Waterman algorithm use the matrix scores to evaluate the overall similarity between two protein sequences. These algorithms calculate a total score for a candidate alignment by summing the BLOSUM62 score of every aligned pair of amino acids and subtracting penalties for any gaps. The higher the final alignment score, the stronger the evidence that the two sequences are homologous, meaning they share a common ancestor. Homologous proteins often, though not always, have similar biological functions.
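A minimal sketch of the summing step, for an ungapped alignment only, might look like this. The dictionary contains just the BLOSUM62 entries needed for the two toy comparisons, and `pair_score` and `ungapped_alignment_score` are hypothetical helper names. Real tools also search over possible alignments and apply gap penalties, which this sketch omits.

```python
# Only the BLOSUM62 entries needed for the toy sequences below.
BLOSUM62_SUBSET = {
    ("W", "W"): 11,
    ("L", "I"): 2,
    ("K", "R"): 2,
    ("D", "E"): 2,
    ("W", "K"): -3,
    ("D", "L"): -4,
}


def pair_score(a: str, b: str) -> int:
    """Symmetric lookup of one aligned pair."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]


def ungapped_alignment_score(seq1: str, seq2: str) -> int:
    """Sum the matrix score of each aligned position (no gaps allowed)."""
    if len(seq1) != len(seq2):
        raise ValueError("ungapped alignment needs equal-length sequences")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


print(ungapped_alignment_score("WLKD", "WIRE"))  # 11 + 2 + 2 + 2 = 17
print(ungapped_alignment_score("WLKD", "KDWL"))  # -3 - 4 - 3 - 4 = -14
```

The first pair differs at three of four positions yet scores well because every change is conservative, while the second pair, a rearrangement of the same residues, scores strongly negative.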

The matrix is also integral to calculating the statistical significance of an alignment score, typically expressed as an E-value. The E-value estimates the number of alignments with an equal or higher score expected to be found purely by chance in a database of a given size, so smaller values indicate more significant matches. Because the scores are log-odds values, their chance distribution is well understood, which allows researchers to differentiate between an evolutionarily meaningful match and a random coincidence. BLOSUM62 is the default matrix for protein searches in BLAST and many other sequence comparison programs, largely because it performs well across a broad range of evolutionary distances, including the detection of fairly distant relationships.
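As a rough sketch of how a raw score becomes an E-value, the standard Karlin-Altschul approximation is E = K · m · n · e^(-λS), where S is the raw alignment score, m the query length, and n the database length. The λ and K values below are commonly quoted figures for BLOSUM62 with gap-open 11 and gap-extend 1 penalties and should be treated as assumptions here. Real programs also adjust m and n to "effective" lengths, which this example ignores.

```python
import math

# Assumed Karlin-Altschul parameters (commonly quoted for BLOSUM62 with
# gap penalties 11/1); check the values reported by your own search tool.
LAMBDA = 0.267
K = 0.041


def bit_score(raw_score: float, lam: float = LAMBDA, k: float = K) -> float:
    """Normalize a raw score so it is independent of the scoring system."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score: float, query_len: float, db_len: float,
            lam: float = LAMBDA, k: float = K) -> float:
    """Expected number of chance alignments scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Hypothetical search: a 300-residue query against 100 million residues.
for score in (60, 100, 140):
    print(score, round(bit_score(score), 1), e_value(score, 300, 1e8))
```

Because the score sits in the exponent, modest gains in alignment score shrink the E-value by orders of magnitude, while doubling the database size merely doubles it.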