How to Interpret a Principal Component Analysis (PCA) Biplot

Principal Component Analysis (PCA) reduces the complexity of large datasets by transforming many original variables into a smaller set of uncorrelated principal components. The technique finds the axes that capture the maximum amount of variance, so that complex relationships can be visualized. A biplot is a common graphical representation of PCA: it overlays the samples (as points) and the original variables (as vectors) on a single two-dimensional plot, usually spanned by the first two components. Interpreting a biplot involves analyzing both the points and the vectors to understand the underlying structure of the data.

Defining the Principal Components and Variance

The axes of the biplot, typically labeled PC1 and PC2, are the principal components, which are linear combinations of the original variables. These axes are constructed to be orthogonal (perpendicular), and the sample scores along them are uncorrelated. PC1, conventionally drawn as the horizontal axis, captures the largest possible amount of variance in the dataset and so marks the direction of greatest data spread.

PC2 captures the greatest remaining variance, subject to being orthogonal to PC1. Before drawing any conclusions, examine the percentage of variance explained by each component, often listed on the axis labels. The cumulative percentage for PC1 and PC2 indicates how much of the dataset’s total variability is represented in the two-dimensional biplot.

A high cumulative explained variance (a common rule of thumb is above roughly 70%, though no threshold is universal) suggests the two-dimensional plot is a faithful representation of the data structure. If the cumulative explained variance is low, be cautious, because a substantial amount of variation lies outside the visualized plane. This initial assessment provides context for interpreting the positions of the data points and variable vectors.
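As a minimal sketch of where these percentages come from, the snippet below computes the explained variance ratio from the singular values of a centered data matrix. The data are randomly generated purely for illustration, and the variable names are assumptions rather than anything prescribed by a particular PCA package.

```python
import numpy as np

# Hypothetical data: 50 samples, 4 variables; variable 1 is made to track variable 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X[:, 1] += 0.8 * X[:, 0]

# Center each variable, then take the SVD of the centered matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Squared singular values are proportional to the variance along each component.
explained_ratio = s**2 / np.sum(s**2)
cumulative_2d = explained_ratio[:2].sum()

print("Explained variance ratio per component:", np.round(explained_ratio, 3))
print("Share of variance shown in a PC1/PC2 biplot:", round(float(cumulative_2d), 3))
```

The value of cumulative_2d is the figure to check before trusting distances and angles in the plot.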

Decoding the Data Points (Scores)

The individual points plotted on the biplot are the scores, representing the transformed coordinates of the original samples in the principal component space. These points are positioned based on their values for the two principal components. The proximity of two points suggests that the corresponding samples share similar characteristics across the measured variables.

Conversely, samples that are plotted far apart, often forming distinct clusters, represent groups with substantially different feature compositions. The spatial separation between clusters gives a visual indication of how large the differences between sample groups are, to the extent that the two plotted components capture them.

Interpreting clusters provides a powerful exploratory tool, allowing researchers to visually identify hidden groupings that may correspond to different experimental conditions or biological phenotypes. For instance, if samples from a treated group cluster separately from a control group along PC1, it suggests the treatment introduced a major difference in the overall measured features.
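The scores discussed in this section can be sketched as follows, assuming a centered data matrix and an SVD-based PCA. The data are synthetic and the names scores_2d, d_plane, and d_full are illustrative. The snippet also compares one between-sample distance in the plane with the same distance in the full variable space, which shows why closeness in the plot is only suggestive.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))          # hypothetical data: 30 samples, 5 variables
Xc = X - X.mean(axis=0)               # center each variable

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                    # sample coordinates on every component
scores_2d = scores[:, :2]             # the points drawn in a PC1/PC2 biplot

# Scores on different components are uncorrelated (off-diagonal covariance near 0).
cov_2d = np.cov(scores_2d, rowvar=False)

# Distance between samples 0 and 1 in the plane versus in the full space.
d_plane = np.linalg.norm(scores_2d[0] - scores_2d[1])
d_full = np.linalg.norm(Xc[0] - Xc[1])

print("Covariance of PC1 and PC2 scores:", cov_2d[0, 1])
print("Distance in plane:", d_plane, "Distance in full space:", d_full)
```

Because the plane is a projection, d_plane can never exceed d_full; two points that look close in the biplot may still differ along unplotted components.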

Decoding the Variables (Loadings)

The variable arrows, or vectors, extending from the origin are known as loadings, and they show how the original measured features contribute to the principal components. The length of an arrow indicates the strength of the variable’s influence on the two plotted components. A longer vector signifies that the variable has a greater weight and is a stronger driver of the separation observed in the sample scores.

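The direction of the arrows reveals the correlation structure among the original variables. Variables pointing in a similar direction (small angle) are positively correlated. Conversely, if two variable arrows point in opposite directions (near 180 degrees), they are negatively correlated. These angle readings are approximations: they depend on how the biplot was scaled and are most trustworthy for long arrows, that is, for variables well represented in the plotted plane.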

A right angle (90 degrees) between two variable vectors indicates that those two variables are approximately uncorrelated, at least within the variation captured by the two plotted components. Examining the variable vectors alone maps out the internal relationships among the measured features. The coordinates of the arrowhead along each PC axis also show the variable’s contribution to that component; an arrow lying nearly parallel to PC1, for example, contributes mainly to PC1.
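The link between angles and correlations can be checked numerically. The sketch below assumes standardized variables and loadings scaled by the square root of each component’s variance (one common biplot scaling; other software may scale differently). The four synthetic variables and the helper cos_angle are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
base = rng.normal(size=n)
# Hypothetical variables: 0 and 1 share a signal, 2 opposes it, 3 is unrelated.
X = np.column_stack([
    base + 0.3 * rng.normal(size=n),
    base + 0.3 * rng.normal(size=n),
    -base + 0.3 * rng.normal(size=n),
    rng.normal(size=n),
])

# Standardize, then scale each eigenvector by its component's standard deviation.
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
loadings = Vt.T * (s / np.sqrt(n - 1))   # rows = variables, columns = components
arrows = loadings[:, :2]                 # the vectors drawn in the biplot

def cos_angle(u, v):
    """Cosine of the angle between two arrows."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

corr = np.corrcoef(X, rowvar=False)
print("corr(0,1):", corr[0, 1], "cos angle:", cos_angle(arrows[0], arrows[1]))
print("corr(0,2):", corr[0, 2], "cos angle:", cos_angle(arrows[0], arrows[2]))
print("Arrow lengths:", np.linalg.norm(arrows, axis=1))
```

With all components retained, loadings @ loadings.T reproduces the correlation matrix exactly; keeping only two columns is what makes the angle reading an approximation. Under this scaling, an arrow length close to 1 means the variable is well represented in the plane.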

Integrating Scores and Loadings

Combining the information from the sample scores and the variable loadings is the most informative step in biplot analysis. The relationship between a specific sample point and a specific variable vector is determined by the rule of projection. To assess a sample’s value for a given variable, one must mentally project the sample point perpendicularly onto the variable vector.

A sample point that projects far along a variable vector, toward or beyond its head, has a high (above-average) value for that feature. A sample that projects near the origin has a value close to the variable’s mean, because the data are centered before PCA. A sample whose projection falls on the opposite side of the origin, along the backward extension of the arrow, has a low (below-average) value for that variable.
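The projection rule corresponds to an inner product between a sample’s plotted coordinates and a variable’s arrow. The following is a sketch under the assumption that points are plotted as raw scores and arrows as unscaled eigenvector entries; the data and the names scores_2d, directions_2d, and approx are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 4))            # hypothetical data: 40 samples, 4 variables
X[:, 2] += X[:, 0]                      # make variable 2 depend on variable 0
Xc = X - X.mean(axis=0)                 # centered values (0 = variable mean)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores_2d = Xc @ Vt[:2].T               # sample points in the PC1/PC2 plane
directions_2d = Vt[:2].T                # one arrow per variable (rows)

# Inner product of point i with arrow j approximates the centered value Xc[i, j].
approx = scores_2d @ directions_2d.T
residual = np.sum((Xc - approx) ** 2)   # what the plane fails to reproduce

i, j = 0, 2
print("Centered value:", Xc[i, j], "Biplot estimate:", approx[i, j])
print("Total squared error of the 2D reading:", residual)
```

A positive estimate means the sample lies on the arrow’s side of the origin (above the mean), and a negative one means it lies on the opposite side. The residual equals the sum of the squared singular values of the unplotted components, which ties the accuracy of these readings back to the explained variance.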

Analyzing the quadrants of the biplot allows for a clear synthesis of the data structure. Samples located in the upper-right quadrant are characterized by high values in the variables whose vectors point in that direction. If vectors for “Gene A” and “Protein B” point into the upper-right, then samples in that quadrant express high levels of both.

This integrated interpretation helps explain why certain samples cluster together and suggests which measured features underlie the differences observed. The analysis reveals which specific combination of features is responsible for separating one group of samples from another.

Evaluating the Biplot Reliability

The biplot’s reliability is tied directly to the amount of total variance captured by the two plotted components. If the cumulative explained variance of PC1 and PC2 is low (for example, below roughly 60%, although any such cutoff is a rule of thumb), the two-dimensional representation may be misleading because significant patterns exist in unplotted dimensions. The true distances between samples and the actual correlations between variables may be poorly represented.

A diagnostic tool called a scree plot is often consulted to determine the appropriate number of components to retain. The scree plot displays the variance explained by each component in descending order, helping to identify where the explained variance drops sharply. If the variance explained by the third or fourth component remains substantial, relying solely on the PC1 and PC2 biplot may omit important data structure.
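The numbers behind a scree plot can be produced with a few lines. The sketch below uses synthetic data built from three latent factors and prints a text version of the scree plot; the 80% threshold and the name n_keep are illustrative choices, not a recommended standard.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical data: 100 samples, 6 variables driven by 3 latent factors plus noise.
latent = rng.normal(size=(100, 3))
mixing = rng.normal(size=(3, 6))
X = latent @ mixing + 0.2 * rng.normal(size=(100, 6))

Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)   # singular values, largest first
ratio = s**2 / np.sum(s**2)               # variance explained per component
cumulative = np.cumsum(ratio)

threshold = 0.80                          # illustrative cutoff, not a universal rule
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

# Text scree plot: one bar per component, longest first.
for k, (r, c) in enumerate(zip(ratio, cumulative), start=1):
    print(f"PC{k}: {r:6.1%}  cumulative {c:6.1%}  " + "#" * int(round(r * 40)))
print("Components needed to reach", threshold, "->", n_keep)
```

If n_keep comes out larger than 2, a PC1/PC2 biplot alone leaves out structure that the chosen threshold treats as important, and plots of further component pairs are worth inspecting.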