What Is Principal Component Analysis? How It Works

Principal component analysis (PCA) is a statistical technique that takes a dataset with many variables and compresses it into a smaller set of new variables, called principal components, that capture most of the meaningful patterns in the original data. If you have a spreadsheet with 50 columns of measurements, PCA can often distill the important variation down to just two or three components, making the data far easier to visualize, analyze, and understand.

How PCA Works at a High Level

Imagine you measured 20 different health markers on thousands of patients. Many of those markers are correlated with each other: blood pressure tends to move with cholesterol, weight correlates with waist circumference, and so on. PCA exploits those correlations. It looks for the direction through your data where values vary the most, draws a line along that direction, and calls it the first principal component. Then it finds the next direction of highest variation that’s perpendicular (completely uncorrelated) to the first, and calls that the second principal component. This process repeats until you have as many components as original variables, but the key insight is that the first few components usually capture the vast majority of what’s interesting.

Each principal component is a linear combination of the original variables. That means it’s a weighted mix: maybe the first component weights blood pressure at 0.40 and cholesterol at 0.35, with smaller contributions from everything else. These weights tell you which original measurements matter most for that component. The components are ordered by how much variation they explain, so the first component always explains the most, the second explains the next most, and so on.
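The weighted mix can be sketched in a few lines of NumPy. The weights and measurement values below are made up purely for illustration, not taken from any real analysis:

```python
import numpy as np

# Hypothetical standardized measurements for one patient
blood_pressure, cholesterol, glucose = 1.2, 0.8, -0.3
values = np.array([blood_pressure, cholesterol, glucose])

# Hypothetical loadings (weights) for the first principal component
weights = np.array([0.40, 0.35, 0.25])

# The patient's score on this component is just the weighted sum
pc1_score = values @ weights
print(round(pc1_score, 3))  # 0.685
```

One score like this is computed for every observation and every component, which is how a 50-column table becomes a handful of component columns.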

Why You Need to Scale Your Data First

PCA finds the directions of maximum variance, which means it’s sensitive to how your variables are measured. If one column is height in millimeters (ranging from 1,500 to 2,000) and another is weight in kilograms (ranging from 50 to 100), the height column has a much larger numerical spread simply because of its units. PCA would treat height as the dominant source of variation, not because it’s more important, but because the numbers are bigger.

To prevent this, you standardize the data before running PCA. Standardization rescales each variable so it has a mean of 0 and a standard deviation of 1. This puts all variables on equal footing, letting PCA identify genuine patterns rather than artifacts of measurement scales.
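A minimal sketch of standardization, using made-up height and weight values (scikit-learn's StandardScaler does the same thing, but it is just two NumPy operations):

```python
import numpy as np

# Toy data: height in millimeters, weight in kilograms (made-up values)
X = np.array([[1600.0, 60.0],
              [1750.0, 72.0],
              [1900.0, 95.0]])

# Standardize: subtract each column's mean, divide by its standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Each column now has mean 0 and standard deviation 1,
# so height's larger raw numbers no longer dominate the variance
print(X_std.mean(axis=0))
print(X_std.std(axis=0))
```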

The Steps in Practice

Running PCA follows a consistent sequence. First, you organize your data into a table where rows are observations (patients, samples, time points) and columns are variables (measurements, features). Next, you subtract the mean of each column so the data is centered at zero, and typically standardize as described above.

From there, PCA computes the covariance matrix, which summarizes how every pair of variables moves together. The core mathematical step is extracting the eigenvectors and eigenvalues of this covariance matrix. In plain terms, each eigenvector points in a direction of variation in the data, and its corresponding eigenvalue tells you how much variation lies along that direction. The eigenvector with the largest eigenvalue becomes the first principal component, the next largest becomes the second, and so on. Finally, the original data gets projected onto these new directions, giving you a transformed dataset with the same observations but new, uncorrelated variables.
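The whole sequence above can be implemented directly with NumPy. This is a from-scratch sketch on synthetic data, not production code (in practice you would reach for a library implementation such as scikit-learn's PCA):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 observations, two correlated variables plus one independent
base = rng.normal(size=(200, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(200, 1)),
               base + 0.1 * rng.normal(size=(200, 1)),
               rng.normal(size=(200, 1))])

# 1. Center and standardize each column
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix: how every pair of variables moves together
cov = np.cov(X_std, rowvar=False)

# 3. Eigenvectors and eigenvalues (eigh handles symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort from largest eigenvalue to smallest
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 5. Project the data onto the new directions
scores = X_std @ eigenvectors

# The columns of `scores` are uncorrelated, and each column's
# variance equals the corresponding eigenvalue
print(eigenvalues)
```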

Choosing How Many Components to Keep

You rarely keep all the components. The whole point is to reduce complexity, so you select the first few that explain enough of the total variation for your purposes. Two tools help with this decision.

A scree plot graphs each component’s eigenvalue in order. You look for the “elbow” where eigenvalues drop off and start leveling into a flat line. Components before the elbow carry meaningful signal; those after it mostly capture noise. In one common pattern, the scree plot shows a clear bend after the third component, suggesting three components are sufficient.

Cumulative explained variance is even more direct. It tells you, as a percentage, how much total variation is captured by the first component, the first two together, the first three, and so on. If you’re using PCA for a quick summary or visualization, 80% of variance explained is often enough. If you need the reduced data for further statistical modeling, aiming for 90% or higher is more appropriate. The right threshold depends on what you’re doing with the results.
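Both tools come straight from the eigenvalues. The values below are hypothetical, chosen only to show the arithmetic:

```python
import numpy as np

# Hypothetical eigenvalues from a PCA, already sorted largest-first
eigenvalues = np.array([4.2, 2.1, 1.1, 0.3, 0.2, 0.1])

# Fraction of total variance explained by each component (scree plot heights)
explained = eigenvalues / eigenvalues.sum()

# Running total: variance explained by the first k components together
cumulative = np.cumsum(explained)

# Smallest number of components that reaches a chosen threshold, e.g. 90%
k = int(np.searchsorted(cumulative, 0.90) + 1)
print(cumulative.round(3), k)  # [0.525 0.788 0.925 0.963 0.988 1.   ] 3
```

Here three components clear the 90% bar, matching the elbow-after-three pattern described above.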

Reading a PCA Plot

The most common PCA visualization is a scatter plot where each point represents one observation, plotted on the first two principal components as axes. Points that cluster together are similar across the original variables; points far apart are different. This kind of plot can instantly reveal groups in your data that weren’t obvious when staring at a 50-column spreadsheet.
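Producing the coordinates for such a plot is a two-column projection. This sketch builds two made-up groups of observations and projects them onto the top two components; passing the resulting columns to matplotlib's scatter function would render the actual plot:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two made-up groups of observations in 5 dimensions
group_a = rng.normal(loc=0.0, size=(50, 5))
group_b = rng.normal(loc=3.0, size=(50, 5))
X = np.vstack([group_a, group_b])

# Standardize, then take the two eigenvectors with the largest eigenvalues
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_std, rowvar=False))
top_two = eigenvectors[:, np.argsort(eigenvalues)[::-1][:2]]

# Each observation becomes one (x, y) point in the PCA scatter plot
coords = X_std @ top_two
print(coords.shape)  # (100, 2)
```

In this toy example the two groups land far apart along the first component, which is exactly the kind of clustering the plot is meant to reveal.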

A biplot adds another layer. It shows both the observations (as points) and the original variables (as arrows). The direction and length of each arrow tell you how that variable relates to the components. Two arrows pointing in roughly the same direction indicate correlated variables. Arrows pointing in opposite directions indicate a negative correlation. The cosine of the angle between two arrows approximates their correlation: a small angle means strong positive correlation, a right angle means no correlation at all. The length of each arrow reflects how well that variable is represented in the plot.
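The angle-correlation link in a biplot is only an approximation, but it rests on an exact fact: for mean-centered data vectors, the cosine of the angle between them equals their Pearson correlation. A small demonstration on made-up data:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two made-up variables with a strong positive relationship
x = rng.normal(size=200)
y = 0.9 * x + 0.4 * rng.normal(size=200)

# Center both, then take the cosine of the angle between the vectors
xc, yc = x - x.mean(), y - y.mean()
cosine = xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc))

# For centered vectors, this cosine IS the Pearson correlation
correlation = np.corrcoef(x, y)[0, 1]
print(round(cosine, 6), round(correlation, 6))
```

The biplot arrows are low-dimensional shadows of these full data vectors, which is why the correspondence there is approximate rather than exact.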

Real-World Applications

PCA shows up across a surprisingly wide range of fields. In genetics, it’s one of the most common tools for understanding population structure. When researchers applied PCA to genomic data from the 1000 Genomes Project, the first two principal components separated individuals by continental ancestry (African, Asian, and European populations), while the third component captured finer structure within those groups. This happens without anyone telling the algorithm about ancestry. PCA simply found that the largest sources of genetic variation correspond to geographic and evolutionary divergence.

Beyond mapping populations, PCA-based genome scans have identified specific genes under natural selection, including well-known targets involved in skin pigmentation and immune defense, along with new candidates and biological pathways related to the innate immune system and fat metabolism. The technique works even when populations blend into each other along a gradient rather than forming neat, separate clusters.

Outside of biology, PCA is used in finance to identify the handful of factors driving movement across hundreds of stocks, in image processing to compress photographs into fewer dimensions, in climate science to extract dominant weather patterns from vast grids of temperature and pressure readings, and in manufacturing to detect which process variables are drifting when quality drops.

Limitations to Keep in Mind

PCA assumes that the relationships between your variables are linear. If two variables have a curved or more complex relationship, PCA can miss it entirely. Techniques like kernel PCA exist to handle nonlinear patterns, but standard PCA won’t catch them.

Outliers can seriously distort the results. Because PCA chases maximum variance, a few extreme data points can pull a principal component toward themselves, skewing the entire analysis. Careful data cleaning before running PCA helps, but it’s worth checking whether your results change dramatically when you remove suspicious outliers.
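This distortion is easy to reproduce with synthetic data. In the sketch below, one extreme point is enough to swing the first principal component from the true axis of variation to an almost perpendicular direction:

```python
import numpy as np

rng = np.random.default_rng(3)

def first_pc(X):
    """Direction of maximum variance: leading eigenvector of the covariance."""
    Xc = X - X.mean(axis=0)
    eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return eigenvectors[:, np.argmax(eigenvalues)]

# Clean data varying mostly along the first axis
X = rng.normal(size=(100, 2)) * np.array([3.0, 0.5])

# The same data plus a single extreme point far off the main trend
X_outlier = np.vstack([X, [[0.0, 40.0]]])

pc_clean = first_pc(X)
pc_dirty = first_pc(X_outlier)

# Cosine near 1 means the same direction; near 0, nearly perpendicular
print(abs(pc_clean @ pc_dirty))
```

One point out of a hundred is enough to rotate the leading component toward itself, which is why checking sensitivity to outliers is worth the effort.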

Interpretability is another common frustration. Since each principal component is a blend of all the original variables, it doesn’t always have an intuitive meaning. You might find that the first component loads heavily on a mix of blood pressure, age, and glucose, but there’s no single label that neatly describes what that combination represents. This is a trade-off: you gain a compact, powerful summary of variation, but you lose the straightforward naming of your original measurements.

Finally, deciding how many components to keep involves some judgment. The scree plot and cumulative variance help, but there’s no single correct answer. Two analysts looking at the same scree plot might reasonably disagree on whether to keep three or four components.

PCA vs. Similar Techniques

PCA is sometimes confused with factor analysis, but the goals differ. PCA aims to explain as much total variance as possible with fewer components. Factor analysis assumes there are hidden, underlying factors causing the correlations you observe, and tries to identify those factors. In practice the results often look similar, but the reasoning behind each method is distinct.

Another relative is t-SNE, a technique used mainly for visualization. While PCA preserves global structure (the overall spread and distances in your data), t-SNE focuses on preserving local neighborhoods, making it better at revealing tight clusters but worse at maintaining the big picture. PCA is also deterministic: you’ll get the same answer every time. t-SNE involves randomness and can produce different-looking plots on each run. For a quick, reliable overview of high-dimensional data, PCA is typically the first tool to reach for.