PCA, or principal component analysis, reduces the number of variables in a large dataset while preserving as much of the original information as possible. If you have data with dozens or hundreds of measurements per observation, PCA compresses that into a smaller set of new variables called principal components, each capturing as much of the remaining variation as possible. It’s one of the most widely used techniques in data science, genetics, finance, and image processing.
The Core Idea Behind PCA
Imagine you’ve collected 50 different measurements on each of 1,000 people. Many of those measurements overlap or correlate with each other. Height and arm length, for example, move together. PCA finds these patterns of overlap and creates new, condensed variables that capture what’s really going on in the data without all the redundancy.
Each principal component is a combination of your original variables, weighted to capture the maximum amount of variation. The first principal component explains the single largest source of variation in the dataset. The second captures the next largest source, with one constraint: it must be completely uncorrelated with the first. The third is uncorrelated with both, and so on. By the time you’ve built a handful of these components, you’ve often captured 85% or more of the total information in the original dataset, using far fewer variables.
This is why PCA is called a dimensionality reduction method. You start with a high-dimensional dataset (many columns) and end with a lower-dimensional one (fewer columns) that still tells most of the same story.
How PCA Works Step by Step
You don’t need to understand the linear algebra to use PCA in practice, but knowing the general steps helps you interpret the output and avoid mistakes.
- Center the data. For each variable, subtract the mean so that the data is centered around zero. This ensures PCA focuses on variation rather than raw magnitude.
- Scale the variables. If your variables are on different scales (say, income in dollars and age in years), standardize them first. Otherwise, the variable with the largest numbers will dominate the results simply because its values are bigger.
- Compute the covariance matrix. This matrix captures how every pair of variables moves together. It’s the mathematical summary of all the relationships in your data.
- Extract eigenvectors and eigenvalues. The eigenvectors define the directions of the new principal components. The eigenvalues tell you how much variance each component captures. Larger eigenvalues mean more important components.
- Project the data. Multiply your original data by the top eigenvectors to get the new, reduced dataset expressed in terms of principal components.
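The steps above can be sketched directly in NumPy. This is a minimal illustration on made-up random data, not a production implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))          # made-up data: 1,000 observations, 5 variables

# Steps 1-2: center each variable, then scale to unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 3: covariance matrix of the standardized data
cov = np.cov(X_std, rowvar=False)

# Step 4: eigenvectors give the directions, eigenvalues the variance captured
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]       # sort components by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 5: project onto the top 2 components
X_reduced = X_std @ eigvecs[:, :2]
print(X_reduced.shape)                  # (1000, 2)
```

Note that the two resulting columns are uncorrelated with each other, exactly as described above.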
Most software handles all of this in a single function call. In Python’s scikit-learn, for instance, it takes about three lines of code.
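For example, on made-up data the scikit-learn version looks roughly like this:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))             # made-up data: 1,000 people, 50 measurements

X_std = StandardScaler().fit_transform(X)   # scale the variables...
pca = PCA(n_components=5)                   # ...pick how many components to keep...
X_reduced = pca.fit_transform(X_std)        # ...then fit and project in one call

print(X_reduced.shape)                      # (1000, 5)
```

The fitted `pca` object also exposes `explained_variance_ratio_`, which matters for the next section.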
Deciding How Many Components to Keep
PCA produces as many components as you have original variables, but the whole point is to keep only a few. The question is: how few?
Each component comes with a variance ratio, which tells you the fraction of total variation it explains. The first component might explain 40% of the variance, the second 20%, the third 10%, and so on. Across all components, these ratios add up to 1 (100%). A common threshold is to keep enough components to explain at least 85% of the total variance, though this depends on your use case. In one published example, a dataset with 90 original variables was compressed down to 58 components while still retaining 85% of the information.
The most popular visual tool for making this decision is the scree plot, a simple graph that shows each component’s eigenvalue in decreasing order. You look for the “elbow,” the point where the curve flattens out and additional components stop adding meaningful information. Everything before the elbow is signal worth keeping. Everything after is mostly noise.
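The 85% threshold is easy to apply programmatically. Here is a sketch on made-up data driven by three hidden factors, so only a few components carry real signal:

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up data: 3 hidden factors plus a little noise across 10 variables
rng = np.random.default_rng(1)
factors = rng.normal(size=(500, 3))
loadings = rng.normal(size=(3, 10))
X = factors @ loadings + rng.normal(scale=0.1, size=(500, 10))

pca = PCA().fit(X)                      # keep every component so we can inspect them
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components explaining at least 85% of the variance
n_keep = int(np.searchsorted(cumulative, 0.85) + 1)
```

scikit-learn can also do this in one step: passing a fraction such as `PCA(n_components=0.85)` keeps just enough components to cross that variance threshold.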
What PCA Assumes About Your Data
PCA makes one strict assumption: the relationships between your variables are linear. It works by finding linear (straight-line) combinations of variables that explain the most spread. If the true structure of your data is curved or circular, PCA will miss it. Techniques like kernel PCA or t-SNE exist for those nonlinear cases.
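To see the linearity limitation concretely, here is a sketch using scikit-learn's two-concentric-circles toy dataset, a standard illustrative example rather than real data:

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric circles: the structure is circular, not linear
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Linear PCA can only rotate the data, so the two circles stay entangled
X_lin = PCA(n_components=2).fit_transform(X)

# Kernel PCA with an RBF kernel can "unroll" the circular structure
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)
```

Plotting `X_lin` versus `X_kpca` colored by `y` makes the difference visible: the linear projection preserves the rings, while the kernel projection separates them.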
PCA also works best when your data roughly follows a bell-curve distribution, because it relies entirely on means and variances to summarize the data. Extreme outliers can distort the results significantly, pulling principal components toward rare, unusual observations rather than the main patterns. Checking for and handling outliers before running PCA is worth the effort.
Common Uses for PCA
In genetics, PCA is considered the gold standard for visualizing population structure. Researchers run PCA on genome-wide data and plot individuals on a scatter plot using the first two or three components. People with shared ancestry cluster together, which helps identify population groups, detect outliers, and adjust for ancestry differences in disease studies. Nearly every large genome-wide association study uses PCA or a PCA-based tool as a standard step.
In image processing, each pixel of an image is a variable. A single photograph might contain millions of pixels, making storage and computation expensive. PCA compresses images by keeping only the components that capture the most visual information, discarding subtle pixel-level variation that the human eye wouldn’t notice. The same principle applies to facial recognition systems, where PCA reduces a face image to a compact set of features (sometimes called “eigenfaces”) that can be compared efficiently.
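A small sketch of this idea, using scikit-learn's bundled 8×8 digit images (each image is just 64 pixel variables, chosen here purely for illustration):

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data                  # 1,797 images, each 8x8 = 64 pixel variables

# Compress 64 pixel variables down to 16 components...
pca = PCA(n_components=16).fit(X)
X_compressed = pca.transform(X)

# ...then reconstruct approximate images from the compressed representation
X_restored = pca.inverse_transform(X_compressed)
retained = pca.explained_variance_ratio_.sum()   # fraction of variance kept
```

The reconstructed images are visually close to the originals even though three quarters of the variables were discarded; `inverse_transform` is what makes PCA usable as a lossy compression step.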
In finance, portfolio managers use PCA to identify the underlying factors driving returns across dozens or hundreds of stocks. Rather than tracking each stock individually, PCA might reveal that three or four hidden factors (often interpretable as market-wide risk, sector effects, and interest rate sensitivity) explain most of the movement. This simplifies risk modeling and helps isolate which exposures actually matter.
What PCA Does Not Do
PCA is not a predictive model. It doesn’t tell you which variables cause an outcome or which group a new observation belongs to. It’s a preprocessing and exploration tool. You might use PCA to simplify your data before feeding it into a regression or classification model, but PCA itself only reorganizes information.
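That division of labor is easiest to see in a pipeline, where PCA is one preprocessing step and a separate classifier does the predicting. A sketch using scikit-learn's bundled breast cancer dataset (chosen only for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)      # 30 measurements per tumor sample

# PCA only reorganizes the inputs; the logistic regression does the predicting
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),
    LogisticRegression(max_iter=1000),
)
score = cross_val_score(model, X, y, cv=5).mean()
```

Wrapping the steps in a pipeline also ensures the scaling and PCA are refit on each training fold, avoiding leakage from the test data.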
The new principal components are also harder to interpret than the original variables. Your first component might be “35% height, 28% weight, 20% arm length, and 17% shoe size.” That’s mathematically precise but doesn’t have a clean, intuitive label. In some fields, researchers spend considerable effort trying to interpret what each component represents in practical terms. In others, interpretation isn’t the goal. The reduced dataset is simply a cleaner, smaller input for the next analysis step.

