A PCA plot is a scatter plot that takes complex data with many variables and displays it in just two dimensions, making patterns, groupings, and outliers visible at a glance. Each dot on the plot represents one sample or observation from your dataset, and its position reflects how that sample relates to all the others based on every variable measured. PCA stands for Principal Component Analysis, a technique that compresses high-dimensional data down to a handful of new variables called principal components, which capture the most important patterns in the original data.
How PCA Reduces Complex Data
Imagine you have a dataset with thousands of variables per sample. Maybe it’s gene expression data from hundreds of patients, or survey responses across dozens of questions. Analyzing all those variables at once is impractical, and visualizing them is impossible. PCA solves this by creating new variables, the principal components, that are combinations of the originals. These components are ranked by how much of the data’s total variation they capture.
The first principal component (PC1) is the direction through the data that captures the most variance. Think of it as the single best summary of what makes your samples different from each other. The second principal component (PC2) captures the next largest share of variance and is uncorrelated with PC1, meaning it describes a pattern that PC1 has not already accounted for. Together, PC1 and PC2 often capture enough of the overall variation to reveal the main structure in your data. A PCA plot is simply PC1 on the x-axis and PC2 on the y-axis, with each sample plotted as a point.
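To make the idea concrete, here is a minimal sketch of how the two plot coordinates could be computed with NumPy. The dataset is synthetic and purely hypothetical (100 samples, 6 variables built from two hidden factors), and the sketch uses the singular value decomposition of the centered data, which is one standard way to obtain principal components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: 100 samples, 6 variables driven by 2 hidden factors plus noise
factors = rng.normal(size=(100, 2)) * np.array([3.0, 1.5])
mixing = rng.normal(size=(2, 6))
X = factors @ mixing + rng.normal(scale=0.3, size=(100, 6))

# Center each variable, then take the SVD of the centered matrix
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Scores: the coordinates of each sample along each principal component
scores = U * S  # same values as Xc @ Vt.T
pc1, pc2 = scores[:, 0], scores[:, 1]

# PC1 carries at least as much variance as PC2, and the two are uncorrelated
var_pc1 = pc1.var(ddof=1)
var_pc2 = pc2.var(ddof=1)
corr_pc1_pc2 = np.corrcoef(pc1, pc2)[0, 1]
```

Passing `pc1` and `pc2` to any scatter-plot function, such as `matplotlib.pyplot.scatter`, would then draw the PCA plot itself.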
What the Axes and Percentages Mean
You’ll notice that PCA plot axes typically include a percentage label, something like “PC1 (45%)” or “PC2 (18%).” Those percentages tell you how much of the total variation in the dataset each component explains. If PC1 explains 45% and PC2 explains 18%, your two-dimensional plot is capturing 63% of the total variance from what might be thousands of original variables. The higher those percentages, the more faithfully the flat plot represents the full complexity of your data.
When PC1 has a much higher percentage than PC2, the horizontal spread of points on the plot is more meaningful than the vertical spread. If both percentages are similar, both axes carry roughly equal weight in distinguishing your samples.
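The axis percentages come from dividing each component's variance by the total variance. The sketch below shows one way to compute them and build the axis labels. The helper name `explained_variance_ratio` and the random data are assumptions made for illustration.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component."""
    Xc = X - X.mean(axis=0)
    S = np.linalg.svd(Xc, compute_uv=False)
    eigenvalues = S**2 / (X.shape[0] - 1)  # variance along each component
    return eigenvalues / eigenvalues.sum()

rng = np.random.default_rng(1)
# Hypothetical data: 50 samples, 5 variables with different spreads
X = rng.normal(size=(50, 5)) * np.array([4.0, 2.0, 1.0, 1.0, 0.5])

ratio = explained_variance_ratio(X)
xlabel = f"PC1 ({ratio[0]:.0%})"
ylabel = f"PC2 ({ratio[1]:.0%})"

# Share of total variance that the two-dimensional plot represents
plot_coverage = ratio[0] + ratio[1]
```

Adding the first two ratios, as `plot_coverage` does, gives the same kind of figure as the 63% in the example above.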
Reading the Points on a PCA Plot
The core rule for interpreting a PCA plot: points that are close together represent samples with similar profiles, and points far apart represent samples that differ substantially, at least along the patterns that PC1 and PC2 capture. If you’re looking at gene expression data from mice, for example, mice with similar expression profiles will cluster together on the plot. If two clusters separate along PC1, the biggest differences between them are driven by whichever original variables contribute most heavily to that first component.
This is why PCA plots are so useful for spotting groups. You might color-code your points by a known category, like treatment group or tissue type, and immediately see whether samples within each category cluster together. When they do, it suggests that the category drives real, measurable differences in the data. When they don’t, it suggests the category may not be the main source of variation.
Points sitting far from all clusters are potential outliers. These isolated samples may represent measurement errors, contaminated data, or genuinely unusual observations. One rule of thumb flags a sample as an outlier if its statistical leverage (a measure of how much it pulls on the analysis) exceeds three times the median leverage across all samples.
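The leverage rule can be sketched as follows. The text above does not say which definition of leverage it has in mind, so this sketch assumes one common choice: 1/n plus the sum of each sample's squared scores on the plotted components, with each component normalized by its total sum of squares. The data and the planted outlier are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 8))
X[0] += 12.0  # plant one obviously unusual sample (hypothetical)

Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2  # number of components shown on the plot

# Assumed leverage definition: 1/n + sum over components of score^2 / (sum of squared scores).
# Because scores = U * S, that normalized quantity is simply U**2.
leverage = 1.0 / X.shape[0] + (U[:, :k] ** 2).sum(axis=1)

# Flag samples whose leverage exceeds three times the median leverage
threshold = 3.0 * np.median(leverage)
outliers = np.flatnonzero(leverage > threshold)
```

Here the planted sample (index 0) has the highest leverage by a wide margin. A rule like this can also flag a few ordinary samples, so treat flagged points as candidates to inspect rather than samples to delete automatically.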
Loadings and Biplots
A standard PCA plot shows only the samples. A biplot adds another layer: arrows or vectors representing the original variables. Each arrow points in the direction that variable “pulls” the data, and its length reflects how strongly that variable contributes to the components shown. Variables with arrows pointing in similar directions are correlated with each other. Variables pointing in opposite directions are inversely correlated.
The coefficients behind these arrows are called loadings. When they are expressed as correlations between each original variable and a component, they range from -1 to 1. A loading close to -1 or 1 means the variable strongly influences that component. A loading near 0 means it has little effect. If you’re trying to figure out which of your original measurements actually matter for separating your samples, the loading plot or biplot is where you look.
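Software packages differ in how they scale loadings, so the following is only a sketch under one assumption: the data is standardized, and each loading is the eigenvector coefficient multiplied by the square root of its eigenvalue, which makes it equal to the correlation between a variable and a component. The variable names and data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
a = rng.normal(size=n)

# Hypothetical variables: two that track `a`, one that opposes it, one unrelated
X = np.column_stack([
    a + 0.2 * rng.normal(size=n),
    a + 0.2 * rng.normal(size=n),
    -a + 0.2 * rng.normal(size=n),
    rng.normal(size=n),
])
names = ["var_a1", "var_a2", "var_neg", "var_noise"]

# Standardize, then decompose
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
eigenvalues = S**2 / (n - 1)
scores = U * S

# Unit-length eigenvector coefficients (variables x components)
coefficients = Vt.T

# Correlation-scaled loadings: correlation of each variable with each component
loadings = coefficients * np.sqrt(eigenvalues)

pc1_loadings = dict(zip(names, loadings[:, 0]))
```

In `pc1_loadings`, the two correlated variables share a sign, the opposing variable has the opposite sign, and the unrelated variable sits near zero. That is the same pattern the arrows in a biplot would show.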
How PCA Components Are Calculated
Behind the scenes, PCA works by computing eigenvectors and eigenvalues from the covariance matrix of your data. Each eigenvector defines the direction of a principal component, and its corresponding eigenvalue represents the amount of variance along that direction. The eigenvector with the largest eigenvalue becomes PC1, the next largest becomes PC2, and so on. Each sample then gets a “score” on each component based on where it falls along that direction, and those scores become the coordinates on your PCA plot.
The math guarantees that all principal components are perpendicular (orthogonal) to each other, which is why their scores are uncorrelated and they capture distinct patterns rather than redundant ones. A dataset with, say, 20 original variables can produce up to 20 principal components (fewer if it has fewer samples than variables), but often only the first few contain meaningful signal. The rest mostly capture noise.
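The eigenvector route described above can be sketched directly in NumPy. The data is random and hypothetical. `np.linalg.eigh` is used because a covariance matrix is symmetric, and because it returns eigenvalues in ascending order, the sketch reorders them so the largest comes first.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical correlated data: 80 samples, 5 variables
X = rng.normal(size=(80, 5)) @ rng.normal(size=(5, 5))

Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)  # 5 x 5 covariance matrix

# eigh handles symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]  # largest eigenvalue first -> PC1
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # columns are component directions

# Scores: where each sample falls along each direction (the plot coordinates)
scores = Xc @ eigenvectors

# The directions are unit length and mutually orthogonal
gram = eigenvectors.T @ eigenvectors

# The variance of each score column matches its eigenvalue
score_variances = scores.var(axis=0, ddof=1)
```

`gram` comes out as the identity matrix, which is the orthogonality guarantee in numeric form, and `score_variances` matches `eigenvalues` component by component.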
Choosing How Many Components Matter
A scree plot helps you decide how many components are worth examining. It’s a simple line graph with component number on the x-axis and the corresponding eigenvalue (or percentage of variance explained) on the y-axis. The eigenvalues drop as you move to higher-numbered components. You look for the “elbow,” the point where the curve flattens out and each additional component adds little new information. Components before the elbow are typically retained, and everything after is treated as noise.
This choice is somewhat subjective. Some analysts also use a cumulative variance threshold, keeping enough components to explain 80% or 90% of total variation. But for the purpose of making a PCA plot, most people simply use PC1 and PC2, since those capture the largest share of the data’s structure.
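The cumulative variance rule is easy to express in code. In this sketch the helper `components_for_threshold` and the eigenvalues are made up for illustration. They stand in for the values you would read off a scree plot.

```python
import numpy as np

def components_for_threshold(eigenvalues, threshold=0.80):
    """Return (k, cumulative): the smallest number of leading components whose
    cumulative share of variance reaches `threshold`, and the cumulative shares."""
    eigenvalues = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(eigenvalues)), cumulative

# Hypothetical eigenvalues, largest first, as they might appear on a scree plot
eigenvalues = [4.6, 2.4, 1.2, 0.7, 0.5, 0.3, 0.2, 0.1]

k80, cumulative = components_for_threshold(eigenvalues, 0.80)
k90, _ = components_for_threshold(eigenvalues, 0.90)
```

With these made-up numbers, three components pass the 80% mark and five pass 90%, which shows how much the answer can shift with the threshold you pick.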
Why Data Scaling Matters
PCA is sensitive to the scale of your variables. If one variable is measured in thousands and another in decimals, the larger-scaled variable will dominate the first principal component simply because its numbers are bigger, not because it’s more important. To prevent this, data is typically centered (subtracting each variable’s mean) and standardized (dividing by its standard deviation) before running PCA. This puts all variables on equal footing so the analysis reflects genuine patterns rather than measurement units.
Skipping this step is one of the most common mistakes in PCA. If your PCA plot looks like all the variation comes from a single variable, check whether the data was scaled first.
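A small experiment shows the effect. The data below is hypothetical: two small-scale variables share a real signal, and a third, unrelated variable is measured in thousands. The helper `pc1_direction` is a name chosen for this sketch.

```python
import numpy as np

def pc1_direction(X):
    """Unit vector of the first principal component of X (after centering)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

rng = np.random.default_rng(5)
n = 300
shared = rng.normal(size=n)

X = np.column_stack([
    shared + 0.3 * rng.normal(size=n),  # roughly unit scale, carries the shared signal
    shared + 0.3 * rng.normal(size=n),  # roughly unit scale, carries the shared signal
    1000.0 * rng.normal(size=n),        # measured in thousands, unrelated to the others
])

# Without scaling: PC1 points almost entirely along the large-scale variable
raw_pc1 = pc1_direction(X)

# With centering and standardizing: PC1 follows the two variables that share a signal
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
scaled_pc1 = pc1_direction(Z)
```

Comparing `raw_pc1` with `scaled_pc1` shows the large-scale variable taking over the first component only when the data is left unscaled.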
Common Uses for PCA Plots
PCA plots appear across nearly every data-heavy field, but some applications are especially prominent. In genetics, PCA is considered the gold standard for visualizing population structure. Researchers plot individuals based on genome-wide data, and the resulting clusters often map neatly onto geographic ancestry. In one well-known type of analysis, Ashkenazi Jewish populations formed a distinct cluster positioned between European and Middle Eastern groups, suggesting mixed ancestry from both regions.
In genomics and single-cell RNA sequencing, PCA plots help researchers see whether cells or tissue samples group by disease state, cell type, or treatment condition. They’re also used as a quality control step to spot batch effects, where samples cluster by which day they were processed rather than by any biologically meaningful category.
In genome-wide association studies, PCA serves double duty: identifying population structure that could confound results and generating covariates to adjust for that structure statistically. PCA plots are also used in paleogenomics to place ancient DNA samples in the context of modern populations, in forensic genetics for biomarker identification, and in ecology, finance, and image processing for exploratory data analysis of any kind.
The versatility comes from the core function: whenever you have more variables than you can reasonably look at, PCA compresses them into a visual summary that highlights the dominant patterns. The plot itself is just the final step, a window into what the math uncovered.

