The Pearson correlation coefficient, often written as “r,” is a number between -1 and +1 that measures how strongly two variables move together in a straight-line pattern. An r of +1 means a perfect positive relationship (as one goes up, the other goes up by a proportional amount), -1 means a perfect negative relationship (as one goes up, the other goes down proportionally), and 0 means no linear relationship at all.
It’s one of the most widely used statistics in research, business, and data analysis. If you’ve ever wondered whether two things are related, and by how much, Pearson’s r is likely the first tool you’d reach for.
How It Works
At its core, Pearson’s r answers a simple question: when one variable increases, does the other tend to increase, decrease, or do nothing predictable? It does this by comparing how far each data point sits from the average of its variable. If both variables tend to be above their averages at the same time and below their averages at the same time, the correlation is positive. If one tends to be high when the other is low, the correlation is negative.
Mathematically, the formula divides the covariance of two variables by the product of their individual standard deviations. Covariance captures how much two variables change together, but its raw value is hard to interpret because it depends on the scale of the data. Dividing by the standard deviations normalizes everything into that clean -1 to +1 range, making comparisons possible regardless of whether you’re looking at height in centimeters or income in dollars.
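To make that formula concrete, here is a minimal pure-Python sketch: covariance divided by the product of the two standard deviations. The function name `pearson_r` is just for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance: average co-movement of the two variables around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    # Standard deviations (population form; the choice of n vs n-1 cancels in the ratio).
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)

# A perfectly proportional relationship gives r = +1, whatever the units.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```

Because the standard deviations carry the same units as the data, the units cancel, which is exactly why r is comparable across centimeters, dollars, or anything else.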
What the Numbers Mean
The sign tells you direction. A positive r means both variables rise and fall together: think height and weight, or hours studied and exam scores. A negative r means they move in opposite directions: as temperature drops, heating bills go up.
The absolute value tells you strength. A widely used set of benchmarks comes from the statistician Jacob Cohen:
- r = 0.10: small correlation
- r = 0.30: medium correlation
- r = 0.50: large correlation
These thresholds are guidelines, not rigid rules. In some fields, the typical correlations are much smaller. A 2016 analysis of individual-differences research in psychology found that small, medium, and large correlations were closer to .11, .19, and .29. In social psychology, the benchmarks landed at .12, .25, and .42. What counts as “strong” depends heavily on context.
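If you want to apply Cohen's cutoffs mechanically, a tiny helper like the following (a hypothetical `cohen_label`, shown only to make the thresholds explicit) does the job; note that it works on the absolute value, since the sign carries direction, not strength.

```python
def cohen_label(r):
    """Rough strength label for |r| using Cohen's conventional benchmarks.
    These cutoffs are guidelines only; field-specific norms can be much lower."""
    strength = abs(r)
    if strength >= 0.50:
        return "large"
    if strength >= 0.30:
        return "medium"
    if strength >= 0.10:
        return "small"
    return "negligible"

print(cohen_label(-0.45))  # "medium" -- the minus sign only indicates direction
```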
What It Looks Like on a Scatterplot
If you plot your two variables on a graph, with one on each axis, each data point becomes a dot. The pattern those dots form tells you a lot. When r is close to +1, the dots cluster tightly around a straight line that slopes upward from left to right. When r is close to -1, the dots cluster around a line that slopes downward. A perfect correlation (r = +1 or -1) means every single dot falls exactly on that line.
When r is near zero, the dots scatter in no particular pattern. They might form a cloud, a circle, or even a clear curve. That last scenario is important: a strong curved relationship between two variables can produce a Pearson r near zero, because Pearson only detects straight-line patterns.
Assumptions You Need to Check
Pearson’s r isn’t appropriate for every dataset. Four conditions need to hold for the result to be meaningful.
First, both variables must be measured on a continuous scale, meaning values like weight, temperature, or income rather than categories like “agree/disagree” or rankings like “1st, 2nd, 3rd.” Second, the relationship between the two variables should be roughly linear. If you plot the data and see a curve, Pearson’s r will understate or misrepresent the actual relationship. Third, the data should be approximately normally distributed, not heavily skewed in one direction. Fourth, there should be no extreme outliers pulling the results in a misleading direction.
That last point deserves emphasis. Even a single extreme data point can dramatically shift a Pearson correlation. Research has shown that in large datasets, outliers can produce false positives, making two unrelated variables appear correlated, and can even flip the sign of a correlation entirely, turning what looks like a positive relationship into a negative one or vice versa.
When to Use Spearman Instead
When your data violate Pearson’s assumptions, the most common alternative is Spearman’s rank correlation. Instead of working with the raw data values, Spearman converts everything to ranks (1st, 2nd, 3rd, and so on) and then calculates the correlation on those ranks.
This makes Spearman a better fit in several situations: when your data are ordinal (like survey responses on a 1-to-5 scale), when the relationship is consistently upward or downward but not a straight line, when distributions are heavily skewed, or when you have stubborn outliers you can’t justify removing. The tradeoff is that Spearman is slightly less powerful than Pearson when the Pearson assumptions are genuinely met, meaning it’s less likely to detect a real relationship in borderline cases.
Correlation, Significance, and Sample Size
A common mistake is treating any nonzero r value as meaningful. In a small sample, random chance alone can produce surprisingly large correlations. This is where statistical significance comes in. A significance test evaluates whether your observed r is large enough, given your sample size, to be unlikely under the assumption that the true correlation is zero.
Sample size plays a huge role here. With 10 data points, you’d need an r of roughly 0.63 to reach statistical significance. With 100 data points, an r of about 0.20 might be significant. With thousands of observations, even a tiny correlation of 0.05 can be statistically significant, even though it explains almost none of the variation in your data. Significance tells you the correlation probably isn’t zero. It doesn’t tell you the correlation is big enough to matter.
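The standard significance test converts r into a t statistic with n - 2 degrees of freedom, t = r * sqrt((n - 2) / (1 - r^2)), which makes the sample-size effect easy to see in code:

```python
import math

def t_stat(r, n):
    """t statistic for testing H0: the true correlation is zero (df = n - 2)."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# The same bar (t of roughly 2 at the 5% level) is cleared by very different r's
# depending on n; with huge samples, tiny correlations become "significant".
print(round(t_stat(0.63, 10), 2))   # ~2.30, near the 5% critical t for df=8 (2.306)
print(round(t_stat(0.20, 100), 2))  # ~2.02, just past the 5% critical t for df=98
print(round(t_stat(0.05, 2000), 2)) # ~2.24, "significant", yet r squared is 0.0025
```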
What Pearson’s r Does Not Tell You
The most important limitation is one you’ve likely heard before: correlation does not equal causation. An r of 0.80 between two variables means they’re strongly associated, but it says nothing about whether one causes the other. Both could be driven by a third variable you haven’t measured, or the relationship could be coincidental.
Pearson’s r also captures only linear relationships. Two variables could have a strong U-shaped or exponential relationship and still produce an r near zero. Always look at a scatterplot before trusting the number. If the data follow a curve, the Pearson coefficient is the wrong summary.
Finally, squaring r gives you the coefficient of determination (r²), which is often more intuitive. An r of 0.50 means r² = 0.25, telling you that 25% of the variation in one variable is accounted for by its relationship with the other. The remaining 75% comes from other factors. This reframing helps put seemingly impressive correlations in perspective.
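A few lines make the reframing vivid: even correlations that sound impressive leave most of the variation unexplained.

```python
# Squaring r turns it into "share of variation accounted for."
for r in (0.30, 0.50, 0.80):
    print(f"r = {r:.2f} -> r^2 = {r * r:.2f} ({r * r:.0%} of variation)")
# r = 0.30 explains 9%, r = 0.50 explains 25%, r = 0.80 explains 64%
```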

