The linear correlation coefficient, most commonly called Pearson’s r, is a single number between -1 and +1 that measures how closely two variables follow a straight-line pattern. A value of +1 means the two variables move perfectly together in the same direction, -1 means they move perfectly in opposite directions, and 0 means there’s no linear relationship at all. It’s one of the most widely used tools in statistics, showing up in fields from medicine to economics to psychology.
What the Number Actually Tells You
The correlation coefficient answers a specific question: when one variable goes up, does the other tend to go up, go down, or do nothing predictable? The key word is “tend.” Real data is messy, so Pearson’s r captures the overall trend rather than requiring every single data point to fall on a perfect line.
A positive r means both variables rise together. Think height and weight: taller people generally weigh more, so those two measurements have a positive correlation. A negative r means one variable falls as the other rises. Hours of TV watched per day and physical fitness scores, for example, tend to move in opposite directions. When r sits near zero, knowing one variable tells you nothing useful about the other.
One useful property of this measure is that it has no units. If you’re comparing height in inches to weight in pounds, the inches and pounds cancel out during the calculation. That means you can directly compare the correlation between height and weight to the correlation between, say, study hours and test scores, even though the underlying measurements are completely different.
How Strong Is “Strong”?
Not all correlations carry the same weight, and researchers use rough benchmarks to describe them. The most widely cited guidelines come from the statistician Jacob Cohen, who proposed that a Pearson’s r of 0.10 represents a small effect, 0.30 a medium effect, and 0.50 a large effect. In practice, though, what counts as “large” depends heavily on the field. A later review of over 1,100 published effect sizes found that the actual 25th, 50th, and 75th percentiles for Pearson’s r were 0.12, 0.20, and 0.32, meaning most real-world correlations are smaller than people expect.
Here’s a practical way to think about strength:
- r = 0.10 to 0.30: A small relationship exists, but it would be hard to spot just by looking at the data. Many points scatter widely around any trend line.
- r = 0.30 to 0.50: A moderate relationship. You’d start to see a visible pattern on a scatter plot, though plenty of individual points still stray from the line.
- r = 0.50 to 1.0: A strong relationship. The data points cluster more tightly around a line, and you can make reasonably useful predictions about one variable from the other.
These same ranges apply to negative correlations. An r of -0.60 is just as strong as +0.60; the sign only tells you the direction, not the strength.
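Those benchmark bands translate directly into code. This is a minimal sketch using the rough ranges above (remember they are conventions, not hard rules, and field-specific norms vary); the function name is ours, not a standard API:

```python
def describe_strength(r):
    """Rough verbal label for a correlation, using common benchmark bands."""
    magnitude = abs(r)  # the sign is direction only, not strength
    if magnitude < 0.10:
        return "negligible"
    elif magnitude < 0.30:
        return "small"
    elif magnitude < 0.50:
        return "moderate"
    else:
        return "strong"

print(describe_strength(-0.60))  # strong: same magnitude as +0.60
print(describe_strength(0.20))   # small
```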
From r to Shared Variance
Squaring the correlation coefficient gives you something called the coefficient of determination, written as r². This number tells you the proportion of variation in one variable that’s explained by the other. If the correlation between study time and exam scores is 0.70, then r² is 0.49, meaning about 49% of the variation in exam scores can be accounted for by differences in study time. The remaining 51% comes from other factors.
This conversion is important because it puts the correlation into more concrete terms. A correlation of 0.30 sounds moderate, but squaring it reveals that only 9% of the variation is shared. That’s a useful reality check when you’re evaluating how meaningful a relationship really is.
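The conversion itself is just squaring, which makes the reality check easy to automate:

```python
# Converting r to shared variance: square it.
for r in (0.70, 0.30):
    r_squared = r ** 2
    print(f"r = {r:.2f} -> r^2 = {r_squared:.2f} "
          f"({r_squared:.0%} of variance shared)")
# r = 0.70 -> r^2 = 0.49 (49% of variance shared)
# r = 0.30 -> r^2 = 0.09 (9% of variance shared)
```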
How the Calculation Works
You don’t need to compute Pearson’s r by hand (software handles it instantly), but understanding the logic helps you grasp what it’s actually measuring. The formula looks at each data point’s distance from the average of its variable. For every pair of observations, it multiplies how far the x value is from the x average by how far the y value is from the y average. Those products are summed and then divided by a scaling factor, built from how spread out each variable is around its own average, that keeps the result between -1 and +1.
When high x values tend to pair with high y values (both above their averages), those products are positive, pushing r toward +1. When high x values pair with low y values, the products are negative, pushing r toward -1. When there’s no consistent pairing, the positive and negative products roughly cancel out, landing r near zero.
What It Can’t Detect
The linear correlation coefficient has one major blind spot: it only measures straight-line relationships. Two variables can be strongly related in a curved pattern and still produce an r near zero. Imagine plotting the relationship between anxiety level and performance. At low anxiety, performance is low; at moderate anxiety, performance peaks; at high anxiety, performance drops again. That’s a clear, meaningful inverted-U pattern, but Pearson’s r would see the rise and the fall canceling each other out and report little to no correlation.
Research in neuroscience has highlighted this limitation. Brain activity patterns often involve complex, nonlinear relationships, and studies relying solely on Pearson’s r can miss deeper connections between brain regions or between brain function and behavior. The takeaway for any field: a low correlation coefficient doesn’t necessarily mean two variables are unrelated. It means they aren’t related in a straight line.
Outliers Can Distort the Picture
Pearson’s r is sensitive to extreme data points. A single outlier, one observation that sits far from the rest of the data, can dramatically shift the correlation. In one instructive example from Penn State’s regression analysis course, removing a single extreme point changed the r² value from 55% to 97%. That’s the difference between concluding a relationship is moderate and concluding it’s nearly perfect, all because of one data point.
This sensitivity means it’s always worth plotting your data on a scatter chart before trusting the number. A scatter plot makes outliers visible immediately. If one or two points are pulling the correlation in an unusual direction, you’ll see them sitting far from the main cluster.
Correlation Does Not Mean Causation
This is the most repeated warning in statistics, and it’s repeated for good reason. A strong correlation between two variables doesn’t tell you that one causes the other. Three common traps explain why.
The first is coincidence through shared timing. Rising rates of breast cancer over a decade might correlate strongly with rising rates of joint replacement surgery simply because both increased during the same period. Neither caused the other; they just happened to trend upward together.
The second trap is reverse causality. Alcohol consumption and surgical complications provide a good example. People in poor health often stop drinking before surgery, which can make it look like non-drinkers have worse outcomes than moderate drinkers. The correlation is real, but the causal arrow points in an unexpected direction: poor health drives both the abstinence and the complications.
The third trap is hidden variables. Two things can correlate because a third, unmeasured factor influences both of them. A study might find that patients receiving a more aggressive treatment have worse outcomes, not because the treatment fails, but because doctors reserve aggressive treatment for the most severe cases. The severity of the condition is the hidden variable driving both the treatment choice and the outcome.
When to Use It
Pearson’s r works best when your data meets a few conditions. Both variables should be measured on a numerical scale (things like temperature, income, test scores) rather than categories (like favorite color or yes/no responses). The relationship between them should be roughly linear, meaning a scatter plot shows a general straight-line trend rather than a curve. And the spread of data points should be fairly consistent across the range, not fanning out dramatically at one end.
If your data involves ranked categories (like survey responses on a scale from “strongly disagree” to “strongly agree”), a variation called Spearman’s rank correlation is more appropriate. If the relationship is clearly curved, other statistical tools will capture the pattern that Pearson’s r would miss. Choosing the right type of correlation starts with looking at your data and asking whether a straight line is a reasonable summary of the trend you see.

