A correlation coefficient is a number between -1 and +1 that measures the strength and direction of a relationship between two variables. A value of +1 means the two variables move perfectly together, -1 means they move in perfectly opposite directions, and 0 means no linear relationship exists. It’s one of the most widely used tools in statistics, showing up in everything from medical research to economics to psychology.
How It Works
The most common version, the Pearson correlation coefficient (written as r), is essentially a normalized measure of how two variables change together. It takes the covariance of two variables (a raw measure of how they co-vary) and divides it by the product of their individual standard deviations. That division is what forces the result into the tidy -1 to +1 range, making it easy to compare across different datasets regardless of scale. You could correlate height in centimeters with weight in pounds, and the coefficient would still land on that same scale.
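That covariance-over-standard-deviations recipe can be sketched in a few lines of Python. The function name and the sample values below are purely illustrative, not from any real dataset:

```python
# Minimal sketch of Pearson's r from its definition:
# covariance divided by the product of the standard deviations.
import statistics

def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    # Sample covariance: average co-deviation from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    # Dividing by both standard deviations forces the result into [-1, +1]
    return cov / (statistics.stdev(xs) * statistics.stdev(ys))

# Mixed units on purpose: the coefficient is scale-free
heights_cm = [150, 160, 170, 180, 190]
weights_lb = [110, 130, 150, 170, 190]
print(pearson_r(heights_cm, weights_lb))  # perfectly linear data, so r = 1.0
```

Because the units cancel in the division, rescaling either variable (say, converting pounds to kilograms) leaves r unchanged.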
The sign tells you direction. A positive correlation means both variables tend to increase together: as one goes up, the other does too. A negative correlation means they move in opposite directions, so when one increases, the other tends to decrease. The absolute size of the number tells you strength. A correlation of -0.8 represents a stronger relationship than +0.3, even though it’s negative.
Interpreting the Strength
There’s no single universal scale for labeling a correlation as “weak” or “strong,” because context matters. That said, a widely used set of benchmarks comes from the behavioral sciences: an r of 0.10 is considered small, 0.30 is medium, and 0.50 is large. These thresholds were proposed by the statistician Jacob Cohen and are still the default reference point in many fields.
Other disciplines draw the lines differently. In psychology, correlations above 0.7 are typically called strong, while values between 0.4 and 0.6 are moderate and anything below 0.3 is weak. In medicine, the bar tends to be higher: a correlation of 0.5 might only be labeled “fair,” and values need to reach 0.8 or above to be considered very strong. The takeaway is that a correlation of 0.4 could be impressive in one field and unremarkable in another, so always consider the context.
R-Squared: The Practical Companion
If you square the correlation coefficient, you get the coefficient of determination, written as r². This number tells you the proportion of variation in one variable that’s explained by the other. It’s often more intuitive than r alone because it translates directly into a percentage.
For example, if the correlation between the number of stories in a building and its height is strong enough to produce an r² of 0.904, that means 90.4% of the variation in building height is explained by story count. Meanwhile, a study of student height and GPA found an r² of just 0.003, meaning height explains only 0.3% of the variation in grades. Squaring the coefficient is a quick reality check: a correlation of 0.5 sounds respectable, but it means only 25% of the variation is explained, leaving 75% unaccounted for.
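The reality check from the last sentence is one line of arithmetic, shown here with the illustrative r of 0.5:

```python
# Squaring r gives the share of variation explained
r = 0.5
r_squared = r ** 2
print(f"explained: {r_squared:.0%}, unexplained: {1 - r_squared:.0%}")
```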
Types of Correlation Coefficients
Pearson’s r is the default, but it’s designed specifically for continuous data with a roughly linear relationship and no extreme outliers. When your data doesn’t meet those conditions, other options work better.
- Spearman’s rho measures the strength of a monotonic relationship, meaning both variables consistently move in the same direction but not necessarily at a constant rate. It works with continuous or ordinal data (like survey responses on a 1-to-5 scale) and handles outliers better than Pearson’s. If the relationship between your variables is curved but still consistently upward or downward, Spearman is the better choice.
- Kendall’s tau is another option for ordinal data. It makes no assumptions about how the data is distributed, making it useful for smaller datasets or rankings.
- Point-biserial correlation is used when one variable is continuous and the other is binary (yes/no, male/female, pass/fail). It’s technically a special case of Pearson’s formula, calculated by coding the binary variable as 0 and 1.
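The rank-based idea behind Spearman's rho can be sketched in plain Python: compute Pearson's r on the ranks of the data instead of the raw values. This is a simplified version that assumes no tied values, with made-up numbers:

```python
# Simplified Spearman's rho: Pearson's r applied to ranks
# (assumes no tied values, for brevity)
def ranks(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]             # curved but strictly increasing

print(pearson(x, y))                # below 1: the curvature costs Pearson's r
print(pearson(ranks(x), ranks(y)))  # 1.0: the ranks agree perfectly
```

The curved-but-monotonic data illustrates the distinction from the list above: Pearson's r is dragged below 1 by the bend, while the rank-based coefficient reports a perfect monotonic relationship.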
Assumptions for Pearson Correlation
Before calculating a Pearson correlation, four conditions need to hold. Both variables should be measured on a continuous scale (things like temperature, weight, or income, not categories). The relationship between them should be approximately linear, which you can check by plotting the data on a scatterplot. There should be no extreme outliers, because even a single far-flung data point can dramatically inflate or deflate the coefficient. And the variables should be roughly normally distributed, meaning their values cluster around a central point rather than being heavily skewed to one side.
When these assumptions aren’t met, the Pearson coefficient can be misleading. If your scatterplot shows a curved relationship or your data has clear outliers, switching to Spearman’s correlation is usually the practical fix.
The Non-Linear Trap
One of the most important limitations to understand is that a Pearson correlation near zero does not always mean “no relationship.” It means no linear relationship. Two variables can be strongly related in a non-linear way and still produce a correlation close to zero. Hand-grip strength, for instance, increases through childhood and adolescence, peaks, then declines with age. If you calculate the Pearson correlation between grip strength and age across a full lifespan, you can get an r of essentially 0, despite the obvious and meaningful relationship. The coefficient misses it because the pattern rises and then falls (an inverted U) rather than following a straight line.
This is why looking at a scatterplot before running any correlation is so valuable. The visual will reveal curves, clusters, or other patterns that a single number can’t capture.
Correlation Does Not Mean Causation
This is the most repeated warning in statistics, and for good reason. A strong correlation between two variables doesn’t tell you that one causes the other. Both might be driven by a third, hidden variable. Ice cream sales and the rate of employees taking vacation days are correlated, but ice cream doesn’t cause people to book time off. Both happen more in summer. This kind of misleading connection is called a spurious correlation.
Even when a causal relationship does exist, correlation alone can’t confirm it or tell you which direction it runs. The presence of cancer is correlated with smoking, and in that case we do know from decades of research that smoking causes lung cancer. But the correlation number by itself can’t distinguish cause from effect or rule out other explanations. Establishing causation requires controlled experiments, longitudinal studies, or other methods that go well beyond computing a single coefficient.
Correlation vs. Agreement
In medical and clinical research, correlation coefficients are sometimes used to compare two measurement tools, like checking whether a new diagnostic test gives similar readings to an established one. Pearson’s r can tell you the results are linearly related, but it can’t tell you the results actually agree. Two thermometers could be perfectly correlated (r = 1.0) even if one consistently reads five degrees higher than the other. They move together perfectly, but they don’t agree.
For that reason, comparing diagnostic tools typically calls for specialized measures like the intraclass correlation coefficient or the concordance correlation coefficient, both designed to assess actual agreement rather than just association. If you’re evaluating whether two tests or two raters produce interchangeable results, a standard Pearson correlation will overstate how well they match.

