A correlation coefficient tells you two things: the direction of a relationship between two variables and how strong that relationship is. The number falls between -1 and +1, where values closer to either end indicate a stronger relationship and values near zero indicate little or no relationship. Understanding what these numbers actually mean, and what they don’t, is the difference between drawing useful conclusions and misleading yourself.
What the Numbers Mean
A positive correlation (r greater than 0) means that as one variable increases, the other tends to increase too. Taller people tend to weigh more, for example. A negative correlation (r less than 0) means the opposite: as one variable goes up, the other tends to go down. More hours spent sitting each day, for instance, correlates with lower cardiovascular fitness. When r equals exactly 0, there's no detectable linear relationship between the variables; as discussed later, a strong non-linear relationship can still produce an r near zero.
The strength of the correlation depends on how close the value is to +1 or -1. A commonly used framework, based on guidelines originally proposed by the psychologist Jacob Cohen, breaks it down like this:
- Small: around ±0.1
- Medium: around ±0.3
- Large: ±0.5 or higher
A general rule of thumb considers correlations above 0.7 or below -0.7 to be strong. So an r of -0.8 represents a strong negative correlation, while an r of 0.3 is positive but weak. These thresholds aren’t rigid laws. In some fields, like psychology, correlations of 0.3 are considered meaningful because human behavior is inherently noisy. In physics or engineering, anything below 0.9 might be considered poor. Context matters.
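To make the definition concrete, here is a minimal sketch of Pearson's r computed directly from its formula: the covariance of the two variables divided by the product of their spreads. The height and weight figures are invented for illustration.

```python
def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Invented example data: height (cm) paired with weight (kg)
heights = [160, 165, 170, 175, 180, 185]
weights = [55, 62, 66, 74, 79, 85]
print(round(pearson_r(heights, weights), 3))  # strongly positive, close to +1
```

The result always lands between -1 and +1 because the covariance can never exceed the product of the two spreads.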
Direction vs. Strength
A mistake people often make is treating negative correlations as weaker than positive ones. The sign only tells you direction. An r of -0.7 is exactly as strong as an r of +0.7. One just describes a relationship where variables move together, and the other describes a relationship where they move in opposite directions. Strength is determined entirely by how far the number is from zero, regardless of whether it’s positive or negative.
R-Squared: How Much Is Actually Explained
One of the most useful things you can do with a correlation coefficient is square it. The result, called R-squared (r²), tells you the proportion of variation in one variable that’s predictable from the other. This is where many people get surprised.
A correlation of 0.5 sounds decent, but when you square it, you get 0.25. That means only 25% of the variation in one variable is explained by the other. The remaining 75% is driven by other factors you haven’t measured. Even a correlation of 0.7, which qualifies as strong, only explains about 49% of the variance. You need a correlation above 0.87 before more than 75% of the variation is shared between the two variables. Squaring the coefficient is one of the fastest ways to check whether a correlation is as impressive as it first appears.
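The arithmetic is simple enough to check directly; a minimal sketch:

```python
from math import sqrt

# Squaring r gives the share of variance one variable explains in the other
for r in (0.3, 0.5, 0.7, 0.87):
    print(f"r = {r:.2f} -> r-squared = {r * r:.4f} ({r * r:.1%} of variance explained)")

# Working backwards: the r needed before 75% of the variance is shared
print(f"r needed for 75% shared variance: {sqrt(0.75):.3f}")  # about 0.866
```

Running the loop makes the asymmetry obvious: r has to climb well past 0.7 before even half the variation is accounted for.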
Statistical Significance Isn’t the Same as Importance
When you see a correlation reported in a study, it usually comes with a p-value. The standard threshold for statistical significance is p less than 0.05, meaning that if no real relationship existed, a correlation at least as large as the one observed would appear by chance less than 5% of the time. Some researchers argue this bar is too lenient and advocate for p less than 0.005 to reduce false positives.
Here’s the critical distinction: a correlation can be statistically significant but practically meaningless. With a large enough sample, even a tiny correlation like r = 0.05 can achieve p less than 0.05. That doesn’t make it useful. A correlation of 0.05 squared explains only 0.25% of the variance. Always look at the size of the correlation, not just whether it passed the significance test.
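To see the sample-size effect concretely, here is a rough sketch using the standard t-statistic for testing a correlation against zero, with the t-distribution approximated by a normal distribution (a reasonable shortcut at the large sample sizes involved):

```python
from math import sqrt
from statistics import NormalDist

def approx_p_value(r, n):
    """Approximate two-sided p-value for H0: no correlation.
    Uses t = r * sqrt((n - 2) / (1 - r^2)), treated as normal for large n."""
    t = r * sqrt((n - 2) / (1 - r * r))
    return 2 * NormalDist().cdf(-abs(t))

r = 0.05  # explains only 0.25% of the variance
for n in (100, 2000, 10000):
    print(f"n = {n:>6}: p = {approx_p_value(r, n):.4g}")
```

With 100 observations the tiny correlation is nowhere near significant; with 10,000 it sails past the 0.05 threshold even though it explains almost nothing.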
Always Look at the Scatter Plot
A correlation coefficient is a single number summary of a relationship, and single numbers can hide a lot. The classic demonstration of this is a set of four datasets created by the statistician Francis Anscombe. All four have nearly identical correlation coefficients (around 0.816), the same means, and the same variances. Yet when plotted, they look completely different.
One shows a clean linear relationship, exactly what you’d expect. The second shows a perfect curved relationship that isn’t linear at all, meaning the Pearson correlation is the wrong tool. The third is a tight linear relationship thrown off by a single outlier. The fourth shows a cluster of points at one x-value with a single distant point creating the illusion of correlation where none exists among the rest of the data. Without plotting your data, you have no way to distinguish between these situations. The number alone can’t tell you.
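The quartet is small enough to verify yourself. This sketch uses Anscombe's published values and a hand-rolled Pearson function:

```python
def pearson_r(x, y):
    """Pearson correlation from the definition."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Anscombe's quartet: datasets 1-3 share the same x-values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

datasets = [(x123, y1), (x123, y2), (x123, y3), (x4, y4)]
for i, (x, y) in enumerate(datasets, start=1):
    print(f"Dataset {i}: r = {pearson_r(x, y):.3f}")  # all about 0.816
```

Four very different shapes, one number. Only a plot can tell them apart.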
How Outliers Distort the Picture
Pearson’s correlation is highly sensitive to outliers. A single extreme data point can dramatically inflate or deflate the coefficient. Research using simulated data has shown that one outlier can reduce a correlation by 50% or even completely reverse it, flipping a strong positive relationship into a negative one depending on where the outlier falls in the data.
Outliers positioned between the two variables’ clusters tend to pull the correlation toward zero, making a real relationship look weaker. Outliers positioned far from the main cluster but along the direction of the trend can inflate the correlation, making a weak relationship look stronger. This is why checking scatter plots isn’t optional. If you spot one or two data points sitting far from the rest, they may be driving the entire result.
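A quick sketch with made-up data shows how dramatic the effect can be. Ten points on a perfect line give r = 1; adding one extreme point flips the sign:

```python
def pearson_r(x, y):
    """Pearson correlation from the definition."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

x = list(range(10))  # perfect positive relationship: y = x
y = list(range(10))
print(f"clean data:     r = {pearson_r(x, y):+.3f}")  # +1.000

# One extreme point: x = 0 paired with y = 100
x_out = x + [0]
y_out = y + [100]
print(f"with 1 outlier: r = {pearson_r(x_out, y_out):+.3f}")  # now negative
```

Ten clean points say the relationship is perfectly positive; one badly placed eleventh point is enough to make the coefficient report the opposite direction.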
Pearson vs. Spearman Correlation
The most common correlation coefficient is Pearson's, which measures linear relationships. It assumes that the data are roughly normally distributed, that the relationship between the variables is a straight line, and that the spread of the data points is consistent across the range rather than fanning out at one end.
When those assumptions don’t hold, Spearman’s correlation is often more appropriate. Instead of working with the raw values, Spearman’s ranks the data and measures whether those ranks move together in a consistent direction. This makes it better at detecting non-linear relationships where one variable consistently increases as the other increases, even if the increase isn’t at a constant rate.
An interesting diagnostic: if you find a weak Pearson correlation but a strong Spearman correlation, a relationship likely exists but isn’t linear. This can point you toward investigating what the actual shape of the relationship is. And despite the common advice, Pearson’s coefficient isn’t useless with non-normal data. It can still detect trends in many situations and sometimes even outperforms Spearman’s when extreme values carry meaningful information. The safest approach is to run both and compare.
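Running both is easy to sketch. Spearman's coefficient is just Pearson's applied to the ranks of the data (the simple version below skips tie handling). The invented example uses a strictly increasing but explosively non-linear relationship, so the two coefficients disagree sharply:

```python
def pearson_r(x, y):
    """Pearson correlation from the definition."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman_r(x, y):
    """Spearman correlation: Pearson applied to ranks (no tie handling)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = float(rank)
        return r
    return pearson_r(ranks(x), ranks(y))

x = list(range(1, 11))
y = [10 ** v for v in x]  # strictly increasing, but wildly non-linear
print(f"Pearson:  {pearson_r(x, y):.3f}")   # well below 1: a line fits badly
print(f"Spearman: {spearman_r(x, y):.3f}")  # 1.000: the ranks agree perfectly
```

This is exactly the diagnostic pattern described above: a mediocre Pearson value alongside a perfect Spearman value signals a monotonic but non-linear relationship.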
Correlation Does Not Mean Causation
This is the most repeated warning in statistics, and for good reason. A correlation between two variables can arise in at least four ways: pure coincidence, variable X causes Y, variable Y causes X, or a hidden third variable causes both. Correlation alone cannot tell you which explanation is correct.
The third-variable problem is especially common. During the early research on smoking and cancer, one counterargument was that perhaps smokers also tend to drink more alcohol, and it was alcohol, not smoking, causing the cancer. This is what a confounding variable looks like: something you haven’t accounted for that creates the appearance of a direct relationship between the two things you measured. It took decades of additional evidence, including controlled experiments and dose-response data, to establish the causal link between smoking and lung cancer. Correlation was the starting point, not the proof.
When you encounter a correlation, train yourself to ask: what else could explain this? Is there a plausible mechanism connecting these two variables? Could the direction be reversed? Could something unmeasured be driving both? These questions are more valuable than the coefficient itself.
Putting It All Together
Interpreting a correlation coefficient well means checking several things in sequence. First, look at the sign to understand direction. Then check the magnitude against appropriate benchmarks for your field. Square it to see how much variance is actually shared. Confirm the p-value shows the result is unlikely due to chance. Plot the data to verify the relationship is genuinely linear and not driven by outliers or a curved pattern. And resist the temptation to assume that one variable is causing the other to change.
A correlation of 0.6 between exercise frequency and mood scores, for example, tells you the relationship is moderate to large, positive, and explains about 36% of the variation in mood. It does not tell you that exercise improves mood, that better mood leads to more exercise, or that some third factor like overall health drives both. Each of those is possible, and the correlation coefficient treats them all identically.
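The descriptive part of that checklist can be sketched as a small helper. The labels and thresholds below are the rough benchmarks from this article, not universal constants, and should be adjusted for your field:

```python
def describe_correlation(r):
    """Summarize direction, rough strength, and variance explained.
    Thresholds follow this article's rule-of-thumb benchmarks."""
    direction = "positive" if r > 0 else "negative" if r < 0 else "none"
    magnitude = abs(r)
    if magnitude >= 0.7:
        label = "strong"
    elif magnitude >= 0.5:
        label = "large"
    elif magnitude >= 0.3:
        label = "medium"
    elif magnitude >= 0.1:
        label = "small"
    else:
        label = "negligible"
    return {"direction": direction, "strength": label,
            "variance_explained": r * r}

print(describe_correlation(0.6))   # positive, large, 36% of variance
print(describe_correlation(-0.8))  # negative, strong, 64% of variance
```

Note what the helper deliberately cannot do: it says nothing about causation, outliers, or linearity. Those checks still require the scatter plot and your own judgment.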

