Linear correlation is a statistical measure of how closely two variables move together in a straight-line pattern. It’s expressed as a single number, called the correlation coefficient (r), that ranges from −1 to +1. A value of +1 means the two variables rise together in perfect lockstep, −1 means one falls perfectly as the other rises, and 0 means there’s no linear relationship at all.
How the Correlation Coefficient Works
The most common version of this measure is the Pearson correlation coefficient, often just written as “r.” It captures two things at once: the direction of a relationship (positive or negative) and its strength (how tightly the data points cluster around a straight line).
A positive r means both variables tend to increase together. Taller people, for instance, tend to have longer arm spans. A negative r means one variable tends to decrease as the other increases. The closer r gets to +1 or −1, the more predictable that pattern becomes. When r sits near zero, the two variables don’t follow any consistent straight-line trend.
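Under the hood, Pearson's r is the covariance of the two variables divided by the product of their standard deviations. A minimal Python sketch, using invented height and arm-span numbers purely for illustration:

```python
import math

def pearson_r(xs, ys):
    """Pearson's r: covariance divided by the product of the two
    standard deviations (the 1/n factors cancel out)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical heights (cm) and arm spans (cm) for six people:
heights = [150, 160, 165, 172, 180, 188]
spans   = [148, 162, 167, 170, 182, 190]
print(round(pearson_r(heights, spans), 3))  # close to +1
```

In practice you would reach for a library routine such as `scipy.stats.pearsonr`, which also reports a p-value; the hand-rolled version above just makes the formula visible.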
What the Numbers Actually Mean
Interpreting strength isn’t perfectly standardized. Different fields use slightly different thresholds, but a widely used framework from psychology research breaks it down like this:
- 0.7 to 1.0 (or −0.7 to −1.0): Strong correlation. The two variables track each other closely.
- 0.4 to 0.6 (or −0.4 to −0.6): Moderate correlation. There’s a clear trend, but plenty of scatter around it.
- 0.1 to 0.3 (or −0.1 to −0.3): Weak correlation. A slight tendency exists, but it’s not very useful for prediction.
- 0: No correlation. The variables have no linear relationship.
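One way to turn those bands into labels in code is sketched below. It is only one convention: how to handle values that fall in the gaps between bands (say, 0.35) is a judgment call, and as noted next, different fields draw the lines differently.

```python
def describe_strength(r):
    """Rough verbal label for a correlation coefficient, following the
    bands above. Cutoffs for in-between values are a judgment call."""
    a = abs(r)
    if a >= 0.7:
        return "strong"
    if a >= 0.4:
        return "moderate"
    if a >= 0.1:
        return "weak"
    return "negligible"

print(describe_strength(0.85))   # strong
print(describe_strength(-0.5))   # moderate
print(describe_strength(0.2))    # weak
```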
These labels shift depending on the field. In medicine, an r of 0.5 might be considered only “fair,” while in political science the same value could be called “strong.” Context matters. A correlation of 0.3 between a lifestyle habit and a health outcome can be meaningful when studied across thousands of people, even though it wouldn’t feel impressive as a raw number.
What It Looks Like on a Graph
The easiest way to grasp linear correlation is through a scatter plot, where each data point represents a pair of measurements. With a strong positive correlation, the dots climb from lower left to upper right in a tight band. With a strong negative correlation, they fall from upper left to lower right. When there’s no correlation, the dots look like a random cloud with no discernible pattern.
The key word is “linear.” If the data follows a curve, like a U-shape or an arc, the Pearson r can be misleadingly low even though a clear relationship exists. The coefficient only detects straight-line patterns.
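This limitation is easy to demonstrate with toy data. In the sketch below, y is a perfect function of x (a symmetric U-shaped parabola), yet Pearson's r comes out as exactly zero:

```python
import math

def pearson_r(xs, ys):
    # Pearson's r: covariance over the product of standard deviations
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# A perfect U-shape: y is fully determined by x (y = x**2),
# yet the straight-line correlation vanishes.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]
print(pearson_r(xs, ys))  # → 0.0
```

A scatter plot would reveal the parabola instantly, which is another argument for always plotting the data before trusting the coefficient.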
Real-World Examples
Height and arm span show a strong positive linear correlation. Measure a few hundred people and you’ll find that taller individuals almost always have longer arm spans, producing an r close to +1. BMI and blood pressure also tend to have a positive linear relationship, though it’s usually moderate rather than near-perfect. Researchers at the University of Texas found a moderately strong positive correlation (r = 0.65) between how happy someone reported being and how funny other people rated them.
Negative correlations show up when one variable decreases as the other increases. Hours of physical activity per week and resting heart rate, for example, tend to move in opposite directions.
Why Outliers Can Wreck the Result
One of the biggest practical pitfalls of the Pearson coefficient is its sensitivity to extreme data points. A single outlier can reduce the correlation by 50% or completely reverse its direction. In one demonstration published in Frontiers in Psychology, researchers showed that moving just one data point to an extreme position changed a perfect correlation of r = 1.0 to r = −0.51. That’s a complete flip from “perfectly positive” to “moderately negative,” caused by a single unusual observation.
This is why checking your scatter plot before trusting the number matters so much. If one or two points sit far from the rest of the cluster, they can pull the entire result in a misleading direction.
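The flip is easy to reproduce with toy data: five perfectly correlated points, then the same five with one y value dragged to an extreme. (These numbers are illustrative, not the ones from the Frontiers in Psychology demonstration.)

```python
import math

def pearson_r(xs, ys):
    # Pearson's r: covariance over the product of standard deviations
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs         = [1, 2, 3, 4, 5]
ys_clean   = [1, 2, 3, 4, 5]    # perfect positive correlation
ys_outlier = [1, 2, 3, 4, -20]  # one point dragged to an extreme

print(round(pearson_r(xs, ys_clean), 3))    # → 1.0
print(round(pearson_r(xs, ys_outlier), 2))  # negative: the sign has flipped
```

Four of the five points still lie exactly on a rising line, yet the single outlier is enough to turn a perfect positive correlation into a moderately negative one.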
Assumptions Behind the Calculation
For the Pearson r to give you a reliable result, a few conditions need to hold. The relationship between the variables should actually be linear, not curved. Both variables should be measured on a continuous scale (things like temperature, weight, or time, not categories like “agree/disagree”). The spread of data points should be roughly even along the full length of the line, a property called homoscedasticity. If the points fan out like a cone, getting more scattered at higher values, the coefficient becomes less trustworthy. Both variables should also follow an approximately normal (bell-curve) distribution.
When these assumptions are violated, especially when the data is ranked or ordinal rather than continuous, a different measure called the Spearman correlation is more appropriate. Instead of working with raw values, the Spearman method ranks each observation and correlates the ranks. It can detect any consistent increasing or decreasing relationship, not just straight-line ones.
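A minimal sketch of the rank-based idea follows; it assumes no tied values for brevity (real implementations such as `scipy.stats.spearmanr` assign tied observations the average of their ranks). The data is strictly increasing but wildly nonlinear, so Spearman's rho is a perfect 1 while Pearson's r is noticeably lower:

```python
import math

def pearson_r(xs, ys):
    # Pearson's r: covariance over the product of standard deviations
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def to_ranks(values):
    """Rank each value (1 = smallest). Assumes no ties, for brevity."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

def spearman_rho(xs, ys):
    # Spearman = Pearson computed on the ranks instead of the raw values
    return pearson_r(to_ranks(xs), to_ranks(ys))

# Strictly increasing, but far from a straight line:
xs = [1, 2, 3, 4, 5]
ys = [1, 2, 4, 8, 100]
print(round(pearson_r(xs, ys), 2))     # well below 1
print(round(spearman_rho(xs, ys), 2))  # → 1.0: the ranks agree perfectly
```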
Correlation Is Not Causation
This is the single most important caveat. A strong correlation between two variables does not mean one causes the other. Ice cream sales and sunscreen sales are highly correlated across the year, but buying ice cream doesn’t make people apply sunscreen. Both are driven by a third factor: hot weather. This kind of hidden third variable is called a confounding variable, and it’s behind most of the misleading correlations you’ll encounter.
Smoking and alcohol use are positively correlated, but smoking doesn’t cause alcoholism. The correlation coefficient is purely a measure of association. It tells you that two things move together, not why. Establishing causation requires controlled experiments or carefully designed studies that can rule out alternative explanations. When you see a correlation reported in the news or in a study, the first question to ask is always: could something else be driving both variables at the same time?
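A quick simulation makes the confounding pattern concrete. In the hypothetical sketch below, temperature (the hidden third variable) drives both sales series, neither series influences the other, and the two still come out strongly correlated. All coefficients and noise levels are invented for illustration.

```python
import math
import random

def pearson_r(xs, ys):
    # Pearson's r: covariance over the product of standard deviations
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

random.seed(42)

# Daily temperature is the confounder: it feeds into both series,
# but ice cream sales never touch sunscreen sales (and vice versa).
temps     = [random.uniform(0, 35) for _ in range(1000)]
ice_cream = [20 + 2.0 * t + random.gauss(0, 8) for t in temps]
sunscreen = [5 + 1.5 * t + random.gauss(0, 8) for t in temps]

print(round(pearson_r(ice_cream, sunscreen), 2))  # strongly positive anyway
```

The correlation here is an honest description of the data; what would be dishonest is reading a causal arrow into it.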