How to Analyze a Scatter Plot: Direction, Strength & Form

Analyzing a scatter plot comes down to four things: the direction of the pattern, how strong it is, what shape it takes, and whether any points don’t belong. Once you know how to evaluate each of these, you can pull meaningful conclusions from nearly any scatter plot you encounter.

Start With the Axes

Before interpreting the pattern, make sure you understand what’s being measured. By convention, the horizontal x-axis shows the explanatory (independent) variable, and the vertical y-axis shows the response (dependent) variable. The explanatory variable is the one you suspect might influence the other. If a scatter plot shows hours of study on the x-axis and exam scores on the y-axis, the implied question is “does studying more lead to higher scores?”

If you skip this step, you’ll misread everything else. A dot in the upper-right corner means something completely different depending on whether the axes represent temperature and ice cream sales or age and bone density.

Identify the Direction

Direction tells you whether the two variables move together or in opposite ways. A positive association means that as the x-variable increases, the y-variable also increases. The dots trend upward from left to right. A negative association means the opposite: as one variable goes up, the other goes down, creating a downward slope from left to right.

Some scatter plots show no clear direction at all. The dots look like a shapeless cloud with no upward or downward trend. That’s a sign the two variables have little or no relationship.

Assess the Strength

Strength describes how tightly the data points cluster around a clear pattern. If the dots hug closely to an imaginary line or curve, the relationship is strong. If they’re loosely scattered but still trend in a general direction, the relationship is weak.

You can eyeball strength, but the correlation coefficient (r) puts a number on it. This value ranges from -1 to +1. A value of +1 or -1 means every point falls perfectly on a line, while 0 means no linear relationship at all. The sign tells you direction (positive or negative), and the distance from zero tells you strength.

General guidelines for interpreting r values vary slightly by field, but a common framework looks like this:

  • 0.7 to 1.0 (or -0.7 to -1.0): Strong relationship
  • 0.4 to 0.7 (or -0.4 to -0.7): Moderate relationship
  • 0.1 to 0.4 (or -0.1 to -0.4): Weak relationship
  • Between -0.1 and 0.1: Little or no linear relationship

These thresholds aren’t universal. In psychology, an r of 0.3 is typically considered weak. In political science, the same value might be called moderate. Context matters, so consider what’s normal for the type of data you’re looking at.
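Because the cutoffs shift from field to field, one way to apply them is a small helper with adjustable thresholds. The function below is a hypothetical sketch: the name describe_strength and its parameters are invented for this example, the defaults loosely follow the framework above, and treating everything from 0.1 up to the moderate cutoff as weak is an assumption of the sketch.

```python
def describe_strength(r: float, moderate: float = 0.4, strong: float = 0.7) -> str:
    """Label the size of r using adjustable, field-dependent cutoffs."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must be between -1 and +1")
    size = abs(r)  # strength ignores the sign; the sign only gives direction
    if size >= strong:
        return "strong"
    if size >= moderate:
        return "moderate"
    if size >= 0.1:
        return "weak"
    return "little or no linear relationship"


print(describe_strength(0.85))               # strong
print(describe_strength(-0.5))               # moderate
print(describe_strength(0.3))                # weak
print(describe_strength(0.3, moderate=0.3))  # moderate, under a looser cutoff
```

Passing a different moderate or strong value lets the same r earn a different label, which is the point of the paragraph above.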

Determine the Form

Form refers to the shape the data points create. The two broad categories are linear and nonlinear.

A linear relationship means the data follows a roughly straight-line pattern. For every unit increase in x, y changes by a fairly consistent amount. An airline finding that every dollar increase in jet fuel price raises flight costs by about $3,500 is a classic linear relationship: the rate of change stays steady.

A nonlinear relationship produces a curved pattern. The rate of change speeds up, slows down, or reverses as you move along the x-axis. Think of how a car’s fuel efficiency might improve as speed increases from 20 to 50 mph, then drop sharply above 70 mph. That rise-then-fall pattern creates a curve, not a straight line. Common nonlinear shapes include U-curves, exponential curves (sharp upward or downward bends), and S-curves.

This distinction matters because the correlation coefficient r only measures linear relationships. A scatter plot with a perfect U-shape could have an r near zero, even though the two variables are clearly related. If the data looks curved, r alone won’t capture what’s happening.

Spot Outliers

An outlier is a data point that doesn’t follow the general trend of the rest of the data. Visually, it sits noticeably far from where most of the other dots cluster.

Not all outliers matter equally. Some are harmless. If a data point is off to the side but the overall pattern barely changes with or without it, it’s an outlier that isn’t influential. In one Penn State analysis, removing an outlier only shifted the slope of the trend line from 5.04 to 5.12, a trivial difference.

Other outliers can dramatically distort your results. In another example from the same analysis, a single influential outlier pulled the slope from 5.12 down to 3.32 and dropped r² (the share of the variation in y that the line explains, a standard measure of how well the line fits the data) from 97% to 55%. That’s the difference between concluding “these variables are very strongly related” and “the relationship is only moderate.” When you spot an outlier, it’s worth asking whether it represents a data entry error, a genuinely unusual case, or a sign that your model doesn’t fit the data well.

Look for Clusters and Gaps

Sometimes a scatter plot reveals distinct groupings of data points separated by empty spaces. These clusters can signal that your data contains separate subgroups. For example, if you plotted height versus weight for a mixed population of children and adults, you’d likely see two distinct clusters rather than one continuous spread.

Gaps, the empty spaces where no data points land, can be meaningful or meaningless. A gap might reflect a real divide in the population you’re studying, or it might just be an artifact of a small sample size that would disappear with more data. When you see clusters, consider whether there’s a hidden variable (like age group, geographic region, or category) that splits the data into natural subgroups.

Use a Line of Best Fit

A line of best fit (also called a regression line or trend line) is a straight line drawn through the scatter plot to approximate the overall pattern. Its main purpose is prediction: once you have the line, you can estimate what y-value to expect for a given x-value, even for x-values you haven’t directly measured.

The correlation coefficient tells you how well the line fits. The closer the absolute value of r is to 1, the more reliably the line represents the data. If r is close to zero, the line is essentially meaningless, just a line drawn through noise.

Keep in mind that a line of best fit only works within the range of your data. Extending it far beyond your highest or lowest x-values (called extrapolation) is risky because you’re assuming the same pattern continues where you have no evidence.

Correlation Is Not Causation

This is the single most important rule when drawing conclusions from a scatter plot. A strong correlation between two variables does not prove that one causes the other. There are two common traps.

The first is the third-variable problem. Two variables can move together because a hidden third factor drives both of them. Ice cream sales and drowning deaths are positively correlated, but ice cream doesn’t cause drowning. Hot weather increases both. If you’re not accounting for that confounding variable, the scatter plot will mislead you into seeing a direct connection that doesn’t exist.

The second trap is reverse causation. Even when two variables are genuinely connected, the scatter plot alone can’t tell you which one is driving the other. A plot showing that cities with more hospitals also have more disease doesn’t mean hospitals cause disease. More plausibly, the influence runs the other way: hospitals get built where there are more sick people to treat. (City size can also act as a third variable here, since larger cities have both more hospitals and more sick people.)

A scatter plot tells you that a relationship exists and how strong it appears. Determining why the relationship exists requires a different kind of analysis entirely, typically a controlled experiment that can isolate one variable’s effect on the other.