What Graph to Use for Correlation by Data Type

A scatter plot is the default graph for showing correlation between two continuous variables, and for most situations it’s the right choice. But the best graph depends on your data type, sample size, and whether the relationship is linear or curved. Here’s how to pick the right one.

Scatter Plots: The Standard Choice

A scatter plot places one variable on the x-axis and the other on the y-axis, with each data point as a dot. The resulting cloud of dots immediately reveals the direction, strength, and shape of a relationship. A tight cluster rising from left to right signals a strong positive correlation. A loose, shapeless cloud signals little or no correlation. No other graph communicates this as intuitively.

Adding a trend line (also called a line of best fit) makes the pattern even clearer. A straight trend line works when the relationship is linear. If the data curves, a smoothed line (sometimes called a LOESS curve) will follow the actual shape of the relationship rather than forcing a straight line through it. Most charting tools can add either type with a click or a single line of code.
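
As a minimal sketch of both kinds of trend line, the Python below (assuming NumPy and Matplotlib are installed) draws a scatter plot of synthetic curved data, then overlays a straight least-squares line and a smoothed line. The smoother is a simplified LOESS-style local linear fit written for illustration, not the exact algorithm any particular charting tool uses. The function name `local_linear_smooth`, the `frac` setting, and the data are all assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt


def local_linear_smooth(x, y, frac=0.3, grid_size=60):
    """LOESS-style smoother: a tricube-weighted straight-line fit at each grid point."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    k = max(3, int(np.ceil(frac * len(x))))  # number of neighbours per local fit
    grid = np.linspace(x.min(), x.max(), grid_size)
    smoothed = np.empty(grid_size)
    for i, x0 in enumerate(grid):
        dist = np.abs(x - x0)
        nearest = np.argsort(dist)[:k]
        bandwidth = max(dist[nearest].max(), 1e-12)
        weights = (1 - (dist[nearest] / bandwidth) ** 3) ** 3  # tricube kernel
        # np.polyfit squares (w * residual), so pass the square root of the weights.
        slope, intercept = np.polyfit(x[nearest], y[nearest], 1, w=np.sqrt(weights))
        smoothed[i] = slope * x0 + intercept
    return grid, smoothed


# Synthetic data (an assumption for illustration): a curved signal plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 3 * np.sin(x / 2) + rng.normal(0, 0.5, 200)

slope, intercept = np.polyfit(x, y, 1)      # straight line of best fit
grid, smoothed = local_linear_smooth(x, y)  # follows the curve instead

fig, ax = plt.subplots()
ax.scatter(x, y, s=12, alpha=0.6, label="data")
ax.plot(grid, slope * grid + intercept, color="tab:red", label="straight trend line")
ax.plot(grid, smoothed, color="tab:green", label="smoothed trend line")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
```

Call `plt.show()` or `fig.savefig(...)` to view the figure. A larger `frac` uses more neighbours per fit and gives a stiffer curve; a smaller one follows local wiggles more closely.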

Why You Should Always Plot Your Data

A correlation coefficient alone can be deeply misleading. The classic demonstration is Anscombe’s quartet: four completely different datasets that all produce the same correlation value (0.816), the same line of best fit (y = 3.00 + 0.500x), and the same R² of 0.67. One dataset is a clean linear relationship. Another is a perfect curve. A third is a straight line with a single extreme outlier pulling the numbers off track. The fourth has all points stacked at one x-value except for one outlier that single-handedly creates the apparent correlation.

The numbers look identical. The graphs look nothing alike. Summary statistics compress your data into a single number, and by definition that means losing information. A graph preserves the shape, the outliers, and the surprises. Always visualize correlation before trusting a number.
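
To check the claim yourself, the sketch below recomputes the summary statistics for the four datasets using only the Python standard library. The values are the published Anscombe (1973) data; the helper name `summarize` is an assumption for this example.

```python
from math import sqrt

# Anscombe's quartet (Anscombe, 1973). Sets I to III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Return (Pearson r, slope, intercept) for the line y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return sxy / sqrt(sxx * syy), slope, mean_y - slope * mean_x


results = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, (r, slope, intercept) in results.items():
    print(f"{name}: r = {r:.3f}, fit: y = {intercept:.2f} + {slope:.3f}x, R^2 = {r * r:.2f}")
```

The four printed lines agree to the precision quoted above, which is exactly why the numbers alone cannot tell the datasets apart. Plotting each pair as a scatter plot is what reveals the difference.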

Handling Large Datasets

Scatter plots start to break down when you have thousands of data points. The dots pile on top of each other, a problem called overplotting, and the plot becomes an unreadable blob. With as few as 4,000 points, a standard scatter plot can lose all clarity.

Two alternatives work well here. A hexbin plot (hexagonal heatmap) divides the plotting area into hexagonal bins and uses color to show how many points fall in each bin. Dense regions appear as darker or warmer-colored hexagons, while sparse regions stay light. It’s essentially a two-dimensional histogram. A contour density plot does something similar but draws smooth contour lines around areas of concentration, like a topographic map of your data. Both preserve the shape of the correlation while making density visible instead of hidden under overlapping dots.

If your dataset is large but not enormous, you can also try reducing dot opacity so overlapping points create darker regions naturally. This is a quick fix that works in the range of a few hundred to a couple thousand points.
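
The sketch below compares three of these treatments on the same synthetic data: opaque dots, low-opacity dots, and a hexbin plot. It assumes NumPy and Matplotlib; the sample size, `alpha`, `gridsize`, and colormap are illustrative choices rather than recommended settings.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic data (an assumption for illustration): 20,000 correlated points.
rng = np.random.default_rng(42)
n = 20_000
x = rng.normal(size=n)
y = 0.7 * x + rng.normal(scale=0.7, size=n)

fig, (ax_plain, ax_alpha, ax_hex) = plt.subplots(
    1, 3, figsize=(13, 4), sharex=True, sharey=True
)

# Opaque dots: the dense centre turns into a solid blob.
ax_plain.scatter(x, y, s=5)
ax_plain.set_title("Opaque dots (overplotted)")

# Low opacity: overlapping dots add up to darker regions.
ax_alpha.scatter(x, y, s=5, alpha=0.05)
ax_alpha.set_title("Low opacity (alpha=0.05)")

# Hexbin: colour encodes how many points fall in each hexagon.
hexes = ax_hex.hexbin(x, y, gridsize=40, cmap="Blues")
ax_hex.set_title("Hexbin")
fig.colorbar(hexes, ax=ax_hex, label="points per bin")

counts = hexes.get_array()  # per-hexagon point counts behind the colours
```

With the `"Blues"` colormap, denser hexagons are drawn darker, matching the description above. A smaller `gridsize` gives larger hexagons and a coarser picture.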

Correlation Between Many Variables at Once

When you need to check correlations across a whole set of variables simultaneously, a correlation matrix heatmap is the standard tool. It calculates the correlation coefficient for every possible pair of variables, then displays the results in a grid where color represents the strength and direction of each relationship. Typically the grid uses a diverging palette: strong positive correlations sit at one end (red in some palettes, dark blue in others), strong negative correlations sit at the opposite end, and values near zero appear as neutral or white. Because the red and blue convention varies between tools, check the color legend before reading the grid.

This lets you scan dozens of variable pairs at a glance and quickly spot which ones are strongly related. Some versions draw a circle in each cell; the circle’s size usually encodes the magnitude of the correlation, and coefficients that are not statistically significant may be marked or left blank. Heatmaps don’t show the shape of individual relationships, though. If you spot a strong correlation in the matrix, follow up with a scatter plot of that specific pair to confirm the pattern is real and linear.
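
A minimal sketch of such a heatmap, assuming NumPy and Matplotlib, is shown below. The four variables and their names are invented for illustration: one tracks the first variable, one moves against it, and one is unrelated.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic variables (assumptions for illustration only).
rng = np.random.default_rng(7)
n = 300
a = rng.normal(size=n)
data = np.column_stack([
    a,
    a + 0.5 * rng.normal(size=n),   # moves with a
    -a + 0.5 * rng.normal(size=n),  # moves against a
    rng.normal(size=n),             # unrelated to a
])
names = ["a", "b (tracks a)", "c (opposes a)", "d (unrelated)"]

# rowvar=False: each column is a variable, each row an observation.
corr = np.corrcoef(data, rowvar=False)

fig, ax = plt.subplots()
# Diverging palette fixed to [-1, 1]: here red is positive, blue is negative.
image = ax.imshow(corr, cmap="RdBu_r", vmin=-1, vmax=1)
ax.set_xticks(range(len(names)))
ax.set_xticklabels(names, rotation=45, ha="right")
ax.set_yticks(range(len(names)))
ax.set_yticklabels(names)
for i in range(len(names)):
    for j in range(len(names)):
        ax.text(j, i, f"{corr[i, j]:.2f}", ha="center", va="center")
fig.colorbar(image, ax=ax, label="Pearson r")
fig.tight_layout()
```

Pinning `vmin` and `vmax` to -1 and 1 keeps zero at the neutral midpoint of the palette, so the colors stay comparable between different heatmaps.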

When the Relationship Is Curved

Standard correlation (Pearson’s r) measures linear relationships. If your data follows a curve, Pearson’s r will underestimate or miss the actual relationship. A scatter plot makes this obvious: you’ll see a clear pattern that a straight trend line doesn’t capture.

For curved relationships, you have a few options. If the relationship is monotonic (consistently increasing or decreasing, just not in a straight line), Spearman’s rank correlation captures it better than Pearson’s. You can visualize Spearman correlation by plotting the ranks of each variable against each other instead of the raw values. If the data are correlated, the ranked points cluster near a diagonal line. If uncorrelated, they scatter randomly across a square. This ranked scatter plot strips away the curve and shows the monotonic association clearly.
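
The ranked scatter plot can be sketched as follows, assuming NumPy and Matplotlib. The exponential data and the helper `ranks` are assumptions for illustration; the helper ignores ties, which is safe for continuous random draws but not for real data with repeated values.

```python
import numpy as np
import matplotlib.pyplot as plt


def ranks(values):
    """Ranks from 1 to n. Assumes no tied values; ties would need average ranks."""
    values = np.asarray(values)
    order = np.argsort(values)
    result = np.empty(len(values))
    result[order] = np.arange(1, len(values) + 1)
    return result


# Synthetic data (an assumption): monotonic but strongly curved.
rng = np.random.default_rng(1)
x = rng.uniform(0, 5, 150)
y = np.exp(x) + rng.normal(0, 1, 150)

pearson_raw = np.corrcoef(x, y)[0, 1]
# Spearman's rho is Pearson's r computed on the ranks.
spearman = np.corrcoef(ranks(x), ranks(y))[0, 1]
print(f"Pearson on raw values: {pearson_raw:.2f}")
print(f"Spearman (Pearson on ranks): {spearman:.2f}")

fig, (ax_raw, ax_rank) = plt.subplots(1, 2, figsize=(9, 4))
ax_raw.scatter(x, y, s=12)
ax_raw.set_title("Raw values (curved)")
ax_rank.scatter(ranks(x), ranks(y), s=12)
ax_rank.set_title("Ranks (near the diagonal)")
```

On data like this the rank-based coefficient comes out higher than Pearson’s r on the raw values, and the right-hand panel shows the points hugging the diagonal.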

If the curve is not monotonic, like a U-shape or an S-shape that rises, dips, and rises again, neither Pearson nor Spearman will capture it well, because both expect the relationship to keep moving in one direction. A scatter plot with a smoothed trend line is the best visual tool. A quadratic curve fits parabolic (U-shaped) patterns, a cubic curve fits S-shapes, and higher-order curves can capture even more complex patterns like M or W shapes. The scatter plot itself is still the foundation. You’re just changing the trend line to match the data’s actual shape.
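
As a sketch of the U-shaped case, the code below (assuming NumPy and Matplotlib, with synthetic data) shows Pearson’s r landing near zero while a quadratic trend line recovers the pattern.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic U-shaped data (an assumption for illustration).
rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, 200)
y = x ** 2 + rng.normal(0, 0.5, 200)

# Pearson's r is close to zero even though y clearly depends on x.
pearson = np.corrcoef(x, y)[0, 1]

# Degree-2 polynomial fit; coefficients come back highest power first.
coeffs = np.polyfit(x, y, deg=2)
grid = np.linspace(x.min(), x.max(), 100)
curve = np.polyval(coeffs, grid)

fig, ax = plt.subplots()
ax.scatter(x, y, s=12, alpha=0.6, label=f"data (Pearson r = {pearson:.2f})")
ax.plot(grid, curve, color="tab:red", label="quadratic trend line")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
```

Changing `deg` to 3 or higher fits the more complex shapes, though high-degree polynomials tend to behave erratically near the edges of the data, so a smoothed line is often the safer default.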

One Categorical and One Continuous Variable

If one of your variables is categorical (like treatment group, country, or yes/no status), a scatter plot doesn’t apply in the traditional sense. Side-by-side box plots are the go-to choice here. Each category gets its own box plot showing the distribution of the continuous variable, and you compare the position and spread of the boxes across groups. Differences in the median and range between groups reveal the association.

For example, if you want to see whether fuel efficiency differs by country of origin for a set of cars, you’d place origin on the x-axis and miles per gallon on the y-axis, with a separate box plot for each country. The relative positions of the boxes tell you whether and how strongly the categorical variable relates to the continuous one.
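
A minimal sketch of that layout, assuming NumPy and Matplotlib, is shown below. The miles-per-gallon values are randomly generated and the group labels are placeholders; none of this is real car data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Randomly generated stand-in data (an assumption, not real measurements).
rng = np.random.default_rng(5)
groups = {
    "Origin A": rng.normal(20, 4, 80),
    "Origin B": rng.normal(27, 5, 80),
    "Origin C": rng.normal(30, 5, 80),
}

fig, ax = plt.subplots()
boxes = ax.boxplot(list(groups.values()))  # one box per category
ax.set_xticklabels(list(groups.keys()))
ax.set_xlabel("Origin (categorical)")
ax.set_ylabel("Miles per gallon (continuous)")

# The numbers behind the visual comparison of box positions.
medians = {name: float(np.median(values)) for name, values in groups.items()}
```

If the boxes barely overlap, the categorical variable is strongly related to the continuous one; if they sit at about the same height, it is not.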

Time-Based Correlations

When both variables are measured over time, a standard scatter plot can miss an important detail: the relationship might involve a delay. One variable may lead the other by days, weeks, or months. The classic tool for this is a cross-correlation plot, which shows the correlation coefficient at different time lags. The x-axis represents the lag (how far you shift one series relative to the other), and the y-axis shows the strength of correlation at each lag.

The peak of the cross-correlation plot tells you two things: the lag at which the relationship is strongest and how strong that relationship is. If the highest peak sits at lag 3, for instance, it means one variable leads the other by three time periods. Which variable leads depends on the sign convention of your software, so check how it defines a positive lag. For time series data, this is far more informative than a single scatter plot, which discards all the timing information.
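
The sketch below builds a cross-correlation plot by hand, assuming NumPy and Matplotlib. The helper `lagged_corr` and the synthetic series are assumptions for illustration. In this helper a positive lag pairs `x[t]` with `y[t + lag]`, so a peak at a positive lag means `x` leads `y`. It also computes a plain Pearson correlation on the overlapping part of the two series at each lag, which is a simplification of the estimators that statistics packages use.

```python
import numpy as np
import matplotlib.pyplot as plt


def lagged_corr(x, y, lag):
    """Pearson correlation between x[t] and y[t + lag] over the overlapping span."""
    if lag > 0:
        a, b = x[:-lag], y[lag:]
    elif lag < 0:
        a, b = x[-lag:], y[:lag]
    else:
        a, b = x, y
    return np.corrcoef(a, b)[0, 1]


# Synthetic series (an assumption): y repeats x three steps later, plus noise.
rng = np.random.default_rng(11)
n, true_lag = 300, 3
driver = rng.normal(size=n + true_lag)
x = driver[true_lag:]
y = driver[:n] + rng.normal(scale=0.5, size=n)  # y[t] = x[t - 3] + noise

lags = np.arange(-10, 11)
values = np.array([lagged_corr(x, y, int(k)) for k in lags])
best_lag = int(lags[np.argmax(values)])
print(f"Strongest correlation at lag {best_lag}: r = {values.max():.2f}")

fig, ax = plt.subplots()
ax.bar(lags, values)
ax.axhline(0, color="black", linewidth=0.8)
ax.set_xlabel("Lag (positive: x leads y)")
ax.set_ylabel("Correlation")
```

A scatter plot of `x` against `y` at lag 0 would show almost no relationship for this data; the delay only becomes visible once the lag axis is added.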

Quick Reference by Data Type

  • Two continuous variables, small to moderate sample: scatter plot with trend line
  • Two continuous variables, thousands of points: hexbin plot or contour density plot
  • Many variable pairs at once: correlation matrix heatmap
  • Curved but monotonic relationship: ranked scatter plot (for Spearman correlation)
  • Complex curved relationship: scatter plot with smoothed or polynomial trend line
  • One categorical, one continuous variable: side-by-side box plots
  • Two time series with possible delay: cross-correlation plot