What Is Simple Linear Regression and How Does It Work?

Simple linear regression is a statistical method that finds the straight line best fitting the relationship between two variables. One variable (the independent variable) is used to predict or explain the other (the dependent variable). If you’ve ever drawn a “line of best fit” through a scatterplot, you’ve done the core work of simple linear regression by eye. The method just makes it precise and mathematical.

How the Equation Works

The relationship in simple linear regression is captured by a familiar formula: Y = b₀ + b₁X + ε. Here, Y is the outcome you’re trying to predict, X is the variable you’re using to predict it, b₀ is the intercept (the value of Y when X equals zero), and b₁ is the slope (how much Y changes for every one-unit increase in X). The final term, ε, is the error term, representing the natural scatter of real data around the line. No relationship is perfectly clean, and the error term accounts for everything the model doesn’t capture: measurement noise, missing variables, random variation.

The slope is the most important number in the equation. If you’re looking at the relationship between hours of exercise per week and resting heart rate, a slope of -1.5 would mean that each additional hour of weekly exercise is associated with a resting heart rate about 1.5 beats per minute lower. The intercept, meanwhile, is often just a mathematical anchor. It tells you the predicted Y when X is zero, which sometimes makes practical sense and sometimes doesn’t (a predicted heart rate at zero hours of exercise is meaningful; a predicted value at zero years of age is not).
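To make the equation concrete, here’s a minimal sketch using the exercise example. The -1.5 slope comes from the text above; the intercept of 72 bpm is a made-up value for illustration:

```python
# Prediction with a simple linear regression model, Y = b0 + b1*X.
# Coefficients are illustrative: b0 (72 bpm) is a hypothetical resting
# heart rate at zero hours of exercise; b1 is the -1.5 slope from the text.

def predict(x, b0=72.0, b1=-1.5):
    """Predicted Y for a given X under Y = b0 + b1*X."""
    return b0 + b1 * x

print(predict(0))   # 72.0 -- the intercept: zero hours of exercise
print(predict(4))   # 66.0 -- each extra hour lowers the prediction by 1.5
```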

Finding the Best Fit Line

The standard technique for finding the slope and intercept is called ordinary least squares, or OLS. The idea is straightforward: for every data point, measure the vertical distance between the point and the line. Square each of those distances, then add them up. The line that makes that total as small as possible is the best fit. Squaring the distances serves two purposes. It prevents positive and negative errors from canceling each other out, and it penalizes large misses more heavily than small ones. A point that’s 10 units off the line contributes 100 to the total, while a point that’s 2 units off contributes only 4.

This minimization has a single, unique solution. For any dataset whose X values aren’t all identical, there is exactly one line that produces the smallest possible sum of squared distances. That’s what makes OLS so widely used: it’s mathematically clean and computationally fast.
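The minimizing slope and intercept have closed-form formulas: the slope is the sum of cross-deviations Σ(xᵢ − x̄)(yᵢ − ȳ) divided by Σ(xᵢ − x̄)², and the intercept is ȳ minus the slope times x̄. A pure-Python sketch on made-up data:

```python
# Ordinary least squares for one predictor, via the closed-form
# solution: slope = Sxy / Sxx, intercept = mean(y) - slope * mean(x).
# The data points below are illustrative.

def ols_fit(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = ols_fit(xs, ys)
print(b0, b1)  # intercept ~0.05, slope ~1.99
```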

The Four Key Assumptions

Simple linear regression only works well when the data meet four conditions, sometimes remembered by the acronym LINE.

  • Linearity: The relationship between X and Y is actually a straight line, not a curve. If you plot your data and see a U-shape or an S-curve, a straight line will give misleading results.
  • Independence: Each data point is collected independently of the others. Measurements taken from the same person over time, for example, tend to be correlated, which violates this assumption.
  • Normality: At any given value of X, the spread of Y values follows a bell curve. In practice, this means the leftover errors (residuals) after fitting the line should look roughly normally distributed.
  • Equal variance: The spread of Y values is consistent across all values of X. If your data fans out like a cone, with tight clustering on the left and wide scatter on the right, the model’s predictions will be unreliable in the wider region. The technical term for this equal-spread requirement is homoscedasticity.

Violating these assumptions doesn’t always make the model useless, but it can distort the slope estimate, inflate error, or make significance tests unreliable. Checking residual plots after fitting a model is the simplest way to spot problems.
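As a rough illustration of residual checking (the data, the fitted coefficients, and the half-split heuristic below are all made up; real diagnostics rely on residual plots and formal tests):

```python
# Crude residual checks after fitting a line. Illustrative only.

def residuals(xs, ys, b0, b1):
    """Leftover errors: observed Y minus the line's prediction."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Made-up data and a made-up fitted model (intercept 0, slope 2).
xs = [1, 2, 3, 4, 5, 6]
ys = [2.2, 3.9, 6.1, 8.2, 9.8, 12.1]
res = residuals(xs, ys, b0=0.0, b1=2.0)

# The residuals of an OLS fit average out to zero; a clearly nonzero
# mean here would suggest the line is systematically off.
mean_res = sum(res) / len(res)

# Compare residual spread in the lower vs upper half of the X range;
# a large imbalance hints at unequal variance (heteroscedasticity).
half = len(res) // 2
spread_low = sum(r * r for r in res[:half])
spread_high = sum(r * r for r in res[half:])
print(round(mean_res, 3), round(spread_low, 3), round(spread_high, 3))
```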

Measuring How Well the Model Fits

The most common measure of fit is R-squared, also called the coefficient of determination. It tells you what proportion of the variation in Y is explained by X. An R-squared of 0.85 means 85% of the variability in your outcome can be accounted for by the predictor. The remaining 15% is unexplained scatter.

R-squared ranges from 0 to 1 in simple linear regression. A value of 0 means X tells you absolutely nothing about Y, and the best-fit line is flat. A value of 1 means every data point falls exactly on the line, with no scatter at all. In real datasets, perfect scores essentially never happen. What counts as a “good” R-squared depends entirely on the field. In physics experiments with tightly controlled conditions, 0.99 is expected. In social science or medical research, 0.3 or 0.4 can be genuinely useful.
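R-squared can be computed directly as 1 minus the ratio of the residual sum of squares to the total sum of squares. A small sketch on illustrative data:

```python
# R-squared: 1 - (sum of squared residuals) / (total sum of squares
# around the mean of Y). Data and coefficients are illustrative.

def r_squared(xs, ys, b0, b1):
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

xs = [1, 2, 3, 4]
ys = [2.0, 4.1, 5.9, 8.0]   # nearly, but not exactly, on the line Y = 2X
r2 = r_squared(xs, ys, b0=0.0, b1=2.0)
print(r2)  # very close to, but below, 1
```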

Testing Whether the Relationship Is Real

Finding a slope in your data doesn’t automatically mean the relationship is meaningful. Random noise alone can produce a nonzero slope in a small sample. To distinguish signal from noise, you test a specific question: is the true slope in the population actually zero?

This is framed as a hypothesis test. The null hypothesis states that the slope equals zero, meaning X has no linear relationship with Y. The alternative hypothesis states that the slope is not zero. You then calculate a p-value, which represents the probability of seeing a slope as extreme as yours (or more so) if the true relationship were actually flat. If that p-value falls below a preset threshold (typically 0.05), you reject the null hypothesis and conclude there is statistical evidence of a real linear relationship.

A small p-value doesn’t tell you the relationship is strong or practically important. It only tells you the relationship is unlikely to be a fluke of sampling. A tiny slope can be statistically significant with enough data points, even if the effect is too small to matter in practice. That’s why R-squared and the slope’s actual size matter alongside the p-value.
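Under the hood, the test statistic is the estimated slope divided by its standard error, which is then compared against a t-distribution with n − 2 degrees of freedom. A sketch of the t-statistic calculation on made-up data (the p-value lookup itself is omitted; libraries such as scipy.stats provide it):

```python
import math

# t-statistic for the null hypothesis "true slope = 0":
#   t = slope / SE(slope), where SE(slope) = s / sqrt(Sxx)
#   and s^2 = (sum of squared residuals) / (n - 2).
# Data are illustrative; the p-value would come from a t-distribution
# with n - 2 degrees of freedom.

def slope_t_statistic(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    se_slope = math.sqrt(ss_res / (n - 2)) / math.sqrt(sxx)
    return slope / se_slope

xs = [1, 2, 3, 4, 5]
ys = [1.9, 4.2, 5.8, 8.1, 9.9]
t = slope_t_statistic(xs, ys)
print(t)  # a large |t| is strong evidence against a flat true slope
```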

How Outliers Can Distort Results

Simple linear regression is sensitive to extreme values. A single unusual data point can pull the entire line toward it, shifting both the slope and intercept. Two types of extreme points deserve attention. An outlier has an unusual Y value: it falls far above or below where the line predicts. A high-leverage point has an unusual X value, sitting far to the left or right of the rest of the data. High-leverage points are particularly dangerous because they have outsized influence on where the line tilts. One data point at the far edge of your X range can single-handedly change the slope.
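A small demonstration of leverage, with data invented for effect: fitting the same clean points with and without one extreme-X observation whose Y value breaks the pattern.

```python
# How one high-leverage point can tilt the line: fit the same data
# with and without an extreme-X observation. Data are illustrative.

def fit_slope(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx

xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.0, 6.0, 8.0, 10.0]          # points lie exactly on Y = 2X
slope_clean = fit_slope(xs, ys)

# Add one point far to the right whose Y value doesn't follow the trend.
slope_lev = fit_slope(xs + [20], ys + [10.0])

print(slope_clean, slope_lev)  # slope collapses from 2.0 to about 0.31
```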

Checking for influential points is a routine part of any regression analysis. If removing one observation dramatically changes your results, that observation is driving the conclusion, and you need to understand why before trusting the model.

Real-World Applications

Simple linear regression shows up anywhere someone wants to quantify a one-to-one relationship. In medical research, it’s used to assess connections like drug concentration in blood versus drug concentration in exhaled breath. One study on propofol in a rat model found that exhaled concentrations increased by an average of 4.6 units for each 1-unit increase in plasma concentration, a relationship estimated through simple linear regression. Clinicians also use it to explore questions like whether a treatment group differs from a control group on a measurable outcome such as blood pressure.

Outside medicine, the applications are everywhere: predicting home prices from square footage, estimating crop yield from rainfall, projecting sales from advertising spend. The method is most useful when the relationship is genuinely linear, the dataset is reasonably clean, and you’re working with a single predictor variable.

Simple vs. Multiple Linear Regression

The word “simple” in simple linear regression means one independent variable. When you add two or more predictors, the method becomes multiple linear regression. Instead of fitting a line in two dimensions, you’re fitting a surface (or hyperplane) in higher-dimensional space, but the core logic of minimizing squared errors stays the same.
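A minimal sketch of that extension using NumPy’s least-squares solver (assuming NumPy is available; the data are generated from known coefficients purely for illustration):

```python
import numpy as np

# Multiple linear regression with two predictors: the same
# least-squares idea, solved over a design matrix [1, x1, x2].
# Illustrative data, generated to follow y = 1 + 2*x1 + 3*x2 exactly.

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y = 1.0 + 2.0 * x1 + 3.0 * x2

# Column of ones gives the intercept; each predictor gets its own column.
X = np.column_stack([np.ones_like(x1), x1, x2])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coefs)  # close to the generating coefficients [1, 2, 3]
```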

Most real-world outcomes are shaped by many factors simultaneously. A patient’s blood pressure depends on age, weight, medication, diet, and genetics, not just one of those. Multiple regression captures these combined effects and often produces more accurate predictions. Simple linear regression is best understood as a building block: it teaches the fundamental mechanics that extend naturally to more complex models, and it remains the right tool when you genuinely have a single predictor or want to isolate one relationship for clarity.