A linear model is an equation that describes the relationship between two or more quantities using a straight line. It assumes that when one variable changes, another changes at a constant rate. If you’ve ever seen the equation y = mx + b in a math class, you’ve already encountered the simplest form of a linear model. These models are among the most widely used tools in statistics, data science, and research because they’re straightforward to build, easy to interpret, and surprisingly powerful across a wide range of problems.
The Basic Equation
Every linear model has the same core structure: an outcome you’re trying to predict (the dependent variable), one or more inputs that might influence it (independent variables), and a set of numbers the model learns from your data. In its simplest form, the equation looks like this:
Outcome = Weight × Input + Intercept
The “weight” (also called the slope or coefficient) tells you how much the outcome changes for every one-unit increase in the input. The “intercept” is the baseline value of the outcome when the input equals zero. Take housing prices as an example: if you’re estimating a home’s price based on its square footage, the weight tells you how many additional dollars each extra square foot adds, and the intercept represents a starting price before square footage factors in.
The model figures out the best values for the weight and intercept by looking at your data. Once those values are locked in, you can plug in new inputs to make predictions.
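The prediction step can be sketched in a few lines of Python. The specific weight and intercept below are assumptions for illustration, not values fit to real data:

```python
# Hypothetical fitted values for a housing model:
# each square foot adds $150, with a $50,000 baseline.
weight = 150.0        # dollars per square foot (illustrative assumption)
intercept = 50_000.0  # baseline price (illustrative assumption)

def predict_price(sqft):
    """Outcome = Weight x Input + Intercept."""
    return weight * sqft + intercept

print(predict_price(2000))  # 150 * 2000 + 50000 = 350000.0
```

Once a model is fit, prediction is nothing more than this arithmetic: multiply, then add.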
Simple vs. Multiple Linear Models
When a model uses just one input to predict an outcome, it’s called simple linear regression. The housing price example above, using only square footage, is a simple linear model. But most real situations involve more than one factor. A home’s price depends on square footage, number of bedrooms, neighborhood, age of the building, and more.
When you include two or more inputs, you have a multiple linear model. The equation extends naturally: Outcome = Weight₁ × Input₁ + Weight₂ × Input₂ + … + Intercept. Each weight tells you the effect of that particular input while holding the others constant. So you can ask, “How much does an extra bedroom add to the price, for homes of the same size and age?” Multiple regression is far more common in practice because it captures the complexity of real-world relationships more honestly than a single-variable model can.
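The extended equation is just a longer weighted sum. A minimal sketch, with hypothetical weights chosen purely for illustration:

```python
# Hypothetical fitted weights for a multiple linear model (illustrative only).
weights = {"sqft": 140.0, "bedrooms": 12_000.0, "age_years": -1_500.0}
intercept = 40_000.0

def predict(features):
    """Outcome = Weight1 x Input1 + Weight2 x Input2 + ... + Intercept."""
    return intercept + sum(weights[name] * value
                           for name, value in features.items())

home = {"sqft": 1800, "bedrooms": 3, "age_years": 20}
print(predict(home))  # 40000 + 140*1800 + 12000*3 - 1500*20 = 298000.0
```

Note the negative weight on age: weights can push the outcome down as well as up, and each one is read holding the other inputs constant.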
How the Model Finds the Best Fit
The most common method for fitting a linear model is called ordinary least squares, or OLS. The idea is intuitive: the model draws a line (or a plane, in multiple dimensions) through your data points and measures how far each point falls from that line. Those gaps between the data and the line are called residuals. OLS finds the line that makes the total of all squared residuals as small as possible.
Why squared? Because some residuals are positive (the point is above the line) and some are negative (below it). Squaring them prevents positives and negatives from canceling each other out, and it also penalizes large errors more heavily than small ones. The result is a line that sits as close to the data as possible overall. Under the right conditions, OLS has the smallest variance among all linear unbiased estimators — the most precise estimates you can get from a linear model — a property statisticians call being the “best linear unbiased estimator.”
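For a single input, minimizing the squared residuals has a well-known closed-form answer: the slope is the covariance of x and y divided by the variance of x. A small self-contained sketch:

```python
def ols_fit(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares.

    For one input, the slope that minimizes the sum of squared
    residuals is covariance(x, y) / variance(x), and the line
    always passes through the point (mean_x, mean_y).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
            / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Points lying exactly on y = 2x + 1 recover those values.
slope, intercept = ols_fit([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)  # 2.0 1.0
```

With multiple inputs the same idea is solved with matrix algebra, but the objective — minimize the total squared residuals — is identical.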
Key Assumptions Behind Linear Models
Linear models work well when certain conditions hold in your data. Violating these assumptions doesn’t always make a model useless, but it can make the results misleading.
- Linearity: The relationship between input and outcome is actually a straight line, not a curve. If the true pattern is curved, a linear model will systematically miss it.
- Independence: Each data point is collected independently of the others. If your data points are related (like repeated measurements on the same person), the model can underestimate uncertainty.
- Constant variance: The spread of errors stays roughly the same across all values of the input. If errors fan out as the input grows (like price estimates becoming less accurate for larger homes), the model’s confidence intervals become unreliable.
- Normally distributed errors: The residuals roughly follow a bell curve. This matters most for hypothesis testing and confidence intervals. It matters less for simply making predictions with large datasets.
You can check most of these visually. A plot of residuals against predicted values should show a random cloud of points with no obvious pattern. If you see a curve, the linearity assumption is violated. If the points fan out like a megaphone, the constant-variance assumption is broken.
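The same check can be done numerically. Below, data that truly follows a curve is fit with its best straight line (for these particular points, the OLS solution works out to slope 6 and intercept −7), and the residuals trace a systematic U shape instead of a random cloud:

```python
# Data generated from a curve: y = x**2.
xs = [1, 2, 3, 4, 5]
ys = [x ** 2 for x in xs]

# The best straight-line (OLS) fit for these points is y = 6x - 7.
slope, intercept = 6.0, -7.0
res = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
print(res)  # [2.0, -1.0, -2.0, -1.0, 2.0] -- a U shape, not random noise
```

Positive at both ends and negative in the middle: exactly the curved-residual signature that signals a violated linearity assumption.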
Measuring How Well a Model Fits
The most common measure of fit is R-squared, which tells you what proportion of the variation in your outcome is explained by your inputs. An R-squared of 0.80 means your model accounts for 80% of the variation in the data, with the remaining 20% unexplained. An R-squared of 0 means the model explains nothing; an R-squared of 1 means it explains everything perfectly.
There’s no universal threshold for a “good” R-squared. In physics experiments with tightly controlled conditions, you might expect values above 0.95. In social science or healthcare research, where human behavior is inherently noisy, an R-squared of 0.30 can be genuinely useful.
When comparing models with different numbers of inputs, adjusted R-squared is more informative. Adding any new input to a model will almost always increase the raw R-squared, even if that input is irrelevant. Adjusted R-squared penalizes for extra inputs, so it only goes up if the new input genuinely improves the model. Researchers commonly use the change in R-squared between a simpler model and a more complex one to judge whether the additional inputs are worth including.
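Both measures are short formulas. A minimal sketch, using a small made-up set of outcomes and predictions:

```python
def r_squared(ys, preds):
    """Proportion of the variance in ys explained by the predictions."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))   # unexplained
    ss_tot = sum((y - mean_y) ** 2 for y in ys)             # total variation
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n, k):
    """Penalize for model size: n observations, k inputs."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

r2 = r_squared([3, 5, 7, 10], [3, 5, 7, 9])
print(round(r2, 3))                            # 0.963
print(round(adjusted_r_squared(r2, 4, 1), 3))  # 0.944 -- penalized for the input
```

The adjusted value is always at or below the raw one, and the gap widens as you pile on inputs relative to the amount of data.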
Where Linear Models Break Down
Two problems frequently undermine linear models: outliers and multicollinearity.
Outliers are data points that sit far from the rest. Because OLS minimizes squared errors, a single extreme point can pull the fitted line toward it, distorting the weights and changing your conclusions. One misrecorded data point, like a house listed at $10 million in a neighborhood of $300,000 homes, can shift the entire model. Identifying and investigating outliers before trusting your results is essential.
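The pull of a single extreme point is easy to demonstrate. Here, corrupting one of five values multiplies the fitted slope tenfold:

```python
def ols_slope(xs, ys):
    """Slope of the ordinary-least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
           / sum((x - mx) ** 2 for x in xs)

xs = [1, 2, 3, 4, 5]
print(ols_slope(xs, [2, 4, 6, 8, 10]))   # 2.0  -- clean data on y = 2x
print(ols_slope(xs, [2, 4, 6, 8, 100]))  # 20.0 -- one wild point, 10x the slope
```

Because the error is squared, the model would rather be slightly wrong about four points than badly wrong about one, so the outlier wins.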
Multicollinearity happens when two or more inputs are highly correlated with each other. If square footage and number of rooms both go into a housing model, and larger homes almost always have more rooms, the model struggles to separate their individual effects. The overall predictions may still be fine, but the individual weights become unstable and hard to interpret. When outliers and multicollinearity occur together, the individual weight estimates from standard OLS can become especially unreliable.
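To see why the weights become unidentifiable, consider the extreme case where the two inputs are perfectly redundant — rooms is exactly square footage divided by 500 in this made-up data. Then very different weight assignments produce identical predictions, so the data cannot pin down either weight:

```python
# Illustrative extreme: rooms is exactly sqft / 500, so the two inputs
# carry the same information (perfect collinearity).
homes = [(1000, 2), (1500, 3), (2000, 4), (2500, 5)]  # (sqft, rooms)

def predict(w_sqft, w_rooms, sqft, rooms):
    return w_sqft * sqft + w_rooms * rooms

# Giving all the credit to sqft, or all of it to rooms,
# yields the same prediction for every home.
for sqft, rooms in homes:
    assert predict(0.2, 0.0, sqft, rooms) == predict(0.0, 100.0, sqft, rooms)
print("different weights, identical predictions")
```

Real data is rarely this extreme, but as correlation between inputs approaches this limit, the weight estimates swing wildly with small changes in the data.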
Common Real-World Applications
Linear models show up in nearly every field that works with data. In economics, they’re used to estimate how factors like education, experience, and location affect wages. In healthcare, researchers use them to model relationships between treatment dosages and patient outcomes, or to plan hospital operations by predicting admission rates based on seasonal patterns and resource constraints.
In business, linear models drive sales forecasting, pricing strategies, and marketing spend analysis. A company might model how each additional dollar spent on advertising translates into revenue, holding other factors constant. In engineering, they help predict material stress, energy consumption, and equipment wear based on operating conditions.
The appeal in all these cases is the same: linear models produce interpretable results. Unlike more complex algorithms that act as black boxes, a linear model tells you exactly how much each input matters and in which direction. When you need to explain your reasoning to a manager, a regulator, or a patient, that transparency is enormously valuable.
When to Use Something Else
Linear models assume straight-line relationships. If the true pattern in your data is curved, threshold-based, or involves complex interactions between inputs, a linear model will underperform. For instance, the relationship between medication dose and effectiveness often follows a curve: benefits increase up to a point, then plateau or reverse. A straight line can’t capture that.
In those situations, you might turn to polynomial regression (which adds curves), decision trees, or neural networks. But linear models remain the standard starting point for almost any analysis. They’re fast, require relatively small amounts of data, and their assumptions are easy to check. Many practitioners begin with a linear model to establish a baseline, then move to more complex methods only when the data clearly demands it.
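Polynomial regression is worth a quick sketch because it shows how little actually changes: the equation is still a weighted sum, but one of the "inputs" is the square of the dose. The weights below are assumptions chosen to illustrate a rise-then-fall dose-response shape, not fitted values:

```python
# A linear model over x and x**2 can capture a curve: the equation
# stays a weighted sum, so the fitting machinery is unchanged.
def predict(dose, w1=4.0, w2=-1.0, intercept=0.0):
    """Effect = w1 * dose + w2 * dose**2 + intercept (weights assumed)."""
    return w1 * dose + w2 * dose ** 2 + intercept

for dose in [0, 1, 2, 3, 4]:
    print(dose, predict(dose))  # effect peaks at dose 2, then falls back to 0
```

Because the model is linear in its weights (not in the dose), OLS fits it exactly as before — which is why polynomial regression is usually the first stop past a straight line.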

