A linear model is a mathematical equation that describes the relationship between variables as a straight line (or flat surface, when more variables are involved). In its simplest form, it looks like this: y = mx + b, where y is the outcome you’re trying to predict, x is the input driving that outcome, m is the slope (how much y changes for each unit change in x), and b is the starting value of y when x equals zero. Linear models are the foundation of most statistical analysis and remain one of the most widely used tools in science, finance, and machine learning.
The Basic Equation
The slope-intercept form, y = mx + b, is the version most people encounter first. The variable x is called the independent variable because you choose or observe it freely. The variable y is the dependent variable because its value depends on x. The slope m tells you the rate of change: for every one-unit increase in x, y increases (or decreases) by m. The y-intercept b is simply where the line crosses the vertical axis, representing y’s value when x is zero.
Say you’re modeling how study hours affect exam scores. If m = 5 and b = 40, the model predicts that a student who doesn’t study at all scores 40, and each additional hour of studying adds 5 points. That’s the entire logic of a linear model: a constant starting point plus a fixed rate of change per input.
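The study-hours arithmetic above can be sketched in a few lines of Python (the function name and its defaults are illustrative, not from any particular library):

```python
# Slope-intercept model from the study-hours example: m = 5, b = 40.
def predict_score(hours, m=5, b=40):
    """Predicted exam score: starting value b plus m points per hour studied."""
    return m * hours + b

print(predict_score(0))  # no studying: just the intercept, 40
print(predict_score(3))  # 40 + 5*3 = 55
```

The prediction is nothing more than the fixed rate of change applied to the input, plus the constant starting point.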
Simple Versus Multiple Linear Models
When a model has one independent variable and one dependent variable, it’s called simple linear regression. It captures a single relationship, like hours studied and exam score. But most real situations involve more than one factor. A student’s exam score might also depend on sleep quality, prior GPA, and class attendance.
Multiple linear regression handles this by adding more independent variables, each with its own coefficient (its own slope). The equation becomes something like y = b + m₁x₁ + m₂x₂ + m₃x₃, where each m represents the individual contribution of its variable while holding the others constant. The coefficients act as weights, and comparing their magnitudes (once the variables are on comparable scales) shows which factors matter most. For example, an analyst modeling a company’s stock price might include the price-to-earnings ratio, dividend yield, inflation rate, and daily trading volume as separate inputs, each with its own coefficient showing how strongly it pushes the stock price up or down.
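A multiple-regression prediction is just a weighted sum. A minimal sketch, with made-up coefficients for the exam-score example (hours studied, hours slept, prior GPA):

```python
# Multiple linear regression prediction: y = b + m1*x1 + m2*x2 + m3*x3.
def predict(intercept, coefficients, inputs):
    """Intercept plus the weighted sum of the inputs."""
    return intercept + sum(m * x for m, x in zip(coefficients, inputs))

# Hypothetical fitted coefficients: 5 points per study hour,
# 2 per hour of sleep, 8 per GPA point, on top of a baseline of 20.
score = predict(20, [5.0, 2.0, 8.0], [3, 7, 3.5])
print(score)  # 20 + 15 + 14 + 28 = 77.0
```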
How the Line Gets Drawn
Fitting a linear model is not a matter of eyeballing where the line should go. The standard method is called ordinary least squares (OLS). The idea is straightforward: for every data point, measure the vertical distance between the point and the proposed line. That distance is called a residual. OLS finds the line that makes the sum of all those squared residuals as small as possible.
Squaring the residuals serves two purposes. It treats overshooting and undershooting equally (since squaring removes negative signs), and it penalizes large misses more heavily than small ones. The result is the single line that, on average, sits closest to all your data points. Under the conditions described in the next section, OLS is the best linear unbiased estimator: no other unbiased estimator built as a linear combination of the data has lower variance.
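For one predictor, minimizing the squared residuals has a well-known closed-form solution: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. A self-contained sketch on toy data:

```python
def ols_fit(xs, ys):
    """Ordinary least squares for one predictor.
    Minimizes the sum of squared residuals via the closed-form solution."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - m * mean_x   # the line passes through (mean_x, mean_y)
    return m, b

xs = [1, 2, 3, 4, 5]
ys = [45, 50, 55, 60, 65]   # exactly linear: y = 5x + 40
print(ols_fit(xs, ys))      # (5.0, 40.0)
```

On perfectly linear data every residual is zero, so OLS recovers the exact slope and intercept.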
Key Assumptions Behind the Model
Linear models work well when a few conditions hold. First, the true relationship between your variables needs to be roughly linear. If the real pattern is a curve, forcing a straight line through it will give misleading results. Second, the errors (the residuals) should be independent of each other. One data point’s error shouldn’t predict another’s, a condition that commonly fails with data collected over time, where consecutive errors tend to be correlated.
Third, the spread of errors should stay roughly constant across all values of x, a property called constant variance (homoscedasticity). If predictions become wildly less accurate at higher values of x, the model’s confidence intervals stop being trustworthy. Fourth, for certain statistical tests to work correctly, the errors should follow a bell-shaped (normal) distribution. When these assumptions hold, OLS estimates are unbiased and as precise as any linear unbiased estimator can be. When they don’t, more advanced techniques such as weighted or robust regression are needed.
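A crude way to eyeball the constant-variance assumption is to compute the residuals and compare their spread in the lower and upper halves of x. This is only a rough diagnostic sketch on toy data, with the slope and intercept assumed already fitted:

```python
# Rough constant-variance check: compare residual spread across halves of x.
def residuals(xs, ys, m, b):
    """Residual = observed y minus the line's prediction."""
    return [y - (m * x + b) for x, y in zip(xs, ys)]

def spread(vals):
    """Population standard deviation."""
    mean = sum(vals) / len(vals)
    return (sum((v - mean) ** 2 for v in vals) / len(vals)) ** 0.5

xs = [1, 2, 3, 4, 5, 6]
ys = [46, 49, 56, 59, 66, 69]          # y = 5x + 40 plus small, even noise
res = residuals(xs, ys, m=5, b=40)
half = len(res) // 2
print(spread(res[:half]), spread(res[half:]))  # similar spreads: no red flag
```

If the spread were noticeably larger in one half, that would hint at non-constant variance; in practice analysts plot residuals against fitted values rather than split the data this coarsely.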
Measuring How Well a Model Fits
The most common measure of a linear model’s quality is R-squared, also called the coefficient of determination. It represents the proportion of variation in the dependent variable that the model explains. An R-squared of 0.85 means the model accounts for 85% of the variation in your outcome, with the remaining 15% unexplained. A perfect model scores 1.0, meaning every prediction matches its observed value exactly. An R-squared of 0 means the model does no better than simply predicting the average value every time.
R-squared can even go negative in unusual cases, such as when a model is forced through the origin (the zero point) or when constraints make the fitted line perform worse than a flat horizontal line. A negative value is a clear signal that something has gone wrong with the model specification. In multiple regression, adding more variables can never decrease R-squared, even when those variables are meaningless, so analysts often report an adjusted version that penalizes unnecessary complexity.
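R-squared has a short formula: one minus the ratio of the model’s squared errors to the squared errors of always predicting the mean. A minimal sketch showing the two boundary cases described above:

```python
# R-squared: 1 - SS_res / SS_tot (residual vs. total sum of squares).
def r_squared(ys, preds):
    """Proportion of variation in ys explained by the predictions."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

ys = [45, 50, 55, 60, 65]
perfect = [45, 50, 55, 60, 65]     # every prediction exact
mean_only = [55, 55, 55, 55, 55]   # always predict the average
print(r_squared(ys, perfect))      # 1.0
print(r_squared(ys, mean_only))    # 0.0
```

Predictions worse than the mean-only baseline would make SS_res exceed SS_tot, which is exactly how a negative R-squared arises.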
Where Linear Models Struggle
Linear models assume the world behaves in straight lines, and it often doesn’t. When the true relationship between variables is curved or involves interactions that multiply rather than add, a linear model will systematically miss the pattern.
Outliers pose another problem. Because OLS minimizes squared errors, a single extreme data point can drag the entire line toward it, distorting every prediction. This effect is especially pronounced with small datasets, where one unusual observation carries more weight. Multicollinearity is a subtler issue: when two or more independent variables are highly correlated with each other, the model struggles to figure out which one is actually responsible for changes in the outcome. The coefficient estimates become unstable and can swing dramatically with small changes in the data, making the model unreliable for interpretation even if its overall predictions look fine.
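The outlier effect is easy to demonstrate. In the toy data below, turning one point into an extreme value triples the fitted slope, because OLS pays a heavy squared penalty for missing it:

```python
# Sketch of outlier sensitivity: one extreme point drags the OLS slope.
def fit_slope(xs, ys):
    """OLS slope for one predictor: covariance(x, y) / variance(x)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)

xs = [1, 2, 3, 4, 5]
clean = [5, 10, 15, 20, 25]          # exactly y = 5x: slope 5
with_outlier = [5, 10, 15, 20, 80]   # one extreme final observation
print(fit_slope(xs, clean))          # 5.0
print(fit_slope(xs, with_outlier))   # 16.0 -- dragged far from 5
```

With only five observations, a single bad point triples the slope; in a dataset of thousands, the same point would move the line far less.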
Generalized Linear Models
Standard linear models assume the outcome variable is continuous and normally distributed. But many real outcomes aren’t. Whether a patient survives surgery is a yes-or-no outcome. The number of customer complaints per day is a count that can’t go below zero. Generalized linear models (GLMs) extend the linear framework to handle these situations by adding a “link function” that transforms the outcome so the linear equation still applies.
For yes-or-no outcomes, the logit link converts probabilities (which are trapped between 0 and 1) into values that can range from negative infinity to positive infinity, making them compatible with a linear equation. This is called logistic regression. For count data, a log link ensures predictions stay positive. The core mechanics remain the same: coefficients, independent variables, and a weighted sum. The link function simply bends the output into the right shape for the type of data you’re modeling.
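The logit link and its inverse (the logistic function) can be written directly. The fitted coefficients below are hypothetical, chosen only to show the shape of a logistic-regression prediction:

```python
import math

def logit(p):
    """Map a probability in (0, 1) onto the whole real line."""
    return math.log(p / (1 - p))

def inv_logit(z):
    """Inverse link (logistic function): map a linear score to (0, 1)."""
    return 1 / (1 + math.exp(-z))

# Hypothetical logistic-regression model for passing an exam:
# intercept b = -1.0, slope m = 0.5 per hour studied.
def pass_probability(hours, m=0.5, b=-1.0):
    return inv_logit(b + m * hours)

print(round(pass_probability(0), 3))  # ~0.269 with no studying
print(round(pass_probability(6), 3))  # ~0.881 after six hours
```

The linear part (b + m·hours) is unchanged from ordinary regression; only the final transformation differs, bending an unbounded score into a valid probability.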
Why Linear Models Still Matter
In an era of complex machine learning algorithms, linear models remain the standard starting point for most prediction tasks. They serve as baseline models in research and industry because they’re fast to compute, easy to interpret, and surprisingly hard to beat on many problems. When a more complex model like a neural network only slightly outperforms a linear model, the simpler option is often preferred because you can clearly see what each variable contributes and explain the result to someone who isn’t a data scientist.
In medical research, linear regression is used to isolate the effect of a treatment while controlling for confounding factors like age or sex. In finance, it quantifies how different economic indicators relate to asset prices. In any observational study where you need to separate the influence of multiple overlapping factors, linear models provide a transparent, well-understood framework for doing so. Their interpretability and low computational cost make them not just a historical artifact but an active, practical tool across nearly every field that works with data.

