AIC and BIC in Regression: Formulas and Differences

AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) are scoring methods that help you choose between regression models by balancing how well each model fits the data against how complex it is. Both produce a single number for each candidate model, and in both cases, a lower value indicates a better model. The key difference: BIC penalizes complexity more heavily, especially as your dataset grows, so it tends to prefer simpler models than AIC does.

What Problem They Solve

When you add more variables to a regression, the fit always improves, at least slightly. A model with 20 predictors will fit your training data better than one with 3 predictors, but it may be capturing noise rather than real patterns. This is overfitting. AIC and BIC both address this by subtracting a penalty for every parameter you add. The result is a score that rewards good fit but punishes unnecessary complexity, giving you a principled way to compare models without just chasing the highest R-squared.

How Each Score Is Calculated

Both criteria start with the same ingredient: the log-likelihood, which measures how well the model explains the observed data. A higher log-likelihood means a better fit. The formulas then subtract a penalty that grows with the number of parameters.

AIC uses a simple, fixed penalty: 2 times the number of parameters. So AIC = 2k − 2(log-likelihood), where k is the number of estimated parameters in the model (including the intercept and any variance terms).

BIC replaces that fixed “2” with log(n), where n is your sample size. So BIC = k × log(n) − 2(log-likelihood). Because log(n) exceeds 2 for any dataset with 8 or more observations, BIC’s penalty per parameter is larger in nearly every real-world scenario and grows as your sample size increases.
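As a concrete sketch of both formulas (the helper names here are invented and the data are simulated), AIC and BIC can be computed by hand from an OLS fit's Gaussian log-likelihood:

```python
import numpy as np

def gaussian_loglik(y, y_hat):
    """Log-likelihood of an OLS fit under Gaussian errors, at the MLE of the variance."""
    n = len(y)
    sigma2 = np.sum((y - y_hat) ** 2) / n  # maximum-likelihood variance estimate
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

def aic_bic(y, y_hat, n_coeffs):
    """n_coeffs counts regression coefficients (incl. intercept); +1 adds the variance term."""
    k = n_coeffs + 1
    n = len(y)
    ll = gaussian_loglik(y, y_hat)
    return 2 * k - 2 * ll, k * np.log(n) - 2 * ll

# Simulated data: y depends on x1; x2 is pure noise.
rng = np.random.default_rng(42)
n = 100
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x1])       # intercept + x1
X_full = np.column_stack([np.ones(n), x1, x2])    # intercept + x1 + x2
b_small, *_ = np.linalg.lstsq(X_small, y, rcond=None)
b_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

aic_small, bic_small = aic_bic(y, X_small @ b_small, 2)
aic_full, bic_full = aic_bic(y, X_full @ b_full, 3)
print(f"x1 only: AIC={aic_small:.1f}  BIC={bic_small:.1f}")
print(f"x1 + x2: AIC={aic_full:.1f}  BIC={bic_full:.1f}")
```

Note that BIC exceeds AIC for each model here, since log(100) ≈ 4.6 is larger than 2, and the gap between the two criteria grows by log(n) − 2 for each extra parameter.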

Why the Penalty Difference Matters

That single difference in the penalty term has a big practical effect. With a dataset of 100 observations, log(100) is about 4.6, so BIC charges each additional parameter more than twice what AIC does. At 1,000 observations, log(1,000) is about 6.9. This means BIC becomes increasingly reluctant to add variables as your dataset grows, while AIC’s penalty stays the same regardless of sample size.
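These per-parameter penalties are easy to check directly, using only the standard library:

```python
import math

# AIC charges a flat 2 per parameter; BIC charges log(n), which grows with n.
for n in (100, 1_000, 100_000):
    print(f"n = {n:>7}: AIC penalty/parameter = 2, BIC penalty/parameter = {math.log(n):.2f}")
```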

The consequence: AIC will generally keep more variables in the model. BIC will lean toward a more parsimonious model with fewer predictors. Neither is universally “better.” They’re answering slightly different questions.

Different Goals Behind Each Criterion

AIC comes from information theory. Its goal is to find the model that best approximates the real-world process generating your data, even if that true process is infinitely complex. Technically, AIC selects the model with the smallest expected information loss (measured by something called Kullback-Leibler divergence) relative to reality. It doesn’t assume the “true model” is among your candidates. It just wants the closest approximation, which makes it optimized for prediction.

BIC comes from Bayesian probability. It approximates the probability that a given model actually generated your data. If the true model is one of the candidates you’re comparing, BIC will identify it with near certainty as the sample size grows. This makes BIC better suited for situations where you believe a relatively simple, “true” model exists and you want to identify it.

In plain terms: use AIC when your priority is building a model that predicts well on new data. Use BIC when your priority is identifying which variables genuinely matter.

How to Compare Models Using These Scores

You can’t interpret a single AIC or BIC value in isolation. The number only means something when compared to the scores of competing models fit on the same dataset. The model with the lowest AIC (or lowest BIC) is the preferred one.

For AIC, the difference between two models’ scores (called delta AIC) tells you how much support the worse model has relative to the better one. A delta AIC under 2 suggests both models have roughly comparable support. A delta of 4 to 7 means the worse model has considerably less support. A delta above 10 means you can essentially dismiss the worse model.
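As an illustration, take three hypothetical candidate models with made-up AIC scores (the numbers below are invented for this sketch). Computing delta AIC against the best score applies the bands above; the guideline leaves gaps between its bands (2 to 4 and 7 to 10), which are folded into the middle category here for simplicity:

```python
# Hypothetical AIC scores for three candidate models (illustrative only).
aics = {"3 predictors": 412.4, "5 predictors": 413.9, "8 predictors": 424.8}

best = min(aics.values())
for name, score in sorted(aics.items(), key=lambda kv: kv[1]):
    delta = score - best
    if delta < 2:
        verdict = "comparable support to the best model"
    elif delta > 10:
        verdict = "essentially no support"
    else:
        verdict = "considerably less support"
    print(f"{name}: delta AIC = {delta:.1f} ({verdict})")
```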

For BIC, a widely used grading scale from statisticians Kass and Raftery interprets the difference in BIC values (which approximates twice the natural log of the Bayes factor between the two models) as categories of evidence. A difference of 0 to 2 is barely worth mentioning. A difference of 2 to 6 is positive evidence. A difference of 6 to 10 is strong evidence, and anything above 10 is very strong evidence favoring the better model.

AIC and Variable Selection

Many statistical software packages use AIC or BIC as the stopping rule in stepwise regression, where variables are added or removed one at a time. The default in R’s commonly used stepAIC() function, for instance, is AIC. Switching to BIC typically produces a model with fewer variables.
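A minimal sketch of forward stepwise selection (the function names and simulated data here are invented for illustration; R users would typically reach for stepAIC() instead). Setting the penalty to 2 mimics AIC; setting it to log(n) mimics BIC:

```python
import numpy as np

def ic_score(X, y, penalty):
    """Information-criterion score of an OLS fit: penalty * k - 2 * log-likelihood."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    n = len(y)
    sigma2 = np.sum((y - X @ beta) ** 2) / n
    ll = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    return penalty * (X.shape[1] + 1) - 2 * ll  # +1 counts the variance parameter

def forward_select(candidates, y, penalty):
    """Greedily add the variable that most improves the score; stop when none does."""
    n = len(y)
    chosen, current = [], np.ones((n, 1))  # start from the intercept-only model
    best = ic_score(current, y, penalty)
    while True:
        trials = {name: ic_score(np.column_stack([current, col]), y, penalty)
                  for name, col in candidates.items() if name not in chosen}
        if not trials:
            break
        name = min(trials, key=trials.get)
        if trials[name] >= best:
            break  # no remaining variable improves the score
        best = trials[name]
        chosen.append(name)
        current = np.column_stack([current, candidates[name]])
    return chosen

# Simulated data: x1 matters strongly, x2 weakly, x3 not at all.
rng = np.random.default_rng(7)
n = 120
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 2.0 * x1 + 0.2 * x2 + rng.normal(size=n)

candidates = {"x1": x1, "x2": x2, "x3": x3}
sel_aic = forward_select(candidates, y, penalty=2.0)
sel_bic = forward_select(candidates, y, penalty=np.log(n))
print("AIC-style selection:", sel_aic)
print("BIC-style selection:", sel_bic)
```

With a weak predictor like x2 in the mix, the AIC-style run will often retain it while the BIC-style run drops it, mirroring the tendency described above.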

One detail worth knowing: a variable doesn’t need to meet the traditional p < 0.05 threshold to improve a model’s AIC. A variable with a p-value as high as 0.157 can still lower the AIC when it’s the only additional parameter. This is why AIC-selected models often include variables that wouldn’t survive a standard significance test. Whether that’s a feature or a bug depends on your goal. For prediction, those marginally useful variables can genuinely help. For explanation, they can muddy your interpretation.
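That 0.157 figure can be derived: a single extra parameter lowers AIC exactly when the likelihood-ratio statistic, 2 times the gain in log-likelihood, exceeds the penalty of 2, and for a chi-square distribution with 1 degree of freedom the tail probability beyond 2 works out as follows (a quick check using only the standard library):

```python
import math

# P(chi-square with 1 df > x) equals erfc(sqrt(x / 2)); the AIC penalty for
# one extra parameter is 2, so the implied p-value cutoff is the tail beyond 2.
p_cutoff = math.erfc(math.sqrt(2 / 2))
print(f"implied p-value threshold: {p_cutoff:.4f}")
```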

Small Sample Sizes and AICc

AIC has a known weakness with small datasets: it tends to select models with too many parameters. A corrected version called AICc adds an extra penalty that depends on both sample size and the number of parameters. The correction is largest when your number of observations is close to the number of parameters, and it fades to essentially zero when the sample size is much larger than the squared number of parameters. As a rule of thumb, if your ratio of observations to parameters is below about 40, AICc is a safer choice than standard AIC.
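The standard correction (due to Hurvich and Tsai, derived for Gaussian-error models) is AICc = AIC + 2k(k + 1)/(n − k − 1). A small sketch showing how the correction fades as n grows relative to k:

```python
def aicc(aic, k, n):
    """Small-sample corrected AIC: adds 2k(k+1)/(n - k - 1) to plain AIC."""
    return aic + 2 * k * (k + 1) / (n - k - 1)

# With k = 5 parameters, watch the extra penalty shrink as n grows.
# (Passing aic=0.0 isolates the correction term itself.)
for n in (20, 50, 200, 2000):
    print(f"n = {n:>4}: correction = {aicc(0.0, 5, n):.3f}")
```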

When They Agree and When They Don’t

For many datasets, AIC and BIC will point to the same model, especially when one candidate is clearly better than the rest. They tend to disagree in borderline cases where a moderately complex model fits only slightly better than a simpler one. In those situations, AIC will lean toward the more complex model and BIC toward the simpler one.

If you’re running both and getting different answers, that’s actually informative. It tells you the extra variables are contributing enough to improve predictions (AIC prefers them) but not enough to justify their inclusion if parsimony is the goal (BIC rejects them). Reporting both scores lets your audience see exactly where that trade-off stands, which is more honest than picking one criterion after the fact because it gave you the answer you wanted.