Balanced accuracy is a metric for evaluating classification models that gives equal weight to each class, regardless of how many examples belong to it. It’s calculated as the average recall (or detection rate) across all classes. In a two-class problem, that simplifies to the arithmetic mean of sensitivity and specificity. Scores range from 0 to 1, where 1 is perfect classification and 0.5 is equivalent to random guessing.
Why Standard Accuracy Can Mislead You
Standard accuracy measures the overall proportion of correct predictions. That sounds reasonable until your data is lopsided. Imagine a dataset where 95% of examples belong to Class A and only 5% belong to Class B. A model that blindly predicts “Class A” every single time scores 95% accuracy without learning anything useful. It has zero ability to detect Class B, yet the accuracy number looks impressive.
This happens because standard accuracy weights each class proportionally to its size. The minority class barely registers. A 2023 study in NeuroImage confirmed this pattern: as class imbalance increases, standard accuracy yields misleadingly high scores that reflect the size of the majority class rather than any genuine ability to distinguish between categories.
How Balanced Accuracy Fixes This
Balanced accuracy neutralizes the effect of class size by computing the recall for each class separately, then averaging those values. Recall for a given class is simply the fraction of that class’s examples the model identified correctly. Because every class contributes equally to the final score, a model that ignores the minority class gets penalized heavily.
For a binary problem, the formula is:
Balanced Accuracy = (Sensitivity + Specificity) / 2
Sensitivity is the recall on the positive class (how many actual positives were caught). Specificity is the recall on the negative class (how many actual negatives were correctly identified). Averaging the two means neither class can dominate the score.
Going back to the 95/5 example: a model that always predicts Class A has 100% recall on Class A but 0% recall on Class B. Its balanced accuracy is (1.0 + 0.0) / 2 = 0.5, correctly exposing it as no better than a coin flip. That’s a far more honest picture than the 95% standard accuracy suggested.
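The arithmetic above can be sketched in a few lines. This is a minimal illustration of the 95/5 scenario, with the per-class recalls hard-coded to match the text:

```python
# The 95/5 example: a model that always predicts the majority class (Class A).
recall_a = 1.0   # 95 of 95 Class A examples labelled correctly
recall_b = 0.0   # 0 of 5 Class B examples labelled correctly

# Standard accuracy weights each class by its size; balanced accuracy does not.
standard_accuracy = (95 * recall_a + 5 * recall_b) / 100
balanced_accuracy = (recall_a + recall_b) / 2

print(standard_accuracy)  # 0.95 -- flattering
print(balanced_accuracy)  # 0.5  -- no better than a coin flip
```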
Extending to Multiple Classes
The same logic scales naturally beyond two classes. For a problem with four classes, you compute the recall for each of the four classes, then take the arithmetic mean. If the per-class recall values are 0.67, 0.64, 0.77, and 0.75, the balanced accuracy is their average: about 0.71. Every class carries 25% of the weight, no matter how many examples it contains.
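The multi-class computation is just an unweighted mean of the per-class recalls. Using the four recall values from the example:

```python
# Four-class example: balanced accuracy is the plain average of per-class recalls.
per_class_recall = [0.67, 0.64, 0.77, 0.75]
balanced_accuracy = sum(per_class_recall) / len(per_class_recall)
print(round(balanced_accuracy, 2))  # 0.71
```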
This property makes balanced accuracy especially practical in real-world tasks like medical diagnosis, fraud detection, or document classification, where some categories are naturally rare but critically important to identify.
How to Interpret the Score
Balanced accuracy values fall between 0 and 1. A score of 0.5 means the model performs no better than random chance, equivalent to flipping a coin in a two-class scenario. A score of 1.0 means perfect classification across all classes. A score below 0.5 means the model is wrong more often than random guessing would be, which usually signals a systematic problem worth investigating.
One useful way to think about it: balanced accuracy tells you what standard accuracy would be if your test set were perfectly balanced, with an equal number of examples in every class. It strips away the flattering effect of class imbalance and shows you the model’s true discriminative ability.
When your data is already balanced (roughly equal class sizes), balanced accuracy and standard accuracy will produce the same number. The metric only diverges from standard accuracy when class sizes differ, which is precisely when you need it most.
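A quick sanity check on a hypothetical balanced toy set (five examples per class, invented for illustration) shows the two metrics coinciding:

```python
# Balanced toy set: two classes, five examples each, one mistake per class.
y_true = ["A"] * 5 + ["B"] * 5
y_pred = ["A", "A", "A", "A", "B",   # 4/5 of Class A correct
          "B", "B", "B", "A", "B"]   # 4/5 of Class B correct

# Standard accuracy: fraction of all predictions that are correct.
standard = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Balanced accuracy: average of the per-class recalls.
recall_a = sum(t == p for t, p in zip(y_true, y_pred) if t == "A") / 5
recall_b = sum(t == p for t, p in zip(y_true, y_pred) if t == "B") / 5
balanced = (recall_a + recall_b) / 2

print(standard, balanced)  # both 0.8
```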
Balanced Accuracy vs. Other Metrics
Balanced accuracy isn’t the only metric designed for imbalanced data, and understanding the alternatives helps you pick the right tool.
- F1 Score is the harmonic mean of precision and recall for the positive class. It’s most useful when you care specifically about one class (detecting fraud, diagnosing a disease) and want to balance catching positives against avoiding false alarms. Unlike balanced accuracy, F1 focuses on a single class rather than averaging across all of them.
- AUC-ROC (Area Under the Receiver Operating Characteristic curve) evaluates model performance across all possible decision thresholds. It’s good for comparing models but can still produce misleadingly high values with severe imbalance, since the false positive rate is diluted by the large number of negative examples.
- Matthews Correlation Coefficient (MCC) considers all four cells of the confusion matrix (true positives, true negatives, false positives, false negatives) and produces a single value from -1 to +1. It’s widely regarded as the most informative single metric for binary classification, though it’s less intuitive to interpret than balanced accuracy.
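To make the comparison concrete, here is a sketch that computes balanced accuracy, F1, and MCC by hand for the always-predict-majority model from the 95/5 example, treating the rare Class B as the positive class. The confusion-matrix counts are hypothetical but match the scenario in the text; the zero-division fallbacks follow the common convention of reporting 0 in degenerate cases:

```python
import math

# Confusion-matrix cells for the always-predict-majority model on the 95/5 set.
# Positive class = the rare Class B.
tp, fn = 0, 5     # no Class B examples caught
tn, fp = 95, 0    # every Class A example labelled correctly

balanced_acc = (tp / (tp + fn) + tn / (tn + fp)) / 2

# F1 on the positive class; precision is 0/0 here, conventionally treated as 0.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# MCC; the denominator is 0 when a predicted class is empty, conventionally 0.
denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
mcc = ((tp * tn - fp * fn) / denom) if denom else 0.0

print(balanced_acc, f1, mcc)  # 0.5 0.0 0.0
```

All three metrics flag the degenerate model, but only balanced accuracy lands at the intuitive "coin flip" value of 0.5; F1 and MCC bottom out at 0.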
Balanced accuracy’s main advantage is simplicity. It’s easy to compute, easy to explain, and directly comparable to the standard accuracy number that most people already understand. Researchers have recommended it as a default evaluation metric for machine learning applications that aim to minimize overall classification error, noting that it works identically to standard accuracy on balanced data and gracefully handles imbalance when it appears.
Where Balanced Accuracy Falls Short
Balanced accuracy treats all types of errors as equally costly. In many real scenarios, they aren’t. Missing a cancer diagnosis (a false negative) is far more consequential than a false alarm (a false positive). Balanced accuracy won’t reflect that asymmetry because it weights sensitivity and specificity equally.
It also doesn’t account for the confidence of predictions, only whether the final label was correct. Two models might have the same balanced accuracy, but one could be making borderline guesses while the other is highly confident in its predictions. Metrics computed from predicted scores rather than hard labels, like AUC-ROC, capture that distinction.
In situations where the clinical or business context demands prioritizing one type of error over another, you’ll want to look at sensitivity and specificity individually, or use a metric that lets you weight errors differently. Balanced accuracy is a summary statistic. It provides an honest overview, but it can still mask important details about where and how a model fails.
Computing It in Practice
Most machine learning libraries include balanced accuracy out of the box. In Python’s scikit-learn, the function balanced_accuracy_score(y_true, y_pred) takes your true labels and predicted labels and returns the score. The implementation works for both binary and multi-class problems using the same average-of-recalls approach.
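A minimal usage sketch, assuming scikit-learn is installed, with hypothetical labels where the model always predicts the majority class:

```python
# Assumes scikit-learn is available; both functions live in sklearn.metrics.
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Imbalanced toy labels: 9 negatives, 1 positive; model always predicts 0.
y_true = [0] * 9 + [1]
y_pred = [0] * 10

acc = accuracy_score(y_true, y_pred)
bacc = balanced_accuracy_score(y_true, y_pred)
print(acc)   # 0.9
print(bacc)  # 0.5
```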
If you’re calculating it manually from a confusion matrix, the steps are straightforward: find the recall for each class (correct predictions for that class divided by the total number of actual examples in that class), then average those recall values. For two classes, that’s sensitivity plus specificity divided by two. For more classes, sum all per-class recall values and divide by the number of classes.
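The manual steps above can be sketched directly from a confusion matrix. The 3-class counts below are hypothetical, chosen only to illustrate the procedure; the diagonal holds the correct predictions and each row sums to the true size of that class:

```python
# Confusion matrix: rows = true class, columns = predicted class.
cm = [
    [50,  5,  5],   # true class 0: 50 of 60 correct
    [ 4, 40,  6],   # true class 1: 40 of 50 correct
    [ 2,  3,  5],   # true class 2:  5 of 10 correct
]

# Per-class recall: diagonal cell divided by the row total.
recalls = [row[i] / sum(row) for i, row in enumerate(cm)]

# Balanced accuracy: the unweighted mean of those recalls.
balanced_accuracy = sum(recalls) / len(recalls)
print(recalls)
print(balanced_accuracy)
```

Note that the small class 2 (10 examples) drags the score down as much as a weak showing on either large class would, which is exactly the equal-weighting behavior described above.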

