How to Read a Confusion Matrix and Interpret Results

A confusion matrix is a table that shows how many predictions your model got right and wrong, broken down by each class. It puts actual labels on one axis and predicted labels on the other, so you can see exactly where your model succeeds and where it confuses one class for another. Once you know how to read the four cells in a binary matrix, you can extract nearly every performance metric that matters.

The Four Cells of a Binary Matrix

A binary confusion matrix is a 2×2 grid. The rows represent what the data actually is (the true labels), and the columns represent what the model predicted. This is the convention used by scikit-learn, the most common machine learning library in Python, where the entry in row i, column j counts the number of samples with true label i that were predicted as label j.

The four cells break down like this:

  • True Negative (top left): The model predicted “no” and the actual answer was “no.” Correct.
  • False Positive (top right): The model predicted “yes” but the actual answer was “no.” This is a false alarm, also called a Type I error.
  • False Negative (bottom left): The model predicted “no” but the actual answer was “yes.” This is a missed detection, also called a Type II error.
  • True Positive (bottom right): The model predicted “yes” and the actual answer was “yes.” Correct.

The two cells on the diagonal (top left and bottom right) are your correct predictions. The two off-diagonal cells are your errors. A perfect model would have zeros in both off-diagonal cells, but in practice, every model makes some of each type of mistake. The key insight is that these two types of errors have very different consequences depending on the problem. A false positive in a spam filter means a legitimate email gets sent to junk. A false negative means spam reaches your inbox. Which one matters more depends entirely on your use case.
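The layout above can be checked in a few lines of code. This is a minimal sketch using scikit-learn's confusion_matrix with made-up toy labels (not from any real model):

```python
# Extracting the four cells of a binary confusion matrix.
# Toy labels, purely illustrative.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 0, 1, 1]

# scikit-learn convention: rows are true labels, columns are predictions.
cm = confusion_matrix(y_true, y_pred)

# ravel() flattens the 2x2 grid in row order: TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # -> 3 1 1 3
```

The ravel() idiom is a common way to name the four cells once and then compute any metric from them by hand.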

Watch for Layout Differences

Not every tool arranges the matrix the same way. Scikit-learn places true labels on rows and predicted labels on columns, with the negative class (0) first and the positive class (1) second. Some textbooks and other libraries flip the axes or put the positive class first. Before reading any confusion matrix, check which axis is “actual” and which is “predicted.” Misreading this will swap your false positives and false negatives, which changes every metric you calculate from the matrix.
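One way to make the layout explicit rather than relying on defaults is scikit-learn's labels parameter, which fixes the class order along both axes. A small sketch with toy data:

```python
# Controlling axis order with the labels argument to confusion_matrix.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]

default = confusion_matrix(y_true, y_pred)                 # negative class (0) first
flipped = confusion_matrix(y_true, y_pred, labels=[1, 0])  # positive class first

print(default)  # [[1 1]
                #  [0 2]]
print(flipped)  # [[2 0]
                #  [1 1]]
```

Note how flipping the label order swaps every cell's position: what was the false-positive cell is now where the false negatives sit. Printing the matrix with an explicit label order is a cheap safeguard against the misreading described above.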

Metrics You Can Pull From the Matrix

Every common classification metric is just arithmetic on these four numbers. Once you have the counts, you can calculate anything by hand.

Accuracy is the simplest: add up the correct predictions (true positives plus true negatives) and divide by the total number of samples. It tells you the overall percentage your model got right. Accuracy is useful when your classes are roughly balanced, but it can be deeply misleading when they’re not (more on that below).

Precision answers: “Of everything the model flagged as positive, how many actually were?” The formula is true positives divided by (true positives plus false positives). High precision means few false alarms. This matters when the cost of a false positive is high, like flagging a legitimate transaction as fraud and freezing someone’s account.

Recall (also called sensitivity or true positive rate) answers: “Of all the actual positives in the data, how many did the model catch?” The formula is true positives divided by (true positives plus false negatives). High recall means few missed detections. This matters when missing a positive case is dangerous, like failing to detect a disease in a medical screening.

Specificity is recall’s mirror image for the negative class: true negatives divided by (true negatives plus false positives). It tells you the proportion of actual negatives the model correctly identified.

F1 score combines precision and recall into a single number using their harmonic mean. It reaches its best value at 1 and its worst at 0. The formula is (2 × true positives) divided by (2 × true positives + false positives + false negatives). Unlike a simple average, the harmonic mean penalizes models that do well on one metric but poorly on the other. A model with 95% precision but 10% recall will get a low F1 score, which correctly reflects that the model is not useful overall.
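All five formulas above can be computed directly from the four counts. The numbers here are made up for illustration:

```python
# Hand-computing the metrics described above from four raw counts.
tp, fp, fn, tn = 40, 10, 20, 130  # illustrative counts, not real data

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)            # also called sensitivity
specificity = tn / (tn + fp)
f1 = 2 * tp / (2 * tp + fp + fn)   # harmonic mean of precision and recall

print(accuracy)     # 0.85
print(precision)    # 0.8
print(recall)       # ~0.667
print(specificity)  # ~0.929
print(f1)           # ~0.727
```

Notice the harmonic-mean effect: precision (0.8) and recall (0.667) average arithmetically to about 0.733, but F1 lands slightly lower at 0.727, pulled toward the weaker of the two.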

Why Accuracy Alone Can Fool You

Imagine you’re building a model to detect credit card fraud, and your dataset has 990 legitimate transactions and only 10 fraudulent ones. A model that labels every single transaction as “legitimate” will be right 99% of the time. Its confusion matrix would show 990 true negatives and 10 false negatives, with zero true positives and zero false positives. The accuracy is 99%, but the model catches exactly zero fraud cases. It’s completely useless for the task it was designed for.

Now consider a second model that correctly identifies 6 of the 10 fraud cases while accidentally flagging 4 legitimate transactions. Its accuracy drops slightly to 99.2%, but it actually detects fraud. This situation has a name: the accuracy paradox. A worse model by accuracy can be a far better model in practice. The confusion matrix makes this obvious at a glance, because you can see that the first model has an empty true positive cell. Accuracy alone would have hidden the problem entirely.
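The paradox is easy to reproduce with the counts from the scenario above:

```python
# Reproducing the accuracy-paradox example with the stated counts
# (1000 transactions: 990 legitimate, 10 fraudulent).
def accuracy_and_recall(tp, fp, fn, tn):
    acc = (tp + tn) / (tp + fp + fn + tn)
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    return acc, rec

# Model 1: predicts "legitimate" for every transaction.
acc1, rec1 = accuracy_and_recall(tp=0, fp=0, fn=10, tn=990)

# Model 2: catches 6 of 10 fraud cases, with 4 false alarms.
acc2, rec2 = accuracy_and_recall(tp=6, fp=4, fn=4, tn=986)

print(acc1, rec1)  # 0.99  0.0 -- high accuracy, zero fraud caught
print(acc2, rec2)  # 0.992 0.6 -- comparable accuracy, far more useful
```

Recall makes the difference visible immediately: the first model's recall is exactly zero no matter how impressive its accuracy looks.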

This is the core reason confusion matrices exist. A single accuracy number compresses all the information about your model’s performance into one percentage, and in doing so, it can mask critical failures. The matrix preserves the full picture.

Reading a Multi-Class Matrix

When your model predicts more than two classes, the confusion matrix expands from a 2×2 grid into an n×n grid, where n is the number of classes. The principle stays the same: rows are actual labels, columns are predicted labels, and the diagonal cells are correct predictions.

A strong model will have large numbers along the diagonal and small numbers everywhere else. The off-diagonal cells tell you exactly which classes the model confuses with each other. If you’re classifying handwritten digits and the cell at row 7, column 1 has a high count, your model frequently misreads 7s as 1s. That specific insight is something no single metric can give you.

You can still calculate precision, recall, and F1 for each individual class. To find precision for class A, take the true positives for A (the diagonal cell) and divide by the total of that column (everything the model predicted as A). To find recall for class A, take the same diagonal cell and divide by the total of that row (everything that was actually A). This gives you a per-class breakdown that reveals whether your model struggles with specific categories even when its overall numbers look fine.
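The per-class recipe above (diagonal over column total for precision, diagonal over row total for recall) vectorizes naturally. A sketch with an invented 3×3 matrix:

```python
# Per-class precision and recall from a toy 3x3 confusion matrix.
import numpy as np

# rows = actual, columns = predicted; classes A, B, C (illustrative counts)
cm = np.array([
    [50,  3,  2],   # actual A
    [ 5, 40,  5],   # actual B
    [ 1,  7, 38],   # actual C
])

diagonal = np.diag(cm)
precision = diagonal / cm.sum(axis=0)  # divide by column totals (predicted as that class)
recall = diagonal / cm.sum(axis=1)     # divide by row totals (actually that class)

print(precision)  # per-class precision for A, B, C
print(recall)     # per-class recall for A, B, C
```

Here class B's precision and recall both come out to 0.8 (40/50 each way), while class A's recall (50/55) exceeds its precision (50/56) because a few B and C samples leak into its column.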

Normalizing for Easier Comparison

Raw counts in a confusion matrix can be hard to compare when classes have very different sizes. If class A has 1,000 samples and class B has 50, the raw numbers for class A will naturally be much larger. Normalization converts the counts into proportions so you can compare across classes fairly.

The most common approach is row normalization: dividing each cell by the total of its row. This turns each row into a percentage breakdown of how the actual members of that class were classified. After normalization, each row sums to 1.0 (or 100%), and the diagonal values directly show you each class’s recall rate. Scikit-learn’s confusion matrix function supports this directly with a “normalize” parameter that can normalize over rows (true labels), columns (predictions), or the entire population.

Column normalization is less common but equally valid. It divides each cell by its column total, and the diagonal values then represent precision for each class.
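Both normalization modes are one argument away in scikit-learn. A small sketch with toy labels:

```python
# Row vs. column normalization via confusion_matrix's normalize parameter.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1]

# normalize="true" divides by row totals: each row sums to 1,
# and the diagonal shows per-class recall.
by_row = confusion_matrix(y_true, y_pred, normalize="true")

# normalize="pred" divides by column totals: each column sums to 1,
# and the diagonal shows per-class precision.
by_col = confusion_matrix(y_true, y_pred, normalize="pred")

print(by_row)  # [[0.75 0.25]
               #  [0.   1.  ]]
print(by_col)
```

The parameter also accepts "all", which divides every cell by the grand total so the whole matrix sums to 1.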

Choosing the Right Metric for Your Problem

The confusion matrix gives you the raw material, but which metric you emphasize depends on what mistakes cost the most. In medical screening, missing a disease (false negative) can be life-threatening, so you’d prioritize recall. In email spam filtering, blocking a legitimate message (false positive) frustrates users, so you’d lean toward precision. In many real-world applications, you need a balance, and the F1 score provides that.

For imbalanced datasets where accuracy becomes unreliable, the Matthews correlation coefficient (MCC) offers a more robust single number. It factors in all four cells of the matrix and ranges from -1 (total disagreement) to +1 (perfect prediction), with 0 meaning the model is no better than random. A 2023 paper in BioData Mining argued that MCC is the most informative single statistic you can derive from a confusion matrix when both positive and negative predictions matter equally, because unlike F1 and accuracy, it can’t be inflated by class imbalance.
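MCC is also just arithmetic on the four cells. This sketch computes it by hand on toy labels and cross-checks against scikit-learn's matthews_corrcoef:

```python
# MCC from the four cells, checked against scikit-learn's implementation.
from math import sqrt
from sklearn.metrics import matthews_corrcoef

y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 0, 1, 0, 1, 1]

# For these labels the cell counts are tn=3, fp=1, fn=1, tp=3.
tn, fp, fn, tp = 3, 1, 1, 3

# MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
mcc_by_hand = (tp * tn - fp * fn) / sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

print(mcc_by_hand)                        # 0.5
print(matthews_corrcoef(y_true, y_pred))  # 0.5
```

Because the numerator subtracts the product of the error cells from the product of the correct cells, MCC only approaches +1 when all four cells cooperate, which is why it resists the inflation that imbalance causes in accuracy.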

No single metric tells the full story. The confusion matrix itself is always the most complete view of your model’s performance. Start there, identify where the errors cluster, then pick the metric that aligns with the real-world cost of those errors.