A precision-recall curve plots precision on the y-axis against recall on the x-axis at every possible classification threshold your model can use. The closer the curve hugs the top-right corner (high precision and high recall simultaneously), the better the model. Reading this curve well lets you understand not just overall performance but exactly where your model starts making trade-offs, and whether those trade-offs are acceptable for your specific problem.
What Precision and Recall Actually Measure
Precision answers: “Of everything the model flagged as positive, how many actually were?” It equals true positives divided by all predicted positives (true positives plus false positives). A model with high precision rarely cries wolf.
Recall answers: “Of all the actual positives out there, how many did the model catch?” It equals true positives divided by all actual positives (true positives plus false negatives). A model with high recall rarely misses a real case.
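The two definitions translate directly into code. This is a minimal sketch; the confusion counts below are made-up numbers chosen purely for illustration:

```python
def precision(tp, fp):
    # Of everything the model flagged as positive, how many actually were?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of all the actual positives, how many did the model catch?
    return tp / (tp + fn)

# Hypothetical confusion counts: 80 true positives, 20 false positives,
# 40 false negatives (real positives the model missed).
print(precision(tp=80, fp=20))  # 0.8
print(recall(tp=80, fn=40))     # 0.666...
```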
These two metrics pull against each other. Making your model more aggressive about flagging positives will catch more real cases (higher recall) but also flag more false alarms (lower precision). Making it more conservative does the opposite. The precision-recall curve maps out this entire trade-off by sweeping through every possible decision threshold.
How the Curve Is Built
Most classifiers output a probability score for each prediction rather than a simple yes or no. The decision threshold is the cutoff above which you label something positive. At a very high threshold (say 0.99), the model only flags cases it’s extremely confident about, so precision is high but recall is low because many true positives fall below that strict cutoff. At a very low threshold, the model flags almost everything, so recall climbs toward 1.0 but precision drops because you’re now sweeping in many false positives.
Each threshold produces one precision-recall pair, which becomes a single point on the curve. Connect all those points and you have the full precision-recall curve. The leftmost points represent the most conservative thresholds (high precision, low recall), and as you move right, the threshold loosens, recall increases, and precision typically drops.
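The sweep can be sketched in a few lines of NumPy. The labels and scores here are toy values chosen only to make the trade-off visible, not output from any real model:

```python
import numpy as np

def pr_curve(y_true, scores, thresholds):
    """One (threshold, precision, recall) triple per threshold,
    predicting positive whenever score >= threshold."""
    y = np.asarray(y_true)
    s = np.asarray(scores)
    points = []
    for t in thresholds:
        pred = s >= t
        tp = np.sum(pred & (y == 1))
        fp = np.sum(pred & (y == 0))
        fn = np.sum(~pred & (y == 1))
        prec = tp / (tp + fp) if (tp + fp) else 1.0  # convention when nothing is flagged
        rec = tp / (tp + fn)
        points.append((t, float(prec), float(rec)))
    return points

y = [1, 0, 1, 1, 0, 0, 1, 0]                  # toy labels
s = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]  # toy scores
for t, p, r in pr_curve(y, s, [0.25, 0.55, 0.85]):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

At the strict 0.85 cutoff only the top prediction is flagged (precision 1.0, recall 0.25); at the loose 0.25 cutoff every positive is caught (recall 1.0) but precision falls to 4/7.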
What a Good Curve Looks Like
A perfect model would hold precision at 1.0 across all recall values, forming a flat line along the top of the plot. That means it catches every positive case without a single false alarm. No real model achieves this, but the closer your curve stays to that top-right corner, the better. A strong model maintains high precision even as recall increases, producing a curve that stays elevated before eventually dropping.
A weak model’s curve drops sharply toward the bottom of the plot as recall increases. The steepness and location of that drop tell you something specific: if precision collapses early (at low recall), the model struggles to identify even the easiest positive cases without generating false alarms. If precision holds steady until recall reaches, say, 0.7 and then plummets, the model is reliable for the top-ranked predictions but falls apart when you ask it to catch the harder cases.
The No-Skill Baseline
Every precision-recall curve needs a reference point for “random guessing.” Unlike an ROC curve, whose baseline is always the diagonal from (0, 0) to (1, 1) regardless of the data, the precision-recall baseline depends on your dataset: it’s a horizontal line at the proportion of positive samples. If 10% of your samples are positive, the baseline sits at 0.1; if 50% are positive, it sits at 0.5.
This matters because a model that simply labels everything as positive would achieve recall of 1.0 but a precision equal to that class proportion. Any useful model needs a curve that sits well above this baseline. In a dataset where only 1 in 1,000 samples is positive, even a curve that looks visually low might dramatically outperform random chance.
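A tiny sketch makes the baseline concrete, using made-up data that is 10% positive:

```python
# A "classifier" that flags every sample as positive.
labels = [1] * 10 + [0] * 90           # 10% positive class, by construction

true_positives = sum(labels)           # it catches every real positive...
flagged = len(labels)                  # ...because it flags everything

precision = true_positives / flagged       # 0.1 -- exactly the class proportion
recall = true_positives / sum(labels)      # 1.0 -- nothing is missed
print(precision, recall)
```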
Area Under the Curve (AUC-PR)
The area under the precision-recall curve (often called AUC-PR or AUPRC) collapses the entire curve into a single number between 0 and 1. Higher is better. A perfect model scores 1.0, and a random model scores roughly equal to the positive class proportion in your dataset.
One practical way to gauge this: divide the AUC-PR by the positive class proportion. In a medical study predicting a rare complication that occurred in 0.7% of patients, a logistic regression model achieved an AUC-PR of 0.116. That’s 16.6 times better than random guessing. The raw number looks small, but relative to what random chance would produce (0.007), it represents a meaningful improvement.
You’ll sometimes see AUC-PR called “Average Precision” (AP). Average Precision calculates the area by summing up rectangular slices under the curve at each threshold, similar to how you might estimate the area under any curve using narrow rectangles. In most machine learning libraries, average precision score and AUC-PR refer to the same concept computed with slightly different numerical methods, but for practical purposes they’re interchangeable.
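The rectangular-slice definition can be sketched directly in NumPy. The labels and scores are toy values; under this step-wise sum the result should agree with what `sklearn.metrics.average_precision_score` reports on the same data:

```python
import numpy as np

def average_precision(y_true, scores):
    """Step-wise AP: precision at each rank where a positive appears,
    weighted by the recall increment that positive contributes."""
    order = np.argsort(scores)[::-1]        # rank predictions, best score first
    y = np.asarray(y_true)[order]
    tp_cum = np.cumsum(y)                   # true positives within the top k
    k = np.arange(1, len(y) + 1)
    precision_at_k = tp_cum / k
    n_pos = y.sum()
    # Each positive found adds a recall increment of 1 / n_pos.
    return float(np.sum(precision_at_k * y) / n_pos)

y = [1, 0, 1, 1, 0, 0, 1, 0]                  # toy labels
s = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]  # toy scores
print(round(average_precision(y, s), 3))      # 0.747
```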
Comparing Models on the Same Curve
When you plot multiple models on the same precision-recall graph, you can see exactly where one model outperforms another. Two models might have similar overall AUC-PR but differ in ways that matter for your use case. In one study predicting a rare pediatric complication, a logistic regression model and a gradient boosting model had nearly identical ROC scores (0.953 vs. 0.947), but the precision-recall curve revealed a meaningful difference: at lower recall values, the logistic regression model achieved noticeably higher precision. That distinction was invisible on the ROC curve.
This is a key lesson. Don’t rely solely on AUC-PR as a single number. Look at the curve itself to see which model wins in the recall range you actually care about. If your application demands catching 90% of true positives, compare the models’ precision at 0.9 recall specifically.
Why PR Curves Beat ROC Curves for Imbalanced Data
ROC curves plot recall against the false positive rate, and the false positive rate uses the number of true negatives in its denominator. When negatives vastly outnumber positives (as in fraud detection, rare disease screening, or spam filtering), even a large number of false positives barely moves the false positive rate because the denominator is so huge. The ROC curve can look excellent while the model is actually generating an unacceptable number of false alarms.
Precision-recall curves don’t use the true negative count at all. Precision only cares about what happens among the items flagged as positive, so it exposes false alarms directly. For datasets where the positive class is rare, the PR curve provides a much more honest picture of model performance. As a rule of thumb, if your positive class makes up less than 10-20% of the dataset, the PR curve is likely more informative than the ROC curve.
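Some made-up numbers show the effect. With 100,000 negatives in the denominator, 900 false alarms barely register in the false positive rate, while precision exposes them immediately:

```python
# Hypothetical imbalanced problem: 100 positives, 100,000 negatives.
# The model catches 90 real cases but raises 900 false alarms.
tp, fn = 90, 10
fp, tn = 900, 99_100

fpr = fp / (fp + tn)          # ~0.009: looks excellent on an ROC curve
precision = tp / (tp + fp)    # ~0.091: only about 1 in 11 flags is real
recall = tp / (tp + fn)       # 0.9

print(f"FPR={fpr:.4f}  precision={precision:.3f}  recall={recall:.2f}")
```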
Choosing a Threshold From the Curve
The precision-recall curve doesn’t just evaluate your model. It helps you pick an operating point. Every point on the curve corresponds to a specific threshold, and your choice depends on the cost of different errors in your domain.
If false negatives are dangerous (missing a cancer diagnosis, failing to detect fraud), you want high recall and should accept lower precision. Move right along the curve to find a threshold where recall is high enough, then check whether the resulting precision is tolerable. In the pediatric study mentioned earlier, clinicians looked at the precision-recall curve and found that achieving 90% recall (catching 9 out of 10 true cases) meant accepting a precision around 15-20%, which translated to monitoring roughly 6-7 patients for every one who truly had the complication. Whether that’s acceptable depends on what the monitoring involves.
If false positives are expensive (unnecessary surgeries, wrongful fraud blocks on legitimate transactions), you want high precision and should tolerate lower recall. Move left along the curve to a stricter threshold. For spam filtering, a higher threshold means you’ll only flag emails as spam when the model is very confident, say above 90% probability. You’ll miss some spam, but you won’t accidentally bury important emails.
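One way to operationalize the high-recall case is to sweep from the strictest threshold down and take the first one that meets a recall target, then inspect the precision it costs. A sketch on toy data (not the study's), assuming binary labels and a hypothetical helper name:

```python
import numpy as np

def strictest_threshold_for_recall(y_true, scores, target_recall):
    """Return the strictest threshold reaching the recall target,
    plus the precision and recall achieved there (sketch)."""
    y = np.asarray(y_true)
    s = np.asarray(scores)
    for t in np.unique(s)[::-1]:            # candidate thresholds, strictest first
        pred = s >= t
        tp = np.sum(pred & (y == 1))
        rec = tp / np.sum(y == 1)
        if rec >= target_recall:
            return float(t), float(tp / np.sum(pred)), float(rec)
    return None                              # target unreachable

y = [1, 0, 1, 1, 0, 0, 1, 0]
s = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
t, prec, rec = strictest_threshold_for_recall(y, s, target_recall=0.9)
print(f"threshold={t}  precision={prec:.2f}  recall={rec}")
```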
Common Interpretation Mistakes
The most frequent mistake is comparing AUC-PR values across datasets with different class balances. An AUC-PR of 0.3 on a dataset where positives represent 1% of samples is far more impressive than an AUC-PR of 0.3 where positives represent 25%. Always consider the baseline.
Another pitfall is connecting the points on the curve with straight lines. Precision can behave non-monotonically between thresholds, so linear (trapezoidal) interpolation between two points can overestimate the area underneath. Proper implementations use step-wise interpolation, as average precision does, precisely to avoid inflating the AUC-PR. If you’re computing this yourself, use an established library rather than manually connecting dots.
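A small numeric check of the overestimate, using made-up precision-recall points anchored at (recall 0, precision 1):

```python
import numpy as np

# Toy precision-recall points, recall ascending.
r = np.array([0.0, 0.25, 0.50, 0.75, 1.00])
p = np.array([1.0, 1.00, 0.66, 0.75, 0.57])

# Straight lines between points (trapezoidal rule)...
trapezoid = float(np.sum(np.diff(r) * (p[1:] + p[:-1]) / 2))
# ...versus the step-wise sum that average precision uses.
step = float(np.sum(np.diff(r) * p[1:]))

print(f"trapezoidal={trapezoid:.4f}  step-wise={step:.4f}")
```

On these points the trapezoidal estimate comes out noticeably higher than the step-wise one, which is exactly the inflation to watch for.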
Finally, don’t ignore the shape of the curve in favor of the single AUC-PR number. A model with a slightly lower AUC-PR but consistently stable precision across a wide recall range may be more useful than a model with a higher AUC-PR that achieves it by being extremely precise at very low recall and then collapsing. The curve tells a story that the summary statistic cannot.