Average precision (AP) is a single number that summarizes how well a system ranks relevant results ahead of irrelevant ones. It captures both precision (how many of your predictions are correct) and recall (how many correct items you actually found) across every point in a ranked list, then collapses that information into one score between 0 and 1. A perfect AP of 1.0 means every relevant item appeared at the top of the list with no irrelevant items mixed in. The metric is widely used in information retrieval (search engines, recommendation systems) and computer vision (object detection, image classification).
How Precision and Recall Work Together
To understand average precision, you first need precision and recall. Precision answers: “Of everything the system flagged as relevant, how much actually was?” Recall answers: “Of everything that was actually relevant, how much did the system find?” These two numbers often pull in opposite directions. A system that returns every possible result will have perfect recall but terrible precision. A system that returns only its single most confident result might have high precision but miss almost everything relevant.
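The two definitions can be made concrete with a toy example. A minimal Python sketch, where the document IDs and labels are made up for illustration:

```python
# predicted: items the system flagged as relevant
# actual:    items that are truly relevant (hypothetical ground truth)
predicted = {"doc1", "doc2", "doc3", "doc4"}
actual = {"doc1", "doc3", "doc7"}

true_positives = len(predicted & actual)      # flagged AND relevant -> 2
precision = true_positives / len(predicted)   # 2 / 4 = 0.5
recall = true_positives / len(actual)         # 2 / 3, roughly 0.67

print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Note the tension: adding more items to `predicted` can only raise recall, but each wrong addition lowers precision.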
A single precision score or a single recall score can’t tell you how good a ranking system is overall. That’s where average precision comes in. It measures precision at every point in the ranked list where a relevant item appears, then averages those values. This rewards systems that push relevant results to the top.
Calculating AP Step by Step
Imagine a search engine returns 10 results for a query, and there are 4 truly relevant documents in the entire collection. You walk down the ranked list from position 1 to position 10. Every time you hit a relevant document, you calculate precision at that position, meaning the fraction of results so far that are relevant.
Say relevant documents appear at positions 1, 3, 6, and 10. At position 1, precision is 1/1 = 1.0. At position 3, you’ve seen 3 results and 2 are relevant, so precision is 2/3 ≈ 0.67. At position 6, it’s 3/6 = 0.5. At position 10, it’s 4/10 = 0.4. Average precision is the mean of those four values: (1.0 + 0.67 + 0.5 + 0.4) / 4 ≈ 0.64.
The formal equation looks like this: AP = the sum of (precision at position k × relevance at position k) divided by the total number of relevant documents, computed across all positions in the list. The “relevance at position k” term is simply 1 if the item at that position is relevant and 0 if it isn’t, which is why you only accumulate precision values at positions where relevant items appear. If a relevant document is never retrieved at all, its contribution to the sum is zero, which pulls the AP score down.
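That formula translates directly into a few lines of Python. This is a minimal sketch, not a library implementation; the function and variable names are my own:

```python
def average_precision(relevance, total_relevant):
    """AP = sum over positions k of (precision@k * rel@k) / total_relevant.

    relevance: list of 0/1 flags for the ranked list, top result first.
    total_relevant: relevant documents in the whole collection. This can
    exceed the number retrieved, which penalizes documents never returned.
    """
    hits = 0
    precision_sum = 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / k  # precision at this position
    return precision_sum / total_relevant

# Worked example from the text: relevant items at positions 1, 3, 6, 10.
ranked = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1]
print(round(average_precision(ranked, total_relevant=4), 2))  # 0.64
```

Passing `total_relevant=5` for the same list would drop the score to about 0.51, which is the penalty for a fifth relevant document that was never retrieved.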
Why AP Beats a Single-Point Metric
You could evaluate a system by picking one threshold and measuring precision and recall there, the way the F1 score does. But that throws away information about how the system performs across the full range of its predictions. Two systems might have the same F1 at a particular cutoff yet produce very different rankings overall.
Average precision avoids this problem because it’s equivalent to measuring the area under the precision-recall curve. That curve plots precision on the vertical axis and recall on the horizontal axis as you move down the ranked list. A system that maintains high precision even as recall grows will have a curve that stays near the top of the plot, producing a large area underneath and a high AP score. A system whose precision drops quickly will have a curve that sags toward the bottom, producing a low AP.
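The area-under-the-curve view can be checked numerically: each relevant hit advances recall by 1/R (where R is the total number of relevant documents), so summing precision × recall-step rectangles reproduces the same number as averaging precision at the hit positions. A short sketch using the worked example from earlier:

```python
# Sketch: AP as the area under a step-wise precision-recall curve.
ranked = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1]  # relevant at positions 1, 3, 6, 10
total_relevant = 4

hits, area = 0, 0.0
for k, rel in enumerate(ranked, start=1):
    if rel:
        hits += 1
        precision = hits / k               # height of the rectangle
        recall_step = 1 / total_relevant   # each hit moves recall up by 1/R
        area += precision * recall_step    # accumulate area under the curve
print(round(area, 2))  # 0.64 -- matches the position-averaging calculation
```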
AP in Object Detection
In computer vision, average precision is the standard way to evaluate object detection models. The setup is slightly different from search: instead of documents, the system produces bounding boxes around objects in images, each with a confidence score. Those predictions are ranked by confidence, and each one is checked against the ground-truth boxes to see if it’s a true positive or a false positive.
The check uses a metric called Intersection over Union (IoU): the area where the predicted box and the correct box overlap, divided by the area they cover together. A common threshold is 0.5, meaning the overlap must make up at least half of the combined area for the prediction to count as correct. Anything below that threshold counts as a false positive. Once every prediction is labeled as true or false positive, you build a precision-recall curve and compute the area underneath it, exactly like the retrieval case.
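IoU for axis-aligned boxes is a short computation. A minimal sketch, with box coordinates given as (x1, y1, x2, y2) and the example boxes invented for illustration:

```python
def iou(box_a, box_b):
    """Intersection over Union for axis-aligned boxes (x1, y1, x2, y2)."""
    # Corners of the overlap rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # max(0, ...) handles boxes that do not overlap at all.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

pred = (2, 2, 8, 8)     # hypothetical predicted box
truth = (4, 4, 10, 10)  # hypothetical ground-truth box
# Intersection 16, union 56, IoU roughly 0.29 -> below 0.5, a false positive.
print(iou(pred, truth) >= 0.5)  # False
```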
From AP to Mean Average Precision
Average precision applies to a single class or a single query. In practice, you usually care about performance across many classes or many queries. Mean average precision (mAP) is simply the average of the AP scores computed for each class or query. If an object detection model recognizes 10 categories and achieves AP scores of 0.9, 0.85, 0.7, and so on for each one, the mAP is the mean of all 10 values.
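The aggregation itself is a plain arithmetic mean. A one-step sketch, with hypothetical per-class scores filled in for the ten categories:

```python
# Hypothetical AP scores for a 10-class detector (first three from the text).
ap_per_class = [0.9, 0.85, 0.7, 0.65, 0.6, 0.55, 0.5, 0.45, 0.4, 0.3]
map_score = sum(ap_per_class) / len(ap_per_class)
print(round(map_score, 2))  # 0.59
```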
Different benchmarks define mAP slightly differently. The COCO benchmark, one of the most widely used in object detection, averages AP across 80 object classes and 10 IoU thresholds ranging from 0.5 to 0.95 in steps of 0.05. This is stricter than using a single 0.5 threshold because the model must produce tightly fitting boxes to score well at higher IoU levels. When you see mAP scores reported in research papers, it’s worth checking which IoU thresholds and which dataset definition they’re using, since the numbers aren’t directly comparable across different evaluation protocols.
What AP Scores Actually Tell You
AP scores range from 0 to 1 (sometimes reported as 0 to 100 as a percentage). There’s no universal “good” or “bad” threshold because difficulty varies enormously across tasks. An AP of 0.60 on a dataset with 80 diverse object categories and strict IoU requirements might represent a strong model, while the same score on a simple two-class retrieval task could signal a problem.
What AP reliably tells you is how two systems compare on the same task. If model A scores AP 0.72 and model B scores AP 0.68 on the same dataset with the same evaluation rules, model A produces better rankings. The same-sized gap also means different things at different points on the scale: improving from 0.50 to 0.55 is often easier than improving from 0.85 to 0.90, because squeezing out gains at the top of the scale requires near-perfect ordering of results.
One subtlety to keep in mind: AP weights early mistakes more heavily than late ones. If your system puts an irrelevant result at position 1, it drags down the precision calculation for every relevant item that follows. This is by design. In search and detection, getting the top results right matters more than perfecting the tail of the list, and AP reflects that priority.
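That asymmetry is easy to demonstrate: take two rankings with identical contents (four relevant items and one irrelevant item) and move only the irrelevant item. A minimal sketch:

```python
def average_precision(relevance, total_relevant):
    """Sum of precision at each relevant position, over total relevant."""
    hits, total = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / k
    return total / total_relevant

mistake_on_top = [0, 1, 1, 1, 1]  # irrelevant result at position 1
mistake_at_end = [1, 1, 1, 1, 0]  # same mistake pushed to the bottom

print(round(average_precision(mistake_on_top, 4), 2))  # 0.68
print(round(average_precision(mistake_at_end, 4), 2))  # 1.0
```

The early mistake lowers precision at every relevant position that follows it, while the late mistake costs nothing because no relevant item sits below it.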