What Is a Good Score on a Diagnostic Test?

A good score on a diagnostic test depends on what the test is measuring and what it’s being used for, but there are well-established benchmarks. For a test’s overall discriminating power (summarized by a metric called AUC), a score above 0.85 is considered high, 0.75 to 0.85 is moderate, and below 0.75 is low. For individual metrics like sensitivity and specificity, screening tests generally aim for 80% or higher, though many clinical situations demand 95% or above. The catch is that no single number tells the whole story.

The Core Metrics That Define Test Performance

Diagnostic tests are judged on several different scores, each capturing a different slice of performance. The two most fundamental are sensitivity and specificity.

Sensitivity measures how well a test catches people who actually have the condition. A test with 90% sensitivity will correctly identify 90 out of 100 people who are sick, missing 10. Specificity measures the opposite side: how well the test correctly clears people who don’t have the condition. A test with 95% specificity will correctly give a negative result to 95 out of 100 healthy people, while 5 will get a false alarm.

Overall diagnostic accuracy combines both into a single percentage: the proportion of all results (positive and negative) that were correct. This gives you a bird’s-eye view but can be misleading on its own, especially when the condition being tested for is rare.
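
To make the arithmetic concrete, here is a minimal Python sketch that computes all three numbers from a hypothetical study; the counts below are invented for illustration:

    # Invented study counts: 1,000 people tested, 100 of whom actually have the condition.
    true_positives = 90     # sick people the test correctly flagged
    false_negatives = 10    # sick people the test missed
    true_negatives = 855    # healthy people correctly cleared
    false_positives = 45    # healthy people given a false alarm

    sensitivity = true_positives / (true_positives + false_negatives)   # 90 / 100  = 0.90
    specificity = true_negatives / (true_negatives + false_positives)   # 855 / 900 = 0.95
    accuracy = (true_positives + true_negatives) / (
        true_positives + false_negatives + true_negatives + false_positives
    )                                                                    # 945 / 1,000 = 0.945

    print(f"Sensitivity {sensitivity:.0%}, specificity {specificity:.0%}, accuracy {accuracy:.1%}")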

What Counts as “Good” Depends on the Test’s Purpose

A screening test and a confirmatory test have very different jobs, and what qualifies as a good score shifts accordingly. Screening tests are meant to cast a wide net, so high sensitivity is the priority. You’d rather flag a few healthy people for follow-up than miss someone who’s actually sick. For this reason, screening tests with sensitivity in the 80% to nearly 100% range are considered effective, even if specificity is somewhat lower.

Confirmatory tests flip the emphasis. When a doctor needs to nail down a diagnosis before starting treatment, high specificity matters more. A false positive at this stage could mean unnecessary surgery, toxic medication, or serious psychological harm. In these cases, specificity above 95% is typically the target.

Some tests need both. A rule-out test, designed to safely exclude a dangerous condition like a heart attack, needs a negative predictive value (the chance that a negative result is truly negative) of at least 98%. If the condition is rarer in the population being tested, that threshold climbs even higher, because at low prevalence even an uninformative test clears most people correctly just by chance, so the bar for adding real value rises.

AUC: The Single Best Summary Score

When researchers want one number to summarize a test’s overall discriminating power, they use the area under the curve, or AUC. This score comes from plotting sensitivity against 1 minus specificity (the false positive rate) at every possible cutoff point, creating what’s called a receiver operating characteristic (ROC) curve. The AUC ranges from 0.5 (no better than a coin flip) to 1.0 (perfect).
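
As a rough sketch of how this is computed in practice, the snippet below uses scikit-learn’s roc_curve and roc_auc_score (one common tool for the job; the labels and test scores here are invented for illustration):

    from sklearn.metrics import roc_curve, roc_auc_score

    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # 1 = has the condition, 0 = does not
    y_score = [0.92, 0.81, 0.74, 0.38, 0.66, 0.30, 0.22, 0.18, 0.10, 0.05]   # the test's raw output

    # tpr is sensitivity, fpr is 1 - specificity; one point per possible cutoff.
    fpr, tpr, thresholds = roc_curve(y_true, y_score)

    auc = roc_auc_score(y_true, y_score)   # area under that curve, 0.5 to 1.0
    print(f"AUC = {auc:.2f}")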

The standard classification breaks down like this:

  • Above 0.85: High accuracy
  • 0.75 to 0.85: Moderate accuracy
  • Below 0.75: Low accuracy

An AUC of 0.85 or higher is generally what clinicians and test developers aim for. But context still matters. A test with an AUC of 0.78 might be perfectly acceptable for an initial screen if better options don’t exist, while an AUC of 0.88 might be inadequate for a high-stakes surgical decision.

Likelihood Ratios Add Another Layer

Likelihood ratios tell you how much a test result should shift your confidence about whether someone has a condition. A positive likelihood ratio (LR+) above 10 is considered very useful for ruling a condition in, meaning a positive result multiplies the odds of the diagnosis by roughly 10 or more. A negative likelihood ratio (LR-) below 0.1 is very useful for ruling a condition out, meaning a negative result cuts those odds to roughly a tenth or less.
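
Both ratios come directly from sensitivity and specificity, and they act on odds rather than raw probabilities. The sketch below, with assumed numbers (90% sensitivity, 95% specificity, and a 20% pre-test probability), shows the calculation:

    sensitivity = 0.90
    specificity = 0.95

    lr_positive = sensitivity / (1 - specificity)    # 18.0  -> well above 10, strong for ruling in
    lr_negative = (1 - sensitivity) / specificity    # ~0.11 -> close to, but not below, 0.1

    # Likelihood ratios act on odds: post-test odds = pre-test odds x LR.
    pretest_probability = 0.20
    pretest_odds = pretest_probability / (1 - pretest_probability)   # 0.25

    posttest_odds = pretest_odds * lr_positive                   # 4.5 after a positive result
    posttest_probability = posttest_odds / (1 + posttest_odds)   # ~0.82

    print(f"LR+ = {lr_positive:.1f}, LR- = {lr_negative:.2f}")
    print(f"Probability after a positive result: {posttest_probability:.0%}")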

Tests that hit both of those thresholds simultaneously have excellent discriminating ability. Many real-world tests fall somewhere in between, which is why doctors often use them in combination rather than relying on a single result.

How Disease Prevalence Changes What “Good” Means

Here’s something that surprises most people: the same test with the same sensitivity and specificity will be more or less reliable depending on how common the condition is in the population being tested. This happens because prevalence directly affects predictive values.

When a disease is common, a positive result is more likely to be a true positive. As prevalence drops, false positives start to outnumber true positives, even with a highly specific test. A test with 95% sensitivity and 95% specificity sounds excellent, but if only 1 in 1,000 people in the tested population actually has the disease, the vast majority of positive results will be false alarms.
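
A short Bayes’ rule calculation, using the exact numbers from that example, makes the size of the effect easy to see:

    # 95% sensitivity, 95% specificity, 1-in-1,000 prevalence.
    sensitivity = 0.95
    specificity = 0.95
    prevalence = 0.001

    # Bayes' rule: chance a positive result is real (positive predictive value).
    ppv = (sensitivity * prevalence) / (
        sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    )

    # Chance a negative result is real (negative predictive value).
    npv = (specificity * (1 - prevalence)) / (
        specificity * (1 - prevalence) + (1 - sensitivity) * prevalence
    )

    print(f"Positive predictive value: {ppv:.1%}")    # about 1.9% -- most positives are false alarms
    print(f"Negative predictive value: {npv:.3%}")    # about 99.995% -- negatives are very trustworthy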

This is why the same test can be “good” in a hospital emergency department (where the condition is common among people showing up with symptoms) and much less useful as a mass screening tool for the general population. Predictive value isn’t just about the test itself. It’s about who you’re testing.

Real-World Benchmarks

Putting numbers in context helps. Mammography, the standard breast cancer screening tool and the only one proven to reduce breast cancer deaths, has a sensitivity of about 87%. That means it misses roughly 13 out of every 100 breast cancers, and the miss rate is even higher in women with dense breast tissue. Despite this limitation, it remains the gold standard because no alternative performs better at scale.

Rapid strep tests, the kind your doctor runs in the office with results in minutes, have a sensitivity of about 86% and a specificity of 95%. In practical terms, out of 100 children who actually have strep throat, the rapid test will catch 86 and miss 14. Out of 100 children with a non-strep sore throat, 95 will be correctly cleared and 5 will be incorrectly told they have strep. Those missed cases are why doctors sometimes follow up a negative rapid test with a throat culture.

These examples illustrate an important point: widely used, well-regarded diagnostic tests often have sensitivity in the mid-80s, not the 99% range many people assume. A “good” score doesn’t mean perfect. It means the test performs well enough relative to its clinical role and the alternatives available.

Why No Universal Cutoff Exists

The FDA does not set a single minimum sensitivity or specificity that all diagnostic tests must meet. Instead, it requires that each test demonstrate clinically meaningful performance for its specific intended use, with appropriate study design and endpoints. What’s clinically meaningful for a cancer screening tool is different from what’s meaningful for a rapid flu test.

This is why asking “what is a good score” always leads back to the same set of follow-up questions: What condition is being tested for? How serious are the consequences of a missed diagnosis versus a false alarm? How common is the condition in the group being tested? And what happens next if the result is positive or negative?

A test with 80% sensitivity might be outstanding if the previous best option was 60%. A test with 95% sensitivity might be inadequate if missing even 5% of cases means preventable deaths. The numbers only become meaningful when you understand what they’re being asked to do.