The c-statistic is a number that, in practice, falls between 0.5 and 1.0 and tells you how well a prediction model distinguishes between two outcomes. A value of 0.5 means the model is no better than flipping a coin, while 1.0 means it ranks every case perfectly. You’ll see it most often in medical research, where prediction models estimate a patient’s risk of developing a disease, dying, or experiencing a complication.
What the C-Statistic Actually Measures
The “c” stands for concordance. The statistic answers a straightforward question: if you pick two people at random, one who developed the outcome (say, a heart attack) and one who didn’t, how often does the model correctly assign a higher risk to the person who actually had the event? A c-statistic of 0.75 means the model gets this ordering right 75% of the time.
To calculate it, the model looks at every possible pair of subjects where one had the outcome and the other didn’t. Each pair is labeled concordant if the model ranked them correctly (higher predicted risk for the person who had the event), discordant if it ranked them backward, or tied if both got the same predicted risk. The c-statistic is the proportion of concordant pairs, with tied pairs conventionally counted as half concordant.
For binary outcomes like “disease or no disease,” the c-statistic is mathematically identical to the area under the receiver operating characteristic curve (AUC or AUROC). These terms are interchangeable in that context, so if you’ve seen AUC reported in a study, you already know the c-statistic.
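The pairwise definition and its equivalence to the AUC can be checked directly. Below is a minimal sketch with made-up outcomes and predicted risks; the helper function `c_statistic` is my own illustrative name, and the comparison uses scikit-learn’s `roc_auc_score`, which the document mentions later.

```python
# Sketch of the pairwise definition, using synthetic data. Every
# (event, non-event) pair is concordant if the event case received the
# higher predicted risk; tied predictions count as half concordant.
from itertools import product

from sklearn.metrics import roc_auc_score

def c_statistic(y_true, y_score):
    """Proportion of event/non-event pairs ranked correctly (ties = 0.5)."""
    events = [s for y, s in zip(y_true, y_score) if y == 1]
    non_events = [s for y, s in zip(y_true, y_score) if y == 0]
    concordant = 0.0
    for e, n in product(events, non_events):
        if e > n:
            concordant += 1.0   # event case ranked above non-event case
        elif e == n:
            concordant += 0.5   # tied pair counts as half concordant
    return concordant / (len(events) * len(non_events))

y = [1, 0, 1, 0, 0, 1, 0]                 # 1 = had the event
p = [0.8, 0.3, 0.6, 0.4, 0.2, 0.5, 0.5]   # predicted risks

print(c_statistic(y, p))       # pairwise definition
print(roc_auc_score(y, p))     # same number: the AUC is the c-statistic here
```

The two printed values agree, which is exactly the binary-outcome equivalence described above.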
How to Interpret the Values
The working scale runs from 0.5 to 1.0 (a value below 0.5 would mean the model’s rankings are systematically backward, which is rarely seen in practice), and the general benchmarks are intuitive:
- 0.5: No discrimination at all. The model performs like random guessing.
- 0.6 to 0.7: Poor to modest discrimination. The model picks up some signal but misses a lot.
- 0.7 to 0.8: Acceptable discrimination. Many widely used clinical risk scores fall in this range.
- 0.8 to 0.9: Good to strong discrimination.
- Above 0.9: Excellent discrimination, though rarely achieved in real-world clinical prediction.
To put this in context, the Framingham Risk Score, one of the most established heart disease prediction tools in medicine, has a c-statistic around 0.75 for predicting cardiovascular events over 10 years. That’s solidly in the “acceptable” range, and it has been used in clinical practice for decades. Perfect prediction in biology is rare because human health involves too many unmeasured variables.
Concordance in Survival Analysis
The basic c-statistic works cleanly when the outcome is binary: something happened or it didn’t. But medical research often tracks time-to-event data, where you’re asking not just whether something happened but when. Patients drop out of studies, and some haven’t experienced the event by the time the study ends. This incomplete follow-up is called censoring, and it complicates the pairing process.
Harrell’s C-index, introduced in 1982, adapts the concordance concept for this situation. It only evaluates pairs where you can actually determine who had the event first. If one patient had a heart attack at year three and another was still event-free at year five, that’s a usable pair. But if a patient dropped out of the study at year two, you can’t compare them meaningfully to someone whose event happened at year four, because the first patient might have had an event at year three if they’d stayed in the study. Harrell’s C-index excludes these ambiguous pairs from the calculation. The interpretation stays the same: values near 0.5 mean poor discrimination, values near 1.0 mean strong discrimination.
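The pair-exclusion logic can be sketched in a few lines. This is an illustrative pure-Python version, not the optimized implementations in lifelines or the R survival package, and the function name and data are invented for the example.

```python
# Sketch of Harrell's C-index under censoring. A pair is usable only when
# we know who had the event first: the subject with the earlier follow-up
# time must have actually had an observed event (not been censored).
from itertools import combinations

def harrell_c(times, events, risks):
    """times: follow-up times; events: 1 = event observed, 0 = censored;
    risks: predicted risk scores (higher = expected earlier event)."""
    concordant, usable = 0.0, 0
    for i, j in combinations(range(len(times)), 2):
        # order the pair so subject a has the earlier follow-up time
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if times[a] == times[b] or events[a] == 0:
            continue  # ambiguous pair (censored earlier subject) -> excluded
        usable += 1
        if risks[a] > risks[b]:
            concordant += 1.0   # higher risk assigned to the earlier event
        elif risks[a] == risks[b]:
            concordant += 0.5   # tied risk scores count as half
    return concordant / usable

times  = [3, 5, 2, 4, 6]      # years of follow-up
events = [1, 0, 0, 1, 1]      # 0 = dropped out / still event-free
risks  = [0.7, 0.2, 0.5, 0.3, 0.4]
print(harrell_c(times, events, risks))  # → 0.8
```

Note that the subject censored at year 2 (third entry) is excluded from every pair in which they come first, mirroring the dropout example in the paragraph above.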
What the C-Statistic Does Not Tell You
A high c-statistic means the model ranks patients well, putting higher-risk people above lower-risk people. But ranking is not the same as accuracy. A model could correctly rank everyone while consistently overestimating or underestimating the actual probability of the event. This distinction between discrimination (ranking) and calibration (accuracy of predicted probabilities) is one of the most important concepts in prediction modeling.
Calibration measures whether a model’s predicted probabilities match observed reality. If the model says a group of patients has a 20% risk of heart attack, roughly 20 out of 100 should actually have one. A model with a c-statistic of 0.80 but poor calibration might rank people correctly while telling a patient their risk is 5% when it’s actually 15%. That matters enormously for clinical decisions.
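The split between discrimination and calibration is easy to demonstrate: any monotone distortion of the predicted probabilities leaves the ranking, and therefore the c-statistic, unchanged, while the probabilities themselves drift away from observed reality. A small synthetic sketch, using scikit-learn’s `roc_auc_score`:

```python
# Dividing every predicted probability by 3 preserves the rank order, so
# the c-statistic/AUC is identical -- but mean predicted risk no longer
# matches the observed event rate (miscalibration). Data are synthetic.
from sklearn.metrics import roc_auc_score

y = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]   # observed outcomes, 40% event rate
well_calibrated = [0.9, 0.2, 0.7, 0.3, 0.1, 0.8, 0.4, 0.2, 0.35, 0.3]
too_low = [p / 3 for p in well_calibrated]   # same order, risks understated

print(roc_auc_score(y, well_calibrated))   # discrimination
print(roc_auc_score(y, too_low))           # identical: ranking unchanged
print(sum(too_low) / len(too_low))         # mean predicted risk ~0.14
print(sum(y) / len(y))                     # observed event rate 0.4
```

The second model tells every patient their risk is roughly a third of what it should be, yet its c-statistic is exactly the same as the first model’s.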
The c-statistic also has a well-documented insensitivity to incremental improvements. Because it only cares about rank order, adding a new biomarker or risk factor to an already-good model often barely moves the number. In the Framingham study, for example, adding HDL cholesterol to a model that already included other major risk factors only improved the c-statistic from 0.74 to 0.75. That tiny change could mask a clinically meaningful improvement in how patients are classified. Metrics like the Brier score and the Net Reclassification Improvement were developed partly to capture what the c-statistic misses.
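Of the alternatives mentioned, the Brier score is the simplest to compute: it is the mean squared difference between predicted probabilities and actual outcomes, so unlike the c-statistic it penalizes miscalibrated probabilities even when the ranking is unchanged. A minimal sketch (the function name is illustrative):

```python
# Brier score: mean squared error of predicted probabilities.
# 0 is perfect; lower is better. Unlike the c-statistic, it changes
# when probabilities drift even if the rank order stays the same.
def brier_score(y_true, y_prob):
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

y = [1, 0, 1, 0]
print(brier_score(y, [0.9, 0.1, 0.8, 0.2]))  # sharp, accurate: 0.025
print(brier_score(y, [0.6, 0.4, 0.5, 0.4]))  # same ranking, worse: 0.1825
```

Both sets of predictions rank the two event cases above the two non-event cases (identical c-statistic of 1.0), yet the Brier score clearly separates them.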
How It’s Calculated in Practice
You don’t need to compute the c-statistic by hand. In R, the Cstat function in the DescTools package takes a logistic regression model (or a set of predicted probabilities with the actual outcomes) and returns the value directly. In Python, scikit-learn’s roc_auc_score function does the same thing for binary outcomes, since the AUC and c-statistic are equivalent in that setting. For survival data, the survival and survcomp packages in R compute Harrell’s C-index, and the lifelines library covers it in Python.
The output is always a single number on the 0.5 to 1.0 scale, often reported with a confidence interval. In published research, you’ll typically see something like “c-statistic 0.75 (95% CI: 0.73 to 0.77),” which tells you both the point estimate and the precision of that estimate given the sample size.
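One common way such a confidence interval is produced is the bootstrap: resample the dataset with replacement many times, recompute the c-statistic on each resample, and take percentiles. A sketch on synthetic data (the simulation setup and variable names are invented for illustration; real analyses typically use more resamples):

```python
# Percentile-bootstrap CI for the c-statistic, on simulated data where
# the event probability tracks the predicted risk.
import random

from sklearn.metrics import roc_auc_score

random.seed(0)
n = 500
risk = [random.random() for _ in range(n)]        # model's predicted risks
y = [int(random.random() < r) for r in risk]      # outcomes tied to risk

point = roc_auc_score(y, risk)                    # point estimate

boots = []
for _ in range(200):
    idx = [random.randrange(n) for _ in range(n)]  # resample with replacement
    yb = [y[i] for i in idx]
    rb = [risk[i] for i in idx]
    if len(set(yb)) < 2:
        continue                                   # AUC undefined with one class
    boots.append(roc_auc_score(yb, rb))

boots.sort()
lo = boots[int(0.025 * len(boots))]
hi = boots[int(0.975 * len(boots)) - 1]
print(f"c-statistic {point:.3f} (95% CI: {lo:.3f} to {hi:.3f})")
```

The printed line matches the reporting convention you’ll see in published studies; a larger sample gives a narrower interval.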
Why It Appears So Often in Medical Literature
The c-statistic became the default measure of model discrimination because it’s intuitive, doesn’t require choosing a specific cutoff for “high risk” versus “low risk,” and works across different types of prediction models, from traditional logistic regression to machine learning classifiers. Any method that produces a predicted probability or risk score can be evaluated with a c-statistic.
That universality is also why it’s sometimes over-relied upon. Reporting only the c-statistic gives an incomplete picture of how useful a prediction model will be in practice. The best evaluations pair it with calibration plots or other metrics that assess whether the predicted probabilities are trustworthy, not just well-ordered.

