What Is Interrater Reliability? Definition & Examples

Interrater reliability is the degree to which two or more independent raters assign the same score, diagnosis, or category when evaluating the same thing. It answers a fundamental question in research: if you swap one observer for another, do you get the same result? When interrater reliability is high, the measurement reflects something real about the subject rather than the personal quirks of whoever happened to be doing the rating.

Why It Matters

Any time a measurement depends on human judgment, there’s room for disagreement. A radiologist reading an MRI, a teacher grading an essay, a psychologist diagnosing depression, a sports scientist scoring an athlete’s movement quality: all of these involve interpretation. Interrater reliability quantifies how much that interpretation varies from person to person. If two psychiatrists evaluate the same patient and reach different diagnoses, any study built on those diagnoses is standing on shaky ground.

This concept is especially important in fields where there’s no objective “answer key.” Blood pressure has a number. But whether a tissue sample looks cancerous, whether a child’s behavior qualifies as hyperactive, or whether an interview response signals PTSD all require a human call. Interrater reliability testing reveals whether those calls are consistent enough to trust.

How It Differs From Intrarater Reliability

Interrater reliability measures agreement between different people. Intrarater reliability measures whether the same person is consistent with themselves over time. In a typical intrarater study, one rater evaluates the same set of materials twice, separated by weeks or months, and researchers check whether the scores match. Both types matter, but they answer different questions. Interrater reliability tells you the measurement is objective enough to survive a change in personnel. Intrarater reliability tells you a single rater isn’t drifting.

Cohen’s Kappa: The Most Common Measure

The most widely used statistic for interrater reliability is Cohen’s kappa. It was designed specifically to solve a problem with simple percent agreement: two raters can agree a surprising amount of the time purely by chance. If 90% of patients in a study are healthy, two raters who simply guess “healthy” at that base rate will agree about 82% of the time (0.9 × 0.9 + 0.1 × 0.1) without actually exercising any judgment. Cohen’s kappa adjusts for this by comparing the observed agreement to the agreement you’d expect from random guessing alone.
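
Formally, kappa = (p_o − p_e) / (1 − p_e), where p_o is the observed proportion of agreement and p_e is the proportion expected by chance. In the example above, p_e = 0.82, so if the two raters actually agree on 91% of cases, kappa = (0.91 − 0.82) / (1 − 0.82) = 0.50: they achieve half of the attainable agreement beyond chance.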

The result is a number that typically falls between 0 and 1. A kappa of 0 means the raters agreed no more than chance would predict. A kappa of 1 means perfect agreement. Negative values are possible and indicate the raters agreed less often than random chance, which usually signals a systematic problem with how they’re applying the rating system.
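
To make the computation concrete, here is a minimal Python sketch of Cohen’s kappa for two raters. The patient labels and ratings are hypothetical, and a production version would need to guard the degenerate case where chance agreement is exactly 1.

    from collections import Counter

    def cohens_kappa(ratings_a, ratings_b):
        """Cohen's kappa for two raters assigning categorical labels."""
        n = len(ratings_a)
        # Observed agreement: fraction of subjects the raters label identically.
        p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
        # Chance agreement: for each category, the probability that both raters
        # pick it independently, summed over all categories.
        freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
        p_e = sum((freq_a[c] / n) * (freq_b[c] / n)
                  for c in freq_a.keys() | freq_b.keys())
        return (p_o - p_e) / (1 - p_e)

    # Hypothetical: two clinicians classify ten patients.
    rater_1 = ["healthy", "sick", "healthy", "healthy", "sick",
               "healthy", "healthy", "sick", "healthy", "healthy"]
    rater_2 = ["healthy", "sick", "healthy", "sick", "sick",
               "healthy", "healthy", "sick", "healthy", "healthy"]
    print(cohens_kappa(rater_1, rater_2))  # 9/10 observed agreement, kappa ~0.78

If you already use scikit-learn, sklearn.metrics.cohen_kappa_score computes the same quantity.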

Cohen’s kappa works when exactly two raters classify subjects into categories. When you have three or more raters, you need a different version called Fleiss’ kappa, introduced in 1971 to extend the same logic to larger groups. In Fleiss’ kappa, a fixed number of raters each assign every subject to one category, and the statistic captures overall agreement across the full panel.
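
A sketch of Fleiss’ kappa in the same spirit: the input is a table with one row per subject and one column per category, where each cell counts how many raters placed that subject in that category. The counts below are hypothetical.

    def fleiss_kappa(table):
        """Fleiss' kappa from a subjects-by-categories table of rating counts.

        table[i][j] is the number of raters who assigned subject i to
        category j; every row must sum to the same number of raters.
        """
        n_subjects = len(table)
        n_raters = sum(table[0])
        # Per-subject agreement: proportion of rater pairs that agree.
        p_i = [(sum(c * c for c in row) - n_raters)
               / (n_raters * (n_raters - 1)) for row in table]
        p_bar = sum(p_i) / n_subjects
        # Chance agreement from the overall category proportions.
        total = n_subjects * n_raters
        p_j = [sum(row[j] for row in table) / total
               for j in range(len(table[0]))]
        p_e = sum(p * p for p in p_j)
        return (p_bar - p_e) / (1 - p_e)

    # Hypothetical: four raters classify five subjects into three categories.
    counts = [[4, 0, 0],
              [2, 2, 0],
              [0, 4, 0],
              [1, 2, 1],
              [0, 0, 4]]
    print(fleiss_kappa(counts))  # ~0.54

statsmodels ships an equivalent as statsmodels.stats.inter_rater.fleiss_kappa.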

The Landis and Koch Scale

The most commonly cited framework for interpreting kappa values comes from Landis and Koch (1977):

  • Below 0: Poor agreement
  • 0.00 to 0.20: Slight agreement
  • 0.21 to 0.40: Fair agreement
  • 0.41 to 0.60: Moderate agreement
  • 0.61 to 0.80: Substantial agreement
  • 0.81 to 1.00: Almost perfect agreement
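
If you want these labels programmatically, the scale reduces to a small lookup; a Python sketch:

    def landis_koch_label(kappa):
        """Map a kappa value to its Landis and Koch (1977) verbal label."""
        if kappa < 0:
            return "poor"
        thresholds = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                      (0.80, "substantial"), (1.00, "almost perfect")]
        for upper, label in thresholds:
            if kappa <= upper:
                return label
        raise ValueError("kappa cannot exceed 1")

    print(landis_koch_label(0.65))  # substantial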

These labels are convenient, but they’re not universally accepted. One criticism is that the scale allows fairly low actual agreement to be labeled “substantial.” A kappa of 0.65 sounds reassuring when you call it substantial, but it may still mean raters disagree on a meaningful number of cases. There’s also no consensus on a single minimum acceptable kappa for publication. In practice, many researchers accept kappa values that are lower than ideal, partly because the statistic doesn’t provide enough context on its own to judge whether the disagreement matters clinically.

Another limitation: kappa values can have wide confidence intervals. A reported kappa of 0.70 might have a confidence interval stretching from 0.50 to 0.90, which spans the range from moderate to almost perfect. That makes a single kappa number less definitive than it appears.
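
One way to see that instability directly is a percentile bootstrap: resample the rated subjects with replacement, recompute kappa on each resample, and read the interval off the resulting distribution. A minimal sketch, reusing the cohens_kappa function from above:

    import random

    def bootstrap_kappa_ci(ratings_a, ratings_b, n_boot=2000, alpha=0.05):
        """Percentile bootstrap confidence interval for Cohen's kappa."""
        n = len(ratings_a)
        kappas = []
        for _ in range(n_boot):
            # Resample subjects (keeping each pair of ratings together).
            idx = [random.randrange(n) for _ in range(n)]
            try:
                kappas.append(cohens_kappa([ratings_a[i] for i in idx],
                                           [ratings_b[i] for i in idx]))
            except ZeroDivisionError:
                continue  # degenerate resample: both raters used one category
        kappas.sort()
        m = len(kappas)
        return kappas[int(alpha / 2 * m)], kappas[int((1 - alpha / 2) * m) - 1]

    # With only ten subjects, the interval for the earlier example is very wide.
    print(bootstrap_kappa_ci(rater_1, rater_2))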

Intraclass Correlation Coefficient (ICC)

When raters assign scores on a continuous scale (like a pain rating from 0 to 10) rather than placing subjects into categories, the intraclass correlation coefficient is the preferred measure. The ICC is more flexible than kappa and comes in multiple forms depending on your study design. Choosing the right one requires answering four questions:

  • Are the same raters evaluating every subject?
  • Were the raters randomly selected from a larger pool or specifically chosen?
  • Are you interested in the reliability of a single rater or the average of multiple raters?
  • Do you care about absolute agreement (raters give the same number) or consistency (raters rank subjects in the same order, even if their numbers differ)?

These choices matter. Using the wrong ICC model can make reliability look better or worse than it actually is. Researchers have defined 10 distinct forms of ICC based on combinations of these factors, so if you’re reading a study that reports an ICC, it’s worth checking which version they used.
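
As one concrete instance, here is a sketch of ICC(2,1), the two-way random-effects, absolute-agreement, single-rater form, computed from the standard ANOVA mean squares. The score matrix is hypothetical; libraries such as pingouin (pingouin.intraclass_corr) report all the common forms at once.

    import numpy as np

    def icc_2_1(scores):
        """ICC(2,1): two-way random effects, absolute agreement, single rater.

        scores is an (n_subjects, n_raters) array in which every rater
        scores every subject on a continuous scale.
        """
        x = np.asarray(scores, dtype=float)
        n, k = x.shape
        grand = x.mean()
        ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()  # subjects
        ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()  # raters
        ss_error = ((x - grand) ** 2).sum() - ss_rows - ss_cols
        msr = ss_rows / (n - 1)                # between-subjects mean square
        msc = ss_cols / (k - 1)                # between-raters mean square
        mse = ss_error / ((n - 1) * (k - 1))   # residual mean square
        return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

    # Hypothetical: three raters score five subjects on a 0-10 pain scale.
    pain = [[7, 8, 7],
            [3, 2, 4],
            [5, 5, 6],
            [9, 9, 8],
            [1, 2, 2]]
    print(icc_2_1(pain))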

Real-World Examples in Psychiatry

Psychiatric diagnosis is one of the most studied applications of interrater reliability, because diagnoses depend heavily on clinical interpretation. A large systematic review and meta-analysis of psychiatric diagnoses found kappa values that generally fall in the “substantial” range but with notable variation. Psychotic disorders showed a pooled kappa of 0.70. Obsessive-compulsive disorder and eating disorders both came in at 0.73. Anxiety disorders reached 0.65, and PTSD was lower at 0.60, placing it at the border between moderate and substantial agreement.

These numbers mean that clinicians agree on most patients, but a meaningful minority of cases will receive different diagnoses depending on who conducts the evaluation. Because kappa is chance-corrected, a value of 0.60 is not a raw disagreement rate of 40%; it means raters achieve only 60% of the attainable agreement beyond chance, which still leaves a substantial share of PTSD evaluations open to diagnostic disagreement. That has real consequences for patients and for research trials that depend on accurate group assignment.

How to Improve Interrater Reliability

Low reliability isn’t a dead end. It’s a signal that the rating system, the training, or both need work. Evidence-based strategies for improving agreement follow a consistent pattern.

  • Thorough item review. Raters go through every item on the rating scale with an instructor, clarifying definitions, terminology, and scoring conventions. Misunderstanding what a category actually means is one of the most common sources of disagreement, and it’s the easiest to fix.
  • Practical co-rating exercises. Raters score the same cases simultaneously, then compare and discuss their results. A trainer observes, asks each rater to explain their reasoning, and guides the group toward shared conventions. Debating and justifying scores forces raters to make their implicit decision rules explicit.
  • Recorded examples spanning a range of skill levels or severity, so raters practice with cases that are genuinely difficult to classify, not just the obvious ones.

The goal isn’t to pressure raters into artificial agreement. It’s to ensure they share a common understanding of what each score or category means, so that legitimate differences in observation are separated from differences in interpretation of the scale itself.

Choosing the Right Statistic

The choice between kappa and ICC depends on what kind of data your raters produce. If raters assign subjects to categories (yes/no, diagnosis A vs. B vs. C), use kappa. Cohen’s kappa works for two raters; Fleiss’ kappa works for three or more. If raters assign numerical scores on a scale, use the ICC, selecting the specific model that matches your study design.

Percent agreement alone is never sufficient, because it doesn’t account for chance. Two raters flipping coins will agree about half the time. Kappa and ICC both correct for this, which is why they’re the standard in published research. Whichever statistic you use, report the confidence interval alongside the point estimate. A kappa or ICC without a confidence interval tells you less than it should about how stable the agreement really is.
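
The coin-flip claim is easy to verify, again reusing the cohens_kappa sketch from earlier:

    import random

    random.seed(0)
    # Two raters who ignore the subjects entirely and flip coins.
    flips_a = [random.choice(["heads", "tails"]) for _ in range(10_000)]
    flips_b = [random.choice(["heads", "tails"]) for _ in range(10_000)]

    agreement = sum(a == b for a, b in zip(flips_a, flips_b)) / 10_000
    print(agreement)                       # ~0.50: looks like real agreement
    print(cohens_kappa(flips_a, flips_b))  # ~0.00: kappa sees through it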