Interobserver reliability is the consistency of ratings or measurements when two or more people independently evaluate the same thing. If two radiologists look at the same scan and reach the same diagnosis, or two researchers watching a video code the same behaviors, that’s high interobserver reliability. If they frequently disagree, the measurement has a problem. It’s one of the most fundamental checks in science and medicine for ensuring that results reflect reality rather than the subjective judgment of whoever happened to be doing the rating.
Why Simple Agreement Isn’t Enough
The most intuitive way to check whether observers agree is to calculate a percentage: out of 100 cases, how many did both raters score the same way? This sounds reasonable, but it has a serious flaw. Some agreement will happen purely by chance. If two people are sorting items into just two categories (yes or no), they’d agree roughly 50% of the time even if they were guessing randomly. A 75% agreement rate sounds solid until you realize that chance alone could account for 50 of those 75 percentage points.
A real-world example from diagnostic imaging illustrates the problem clearly. In one study of ultrasound measurements in critically ill patients, observers agreed 90% of the time on one measurement, which sounds excellent. But when researchers applied a chance-corrected statistic, the result was negative 0.05, indicating agreement no better than random. The raw percentage was misleading because the category distribution made chance agreement very high to begin with. This is why researchers almost always use statistics that account for chance when reporting interobserver reliability.
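To make this concrete, here is a small Python sketch built on a purely hypothetical 2-by-2 table of counts (not the study’s actual data). Both raters call nearly every case negative, so the skewed marginals push chance agreement above 90%, and the chance-corrected statistic introduced in the next section comes out slightly negative.

```python
# Illustrative 2x2 contingency table (hypothetical counts, not the study's data).
# Rows = rater A's calls, columns = rater B's calls, for 100 cases.
#                B: negative   B: positive
# A: negative        90             5
# A: positive         5             0
table = [[90, 5],
         [5, 0]]

n = sum(sum(row) for row in table)                      # total cases: 100

# Observed agreement: proportion of cases on the diagonal.
p_observed = (table[0][0] + table[1][1]) / n            # 0.90

# Expected chance agreement from each rater's marginal proportions.
a_neg = (table[0][0] + table[0][1]) / n                 # rater A calls 95% negative
b_neg = (table[0][0] + table[1][0]) / n                 # rater B calls 95% negative
p_chance = a_neg * b_neg + (1 - a_neg) * (1 - b_neg)    # 0.905

kappa = (p_observed - p_chance) / (1 - p_chance)
print(f"observed agreement = {p_observed:.2f}, kappa = {kappa:.2f}")
# observed agreement = 0.90, kappa = -0.05
```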
Cohen’s Kappa for Two Raters
The most widely used chance-corrected statistic is Cohen’s kappa. It works by comparing the agreement that actually occurred between two raters against the agreement you’d expect from chance alone. The formula takes the observed agreement, subtracts the expected chance agreement, and divides by the maximum possible agreement beyond chance. A kappa of 1.0 means perfect agreement, 0 means the raters agreed only as often as chance would predict, and negative values mean they agreed less than chance would predict.
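The formula is short enough to write out directly. Here is a minimal from-scratch sketch; the function name and the biopsy labels are invented for illustration.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters assigning categorical labels to the same items."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)

    # Observed agreement: fraction of items where the raters chose the same category.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Expected chance agreement from each rater's marginal category frequencies.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    p_chance = sum((freq_a[c] / n) * (freq_b[c] / n)
                   for c in freq_a.keys() | freq_b.keys())

    return (p_observed - p_chance) / (1 - p_chance)

# Two raters classifying ten biopsies (hypothetical labels).
rater_1 = ["benign", "benign", "malignant", "benign", "malignant",
           "benign", "benign", "malignant", "benign", "benign"]
rater_2 = ["benign", "malignant", "malignant", "benign", "malignant",
           "benign", "benign", "benign", "benign", "benign"]

print(round(cohens_kappa(rater_1, rater_2), 2))
```

Here the raters agree on 8 of 10 cases, but because both call most biopsies benign, chance agreement is already 0.62, and kappa lands near 0.47 rather than 0.80.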
Kappa is designed for categorical judgments: yes/no, benign/malignant, stage I/II/III. For categories that have a natural order (like mild, moderate, severe), a weighted version of kappa gives partial credit for near-misses rather than treating all disagreements equally; a weighted-kappa sketch follows the scale below. The standard interpretation of kappa values, while somewhat arbitrary, generally follows this scale:
- 0.00 to 0.20: Slight agreement
- 0.21 to 0.40: Fair agreement
- 0.41 to 0.60: Moderate agreement
- 0.61 to 0.80: Substantial agreement
- 0.81 to 1.00: Almost perfect agreement
These cutoffs are guidelines, not hard rules. What counts as “acceptable” depends on the stakes. A kappa of 0.60 might be fine for a preliminary screening tool but unacceptable for a diagnostic test that determines whether someone gets surgery.
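For ordered categories, one readily available implementation is scikit-learn’s cohen_kappa_score, which supports linear and quadratic weighting. The sketch below assumes that package is installed; the severity ratings are hypothetical, and the small interpret helper simply encodes the descriptive scale above.

```python
from sklearn.metrics import cohen_kappa_score

# Two raters grading the same twelve cases on an ordered scale (hypothetical data).
severity = {"mild": 0, "moderate": 1, "severe": 2}
rater_1 = ["mild", "mild", "moderate", "severe", "moderate", "mild",
           "severe", "moderate", "mild", "moderate", "severe", "mild"]
rater_2 = ["mild", "moderate", "moderate", "severe", "severe", "mild",
           "severe", "moderate", "moderate", "moderate", "severe", "mild"]

a = [severity[r] for r in rater_1]
b = [severity[r] for r in rater_2]

unweighted = cohen_kappa_score(a, b)                      # every disagreement counts fully
weighted = cohen_kappa_score(a, b, weights="quadratic")   # near-misses get partial credit

def interpret(kappa):
    """Map a kappa value onto the conventional descriptive bands."""
    if kappa < 0:
        return "worse than chance"
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "substantial"), (1.00, "almost perfect")]
    return next(label for cutoff, label in bands if kappa <= cutoff)

print(f"unweighted = {unweighted:.2f} ({interpret(unweighted)})")
print(f"quadratic-weighted = {weighted:.2f} ({interpret(weighted)})")
```

Because every disagreement in this example is only one step apart on the scale, the quadratic-weighted kappa comes out higher than the unweighted one.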
Fleiss’ Kappa for Three or More Raters
Cohen’s kappa only works for two raters. When a study involves three or more observers, Fleiss’ kappa, introduced in 1971, extends the same logic to larger groups. Each rater assigns every subject to exactly one category, and the statistic measures how much their collective agreement exceeds chance. This comes up frequently in research where multiple coders independently classify the same set of data, or when a panel of clinicians rates a batch of cases.
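The computation is a small extension of the two-rater case. Below is a rough NumPy sketch: the input is a table with one row per subject and one column per category, each cell holding the number of raters who chose that category. The count table itself is hypothetical.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa from an (n_subjects x n_categories) table of rater counts.

    counts[i, j] is the number of raters who assigned subject i to category j;
    every row must sum to the same number of raters.
    """
    counts = np.asarray(counts, dtype=float)
    n_subjects, _ = counts.shape
    n_raters = counts.sum(axis=1)[0]

    # Per-subject agreement: proportion of rater pairs that agree on that subject.
    p_i = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()

    # Chance agreement from the overall proportion of ratings in each category.
    p_j = counts.sum(axis=0) / (n_subjects * n_raters)
    p_e = np.sum(p_j ** 2)

    return (p_bar - p_e) / (1 - p_e)

# Four raters classifying five subjects into three categories (hypothetical counts).
table = [[4, 0, 0],
         [0, 4, 0],
         [2, 2, 0],
         [1, 1, 2],
         [0, 0, 4]]

print(round(fleiss_kappa(table), 2))
```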
ICC for Continuous Measurements
When observers produce continuous numbers rather than categories (measuring a tumor’s diameter in millimeters, scoring pain on a 0-to-10 scale), kappa doesn’t apply. Instead, researchers use the intraclass correlation coefficient, or ICC. While kappa asks “did they pick the same category?”, ICC asks “how closely do these numerical ratings cluster together?”
ICC values range from 0 to 1, with higher values indicating stronger agreement (negative estimates can occur, but they indicate essentially no reliability). Different forms of the ICC capture either the consistency of ratings (do the raters rank subjects in the same order?) or absolute agreement (do they assign similar actual numbers?). The distinction matters. Two raters might consistently rank patients in the same order while one always scores two points higher than the other. That’s high consistency but poor absolute agreement, and only an absolute-agreement ICC will reveal the gap.
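The sketch below illustrates that difference using the standard single-rater, two-way ICC formulas (the consistency and absolute-agreement forms) on hypothetical pain scores where the second rater always scores two points higher than the first.

```python
import numpy as np

def icc_single(ratings):
    """Consistency and absolute-agreement ICCs (single rater, two-way model).

    ratings is an (n_subjects x k_raters) array of scores.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()

    # Two-way ANOVA sums of squares: subjects (rows), raters (columns), residual.
    ss_total = np.sum((x - grand) ** 2)
    ss_rows = k * np.sum((x.mean(axis=1) - grand) ** 2)
    ss_cols = n * np.sum((x.mean(axis=0) - grand) ** 2)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Consistency ignores systematic rater offsets; absolute agreement penalizes them.
    icc_consistency = (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)
    icc_agreement = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + (k / n) * (ms_cols - ms_error)
    )
    return icc_consistency, icc_agreement

# Rater B always scores two points higher than rater A (hypothetical pain scores).
rater_a = [2, 4, 6, 3, 7, 5]
rater_b = [score + 2 for score in rater_a]

consistency, agreement = icc_single(np.column_stack([rater_a, rater_b]))
print(f"consistency ICC = {consistency:.2f}, absolute-agreement ICC = {agreement:.2f}")
# consistency ICC = 1.00, absolute-agreement ICC = 0.64
```

The constant two-point offset leaves the rank order untouched, so the consistency form is perfect while the absolute-agreement form drops well below it.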
Interobserver vs. Intraobserver Reliability
Interobserver reliability measures agreement between different people. Intraobserver reliability measures whether the same person gets the same result when they repeat the measurement on a separate occasion. Both matter, and they often diverge. In a study of spinal cord ultrasound measurements, intraobserver reliability was moderate (ICC around 0.60 for each observer individually), while interobserver reliability for the same measurement method dropped to just 0.20, classified as slight agreement. The same observers were reasonably consistent with themselves but poor at agreeing with each other.
This pattern is common. Individual raters develop internal habits and tendencies that make their own repeated measurements fairly stable. But those habits differ from person to person, which is exactly what interobserver reliability detects. A measurement tool is only as useful as its ability to produce similar results regardless of who is using it.
Where Interobserver Reliability Matters Most
In diagnostic imaging, interobserver reliability directly affects patient care. A systematic review of variability studies in radiology found that about 48% of imaging tests studied showed acceptably low variability, with authors recommending them for clinical use. But 24% showed high variability, with some researchers calling urgently for consensus guidelines. The remaining studies fell somewhere in between, often with results that were difficult to interpret because different statistical measures told conflicting stories about the same data.
Beyond medicine, interobserver reliability is critical in any field where humans make subjective judgments. Behavioral scientists coding video observations, teachers grading essays, peer reviewers scoring grant applications, forensic analysts comparing fingerprints: all of these rely on human judgment, and all require evidence that different observers reach similar conclusions. Without that evidence, the measurements can’t be trusted to reflect anything beyond individual opinion.
How to Improve Agreement Between Raters
Low interobserver reliability usually traces back to vague criteria, insufficient training, or both. The most effective interventions are straightforward. A study on grant peer review developed a brief 11-minute training video and found measurable improvements. The training focused on five elements: explaining why accurate scoring matters, clarifying what each score on the rating scale means, defining what counts as a minor versus moderate versus major weakness, showing how small scoring differences affect real outcomes, and emphasizing the importance of reading evaluation criteria carefully.
The same principles apply across fields. Clear operational definitions reduce ambiguity: instead of telling raters to identify “significant” findings, specify exactly what qualifies. Practice sessions with feedback let raters calibrate against each other before the real data collection begins. Providing reference examples for each category gives raters concrete anchors rather than abstract descriptions. Even something as simple as addressing reverse coding (where lower numbers mean better outcomes, contrary to intuition) can reduce errors that look like disagreement but are really just confusion about the scale.
Training doesn’t need to be lengthy to be effective, but it does need to be specific. The goal is to ensure that every rater understands the task identically, so that differences in their ratings reflect genuine ambiguity in the data rather than differences in how they interpreted the instructions.

