What Is ICC in Statistics? Meaning, Types & Uses

The intraclass correlation coefficient (ICC) is a statistical measure that quantifies how consistently repeated measurements agree with each other. In theory it ranges from 0 to 1 (sample estimates can occasionally dip below 0 due to sampling error), where values closer to 1 mean measurements are highly reproducible and values near 0 mean most of the variability comes from measurement error rather than real differences between subjects. ICC is the standard tool for assessing reliability in fields like medicine, psychology, and sports science, where you need to know whether raters, instruments, or repeated tests produce consistent results.

How ICC Works

At its core, ICC answers a simple question: of all the variability in a dataset, how much comes from genuine differences between subjects, and how much is just noise? Mathematically, it’s the ratio of true variance to total variance (true variance plus error variance). If you measure range of motion in 20 patients using two different clinicians, some of the variation in scores reflects the fact that patients genuinely differ in flexibility. The rest reflects inconsistency between the clinicians. ICC separates these two sources.

Consider a concrete example. Say the between-patient (true) variance in joint range of motion is 9.6 and the error variance from measurement inconsistency is 12.8, both in squared degrees. The ICC would be 9.6 / (9.6 + 12.8) ≈ 0.43. That’s a mediocre result, telling you that less than half the total variability reflects real patient differences. The rest is measurement noise.
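The arithmetic is simple enough to sketch in a few lines of Python, using the two variance figures from the example above:

```python
# ICC as the ratio of true (between-subject) variance to total variance.
true_var = 9.6    # variance from genuine patient differences
error_var = 12.8  # variance from measurement inconsistency

icc = true_var / (true_var + error_var)
print(round(icc, 2))  # 0.43
```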

Why Not Just Use Regular Correlation?

The Pearson correlation coefficient, which most people learn first, measures how strongly two variables move together in a straight line. It works well for asking “do X and Y have a linear relationship?” But it has a critical blind spot for reliability work: it ignores differences in the actual values. Two raters could consistently score patients 10 points apart from each other and still produce a Pearson correlation of 1.0, because their scores move in lockstep even though they never agree on an actual number.

ICC fixes this by accounting for differences in means between raters, not just the pattern of their scores. It captures both correlation and agreement in a single number. ICC also handles situations with three or more raters naturally, while Pearson correlation is limited to comparing exactly two variables at a time.
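To make the blind spot concrete, here is a short Python sketch with invented scores: the second rater is always exactly 10 points above the first, so the two never agree on a single value, yet the Pearson correlation is a perfect 1.0.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient, computed from the definition."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical scores: rater B sits exactly 10 points above rater A
# on every patient, so the two never agree on a single number.
rater_a = [50, 55, 60, 65, 70]
rater_b = [s + 10 for s in rater_a]

print(round(pearson(rater_a, rater_b), 6))  # 1.0
```

An absolute-agreement ICC on the same data would come out noticeably lower than 1.0, because it penalizes the constant 10-point offset rather than ignoring it.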

The Different Types of ICC

There isn’t one ICC. There are several versions, and picking the wrong one can meaningfully change your results. The variations come from two decisions you need to make about your study design.

Choosing a Model

The first decision is which statistical model fits how your raters were selected. There are three options, commonly called Model 1, Model 2, and Model 3:

  • One-way random (Model 1): Each subject is rated by a different set of raters, drawn randomly from a larger population. This is the least common setup but applies when, for instance, different technicians happen to be available for different patients.
  • Two-way random (Model 2): Every subject is rated by the same set of raters, and those raters are considered a random sample from a larger pool. You’d use this when your three raters were randomly chosen from all possible raters, and you want your results to generalize beyond just those three individuals.
  • Two-way mixed (Model 3): Every subject is rated by the same set of raters, but those specific raters are the only ones you care about. You’d use this when your study’s raters are the only ones who will ever use the measurement tool, or when raters were deliberately selected rather than randomly sampled.

Single Measures vs. Average Measures

The second decision is whether your final measurement in practice will come from a single rater or be averaged across multiple raters. If one clinician will assess each patient in the real world, you want the single-measures ICC. If the plan is to average scores from several raters (as in panel-based assessments or research protocols), you want the average-measures ICC. The average-measures version is higher whenever the single-measures ICC is positive, because averaging naturally smooths out individual rater error.
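The step up from single measures to average measures follows the Spearman-Brown formula. A minimal sketch, reusing the 0.43 figure from the earlier worked example:

```python
def average_measures_icc(single_icc, k):
    """Spearman-Brown step-up: reliability of the mean of k raters,
    given the single-rater ICC."""
    return k * single_icc / (1 + (k - 1) * single_icc)

# Averaging three raters lifts a mediocre single-rater ICC of 0.43
# to a moderate 0.69.
print(round(average_measures_icc(0.43, 3), 2))  # 0.69
```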

Combining these choices produces six distinct ICC formulas, often labeled ICC(1,1), ICC(2,1), ICC(3,1) for single measures and ICC(1,k), ICC(2,k), ICC(3,k) for average measures. Statistical software like R’s psych package outputs all six at once, so the challenge isn’t computing them; it’s knowing which one to report.

Consistency vs. Absolute Agreement

For Models 2 and 3, there’s one more choice: do you care about consistency or absolute agreement? Consistency asks whether raters rank subjects in the same order, even if their scores differ by a fixed amount. Absolute agreement asks whether raters assign the same actual values. If one physical therapist consistently scores every patient 5 degrees higher than another, consistency ICC would still be high, but absolute agreement ICC would drop.
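The physical-therapist scenario can be worked through numerically. The sketch below uses invented scores where the second rater is exactly 5 degrees above the first, and computes both single-measure forms from the standard two-way ANOVA mean squares (ICC(3,1) for consistency, ICC(2,1) for absolute agreement):

```python
from statistics import mean

def two_way_icc(ratings):
    """Single-measure ICCs from a two-way ANOVA decomposition.
    `ratings` is a list of rows, one per subject, one column per rater.
    Returns (consistency ICC(3,1), absolute-agreement ICC(2,1))."""
    n, k = len(ratings), len(ratings[0])
    grand = mean(v for row in ratings for v in row)
    subj_means = [mean(row) for row in ratings]
    rater_means = [mean(col) for col in zip(*ratings)]

    msr = k * sum((m - grand) ** 2 for m in subj_means) / (n - 1)   # between subjects
    msc = n * sum((m - grand) ** 2 for m in rater_means) / (k - 1)  # between raters
    sse = sum((ratings[i][j] - subj_means[i] - rater_means[j] + grand) ** 2
              for i in range(n) for j in range(k))
    mse = sse / ((n - 1) * (k - 1))                                  # residual

    consistency = (msr - mse) / (msr + (k - 1) * mse)
    agreement = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    return consistency, agreement

# Hypothetical scores: the second therapist reads every patient
# exactly 5 degrees higher than the first.
rows = [[30, 35], [35, 40], [40, 45], [45, 50], [50, 55]]
c, a = two_way_icc(rows)
print(round(c, 2), round(a, 2))  # 1.0 0.83
```

Consistency comes out at a perfect 1.0 because the two raters rank patients identically, while absolute agreement drops to about 0.83 because the 5-degree offset counts as disagreement.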

In most clinical and research contexts, absolute agreement is the more conservative and informative choice. If you’re trying to establish that a measurement tool gives the same result regardless of who uses it, you want to know the values themselves match, not just the rankings.

How to Interpret ICC Values

A widely used framework from Koo and Li (2016) breaks ICC into four reliability levels:

  • Below 0.50: Poor reliability
  • 0.50 to 0.75: Moderate reliability
  • 0.75 to 0.90: Good reliability
  • Above 0.90: Excellent reliability

These thresholds are guidelines, not hard rules. An ICC of 0.74 isn’t fundamentally different from 0.76. What matters more is the confidence interval around your ICC estimate. A reported ICC of 0.85 with a 95% confidence interval of 0.40 to 0.95 tells you the point estimate looks good but the true reliability could be anywhere from poor to excellent. Small sample sizes tend to produce wide confidence intervals, making the ICC estimate less trustworthy regardless of how high the number looks.

Context also matters. An ICC of 0.70 might be acceptable for a screening questionnaire used in early research, but inadequate for a diagnostic tool where individual patient decisions depend on the result.
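If you want to attach the four labels programmatically, a minimal sketch might look like this (how to classify values that land exactly on a boundary, such as 0.75, is a judgment call, and Koo and Li themselves recommend judging the confidence interval rather than the point estimate):

```python
def reliability_label(icc):
    """Map an ICC point estimate to the Koo & Li (2016) bands."""
    if icc < 0.50:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.90:
        return "good"
    return "excellent"

print(reliability_label(0.43))  # poor
print(reliability_label(0.92))  # excellent
```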

Assumptions Behind ICC

ICC calculations rest on several assumptions that are worth knowing about, because violating them can distort your results. The data should be approximately normally distributed. The variance in measurement error should be roughly the same across all subjects (a property called homogeneous variance). And the subjects should represent a random sample from the population you want to generalize to.

In health and behavioral research, these assumptions are frequently violated but rarely checked. Measurement scales with floor or ceiling effects, for example, tend to produce non-normal distributions and uneven variance. Scores clustered near the extremes of a scale leave little room for raters to disagree, artificially inflating ICC. When these assumptions are seriously violated, alternative estimation methods exist, but the standard ICC calculation reported by most software assumes all conditions are met.

Common Uses in Practice

ICC shows up whenever researchers need to establish that a measurement is trustworthy before using it to answer a bigger question. In inter-rater reliability studies, two or more clinicians, coders, or judges rate the same subjects, and ICC quantifies how well they agree. In test-retest reliability studies, the same rater measures the same subjects at two time points, and ICC tells you how stable the measurement is over time. In instrument validation, researchers compare readings from a new device against an established one to see if the new tool is consistent enough for clinical use.

Specific examples include measuring joint range of motion across physical therapists, scoring psychiatric symptom scales across trained raters, grading radiological images across radiologists, and assessing movement quality in sports biomechanics. In each case, ICC provides a single number that captures whether the measurement process is reliable enough to trust the data it produces.