A test is valid when it measures what it claims to measure, and reliable when it produces consistent results each time it’s used. These two qualities are the foundation of any trustworthy assessment, whether it’s a classroom exam, a psychological screening, or a medical diagnostic. They’re related but distinct: reliability is about consistency, validity is about accuracy. A test needs both to be useful, but having one doesn’t guarantee the other.
Reliability: Consistency Across Conditions
Reliability asks a simple question: if you gave this test again under similar conditions, would you get the same results? A bathroom scale that reads 150 pounds one minute and 163 pounds the next is unreliable. It doesn’t matter whether either number is correct. The inconsistency alone makes it useless.
There are several ways to measure reliability, each capturing a different kind of consistency:
- Test-retest reliability checks whether the same person gets similar scores when taking the test on two separate occasions. This is measured using correlation coefficients that quantify how closely the two sets of scores match. It works best for traits that shouldn’t change over short periods, like personality or cognitive ability.
- Internal consistency checks whether the individual items on a test all measure the same underlying thing. If a 20-question anxiety questionnaire is well designed, someone who scores high on one question should tend to score high on the others. The most common measure for this is Cronbach’s alpha, a statistic that ranges from 0 to 1.
- Inter-rater reliability applies when humans are doing the scoring. If two trained graders evaluate the same essay or observe the same patient interview, they should arrive at similar ratings. When they don’t, the problem is with the scoring system, not the person being assessed.
- Split-half reliability divides a test into two halves (often odd-numbered and even-numbered items), scores each half separately, and checks whether the halves correlate. The correlation is then adjusted using a formula called the Spearman-Brown correction to estimate the reliability of the full-length test. The sketch after this list shows how both this and Cronbach’s alpha can be computed.
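As a concrete illustration, here is a minimal Python sketch of two of these statistics: Cronbach’s alpha and the split-half estimate with the Spearman-Brown correction. The data is synthetic and the numbers are purely illustrative; a real analysis would more likely use a dedicated psychometrics package.

```python
import numpy as np

def cronbachs_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a respondents-by-items score matrix."""
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def split_half_reliability(items: np.ndarray) -> float:
    """Odd/even split-half correlation, corrected with Spearman-Brown."""
    odd = items[:, 0::2].sum(axis=1)    # items 1, 3, 5, ...
    even = items[:, 1::2].sum(axis=1)   # items 2, 4, 6, ...
    r = np.corrcoef(odd, even)[0, 1]    # correlation between the two halves
    return 2 * r / (1 + r)              # Spearman-Brown: full-length estimate

# Hypothetical data: 50 respondents answering 20 Likert-type items (1-5),
# all driven by one underlying trait plus random noise.
rng = np.random.default_rng(0)
trait = rng.normal(size=(50, 1))
scores = np.clip(np.round(3 + trait + rng.normal(scale=0.8, size=(50, 20))), 1, 5)

print(f"Cronbach's alpha: {cronbachs_alpha(scores):.2f}")
print(f"Split-half (Spearman-Brown): {split_half_reliability(scores):.2f}")
```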
What Counts as “Reliable Enough”
For Cronbach’s alpha, acceptable values generally range from 0.70 to 0.95. Below 0.70, the test items probably aren’t measuring the same thing consistently. But higher isn’t always better. A score above 0.90 can actually suggest that some items are redundant, essentially asking the same question in slightly different words. That’s a sign the test could be shortened without losing useful information. For high-stakes decisions like clinical diagnoses, reliability closer to 0.90 is expected. For research surveys or classroom quizzes, 0.70 to 0.80 is often considered adequate.
Validity: Measuring the Right Thing
A test can be perfectly consistent and still completely wrong. Imagine measuring intelligence by recording shoe size. You’d get highly reliable numbers (shoe size doesn’t change day to day), but they’d tell you nothing about intelligence. That’s reliable but not valid. Validity is the harder quality to establish because it requires evidence that the test actually captures what it’s supposed to.
Validity comes in several forms, each providing a different type of evidence:
- Content validity asks whether the test covers the full range of the topic it’s supposed to measure. A final exam for a biology course that only includes questions about cell structure, ignoring genetics and ecology entirely, has poor content validity. It’s an incomplete sample of the knowledge it should assess.
- Construct validity asks whether the test measures the theoretical concept it’s designed to capture. A depression questionnaire has good construct validity if it actually reflects the psychological experience of depression, not just general unhappiness or fatigue. This is often the most difficult type of validity to demonstrate because it requires showing the test aligns with theoretical predictions about how the measured trait should behave.
- Criterion validity asks whether test scores correspond to real-world outcomes. This breaks into two subtypes. Predictive validity checks whether scores predict something meaningful in the future: does a college admissions test predict first-year grades? Concurrent validity checks whether scores align with other established measures of the same thing right now: does a new anxiety scale produce scores similar to an older, well-validated one?
- Convergent and discriminant validity work as a pair. Convergent validity shows that your test correlates with other measures of the same concept. Discriminant validity shows that it doesn’t correlate with measures of unrelated concepts. A good self-esteem scale should correlate with other self-esteem measures (convergent) but not strongly with, say, math ability (discriminant); the simulation after this list makes that pattern concrete.
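To see the convergent/discriminant pattern in miniature, the toy simulation below generates two self-esteem measures that share a latent trait, plus an unrelated math-ability score, and then inspects their correlations. All names and data here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200  # hypothetical sample of 200 participants

# Two self-esteem measures share a latent trait; math ability is independent.
self_esteem = rng.normal(size=n)
new_scale = self_esteem + rng.normal(scale=0.5, size=n)    # scale being validated
established = self_esteem + rng.normal(scale=0.5, size=n)  # well-validated measure
math_ability = rng.normal(size=n)                          # unrelated construct

corr = np.corrcoef([new_scale, established, math_ability])
print(f"new vs. established (convergent):    r = {corr[0, 1]:.2f}")  # should be high
print(f"new vs. math ability (discriminant): r = {corr[0, 2]:.2f}")  # should be near 0
```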
Why Reliability Comes First
Think of a target at a shooting range. Reliability is like grouping your shots tightly together. Validity is like grouping them around the bullseye. If your shots are scattered all over the target, you can’t claim to be hitting the center consistently, even if a few land there by chance. But if all your shots land in a tight cluster in the upper left corner, you’re precise (reliable) without being accurate (valid). You’d just need to adjust your aim.
This is why reliability is considered a prerequisite for validity. A test that produces inconsistent scores cannot consistently capture anyone’s true standing on the trait being measured. However, reliability alone doesn’t guarantee validity: you still need evidence that the thing being measured consistently is the right thing.
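Classical test theory makes this relationship precise: a test’s correlation with any criterion cannot exceed the square root of its own reliability. A small illustration of that ceiling:

```python
import math

# In classical test theory, a test's validity coefficient is bounded
# above by the square root of its reliability coefficient.
for reliability in (0.50, 0.70, 0.90):
    ceiling = math.sqrt(reliability)
    print(f"reliability {reliability:.2f} -> maximum possible validity {ceiling:.2f}")
```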
What Reduces Reliability and Validity
Several practical factors can degrade a test’s quality, many of them surprisingly mundane. Test length matters: shorter tests tend to be less reliable because each individual item has more influence on the total score, and random error has fewer chances to cancel itself out. Environmental conditions like noise, lighting, or time pressure can introduce variability that has nothing to do with what’s being measured. Fatigue is another factor. People responding more slowly tend to produce wider variation in their scores, and that variation often reflects tiredness rather than the trait being assessed.
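The link between length and reliability can be quantified with the same Spearman-Brown formula used for split-half estimates, applied “prophetically” to predict reliability at a new length. A small sketch, assuming a hypothetical 20-item test with reliability 0.80:

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predicted reliability when a test's length is multiplied by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

base = 0.80  # hypothetical reliability of a 20-item test
for factor, label in [(0.5, "10 items"), (1.0, "20 items"), (2.0, "40 items")]:
    print(f"{label}: predicted reliability {spearman_brown(base, factor):.2f}")
# Halving the test drops reliability to about 0.67; doubling it raises it to about 0.89.
```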
Item quality plays a major role too. Questions that are ambiguously worded, too easy, or too difficult can reduce both reliability and validity. In timed tests, keeping unusual outlier responses (extremely slow or fast answers) rather than removing them has been shown to harm both qualities. Even something as simple as participants still learning the instructions at the start of a test can introduce noise into the early items.
Unequal conditions across test-takers are another threat. If one group gets more time, different instructions, or a quieter room, the scores become harder to compare. The test might still be internally consistent, but the results no longer mean the same thing for everyone.
How Tests Are Built to Be Valid and Reliable
Good tests don’t happen by accident. The development process typically involves several deliberate steps designed to build validity and reliability in from the start.
Expert review is one of the first steps. A panel of typically five to seven experts evaluates each proposed item to determine whether it represents the topic the test is supposed to cover. Their agreement can be quantified using formal statistics, turning subjective judgment into measurable consensus. Some development teams use a structured method called the Delphi process, where experts independently rate items across multiple rounds until they converge on which questions best reflect the concept being measured.
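The text doesn’t name a specific agreement statistic, but one widely used option is the item-level content validity index (I-CVI): the proportion of panelists who rate an item as relevant. A minimal sketch with hypothetical ratings:

```python
# Hypothetical relevance ratings from a panel of six experts, one row per item,
# on a 4-point scale (1 = not relevant ... 4 = highly relevant).
ratings = [
    [4, 4, 3, 4, 4, 3],  # item 1
    [4, 3, 4, 4, 3, 4],  # item 2
    [2, 3, 1, 2, 3, 2],  # item 3: weak agreement, a candidate for revision or removal
]

for i, item in enumerate(ratings, start=1):
    relevant = sum(1 for r in item if r >= 3)  # experts rating the item 3 or 4
    i_cvi = relevant / len(item)               # item-level content validity index
    print(f"item {i}: I-CVI = {i_cvi:.2f}")
```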
Pilot testing comes next. Draft questions are given to a small group from the target population, not to collect scores, but to check understanding. A technique called cognitive interviewing asks respondents to talk through their thought process as they answer each question. This reveals misunderstandings, confusing wording, and items that people interpret differently than intended. Five to fifteen interviews across two or three rounds are considered ideal. In one development project, eight items were dropped after cognitive interviews because respondents found them unclear or unimportant, and several others were reworded based on feedback about grammar and answer options.
After pilot testing, factor analysis is used to confirm that the test items cluster into the expected groups. If you designed a wellbeing survey to measure three domains (physical health, mental health, and social connection), factor analysis checks whether the items actually sort into those three groups statistically, or whether the data suggests a different structure entirely.
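As a sketch of what that check might look like, the example below simulates a nine-item survey driven by three latent domains and fits a three-factor model with scikit-learn’s FactorAnalysis (one of several tools that could be used here). The data and structure are entirely hypothetical.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(2)
n = 300  # hypothetical number of survey respondents

# Nine items, three per intended domain: physical health, mental health,
# social connection. Each item loads on exactly one latent factor.
latent = rng.normal(size=(n, 3))
loadings = np.zeros((3, 9))
loadings[0, 0:3] = loadings[1, 3:6] = loadings[2, 6:9] = 0.9
items = latent @ loadings + rng.normal(scale=0.5, size=(n, 9))

fa = FactorAnalysis(n_components=3, rotation="varimax")
fa.fit(items)

# Each row is a factor; large absolute loadings show which items cluster together.
# With this data, items 1-3, 4-6, and 7-9 should each load on their own factor.
print(np.round(fa.components_, 2))
```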
Validity and Reliability in Medical Testing
In clinical settings, the concepts of validity and reliability translate into more specific metrics. A medical test’s validity is captured by two numbers: sensitivity and specificity. Sensitivity is the proportion of people who actually have a condition that the test correctly identifies as positive. Specificity is the proportion of people without the condition that the test correctly identifies as negative. A highly sensitive test rarely misses real cases. A highly specific test rarely flags healthy people.
These map onto validity because they measure whether the test is actually detecting what it claims to detect. A pregnancy test with high sensitivity catches nearly every pregnancy. One with high specificity almost never gives a false positive. Most medical tests involve a tradeoff between the two, and which one matters more depends on the consequences of being wrong.
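Both numbers fall out directly from a test’s confusion counts. A minimal sketch with hypothetical screening results:

```python
def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int) -> tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical screening of 1,000 people, 100 of whom have the condition.
tp, fn = 95, 5    # of the 100 true cases: 95 flagged, 5 missed
tn, fp = 855, 45  # of the 900 healthy people: 855 cleared, 45 false alarms

sens, spec = sensitivity_specificity(tp, fn, tn, fp)
print(f"sensitivity: {sens:.2f}")  # rarely misses real cases
print(f"specificity: {spec:.2f}")  # rarely flags healthy people
```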
Reliability in medical testing shows up as reproducibility. If the same blood sample is run through the same analyzer twice, the results should be nearly identical. If two radiologists read the same scan, they should reach the same conclusion. When they don’t, that’s a reliability problem, and it means the test results can’t be fully trusted regardless of how well the test was designed.
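The section doesn’t name a metric, but analyzer-style reproducibility is often summarized with the coefficient of variation: the standard deviation of repeated runs expressed as a percentage of their mean. A small sketch with hypothetical readings:

```python
import statistics

# Hypothetical repeated runs of the same blood sample on one analyzer
# (e.g., potassium in mmol/L).
runs = [4.82, 4.79, 4.85, 4.80, 4.83]

mean = statistics.mean(runs)
sd = statistics.stdev(runs)
cv = 100 * sd / mean  # coefficient of variation, as a percentage

print(f"mean = {mean:.2f}, SD = {sd:.3f}, CV = {cv:.1f}%")
# A low CV indicates good run-to-run reproducibility; acceptable
# thresholds depend on the specific assay.
```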
The Professional Standards
The formal guidelines governing test quality in the United States come from the Standards for Educational and Psychological Testing, jointly published by the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education. These organizations have collaborated on this document since 1966, with the current edition published in 2014 and now available as open access. A revision process began in 2024 with the naming of a new joint committee. The Standards are widely regarded as the authoritative reference for how tests should be developed, evaluated, and used, and they apply to everything from school achievement tests to clinical screening tools to employment assessments.

