What Makes a Test Valid? Types, Threats, and Standards

A test is valid when it actually measures what it claims to measure. A spelling test that only includes vocabulary words isn’t measuring spelling ability. A job screening that asks questions unrelated to job performance isn’t measuring candidate quality. Validity is the single most important quality of any test, whether it’s a classroom exam, a psychological assessment, a hiring tool, or a medical diagnostic. Without it, the scores are meaningless.

The Three Core Types of Validity

Validity isn’t one thing you check off a list. It’s built from multiple types of evidence, and the three major categories are content validity, criterion validity, and construct validity. Each answers a different question about whether the test is doing its job.

Content validity asks whether the test items adequately cover the subject they’re supposed to measure. Think of a final exam for a biology course that only tests material from the first two weeks. The test content doesn’t represent the full universe of what was taught, so it has poor content validity. This type of validity is evaluated by having experts review each item and judge whether it belongs. A widely used method, developed by psychometrician C. H. Lawshe, has a panel of experts rate each item as “essential,” “useful but not essential,” or “not necessary.” The ratings are summarized as a content validity ratio (CVR), and an item is retained only if its CVR clears a cutoff that depends on panel size; Lawshe’s published table is strict for small panels, with a seven-member panel requiring essentially unanimous agreement that an item is essential.
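Lawshe’s ratio is simple to compute directly. A minimal sketch, with an invented panel vote for illustration:

```python
# Lawshe's content validity ratio (CVR) for a single test item.
# CVR = (n_e - N/2) / (N/2), where n_e is the number of panelists
# rating the item "essential" and N is the panel size.

def content_validity_ratio(n_essential: int, n_panelists: int) -> float:
    """Ranges from -1 (no one rates the item essential) to +1
    (everyone does); 0 means exactly half the panel did."""
    half = n_panelists / 2
    return (n_essential - half) / half

# Hypothetical example: 6 of 7 panelists rate an item "essential".
cvr = content_validity_ratio(6, 7)
print(round(cvr, 3))  # 0.714
```

Whether 0.714 is high enough depends on Lawshe’s critical-value table for the given panel size, not on the ratio alone.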

Criterion validity asks whether test scores correlate with real-world outcomes. It comes in two forms. Concurrent validity checks whether test scores align with a measure taken at roughly the same time, like comparing a new depression questionnaire against an established one. Predictive validity checks whether test scores forecast future performance, like whether an entrance exam predicts graduation rates years later. Both are measured using correlation coefficients. Meta-analyses have found that predictive validity coefficients tend to run about .07 lower than concurrent ones, partly because measuring future outcomes introduces more variability.
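A criterion validity coefficient is just a correlation between test scores and an outcome measure. A self-contained sketch with made-up entrance-exam and GPA data:

```python
# Predictive validity as a Pearson correlation between test scores
# and a later criterion. All numbers below are invented.
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

entrance_scores = [520, 580, 610, 640, 700, 720, 760]
first_year_gpa = [2.4, 2.9, 2.7, 3.1, 3.3, 3.6, 3.5]

r = pearson_r(entrance_scores, first_year_gpa)
print(round(r, 2))  # strong positive correlation for this toy data
```

Real predictive validity coefficients are far more modest than this toy example, which is why meta-analytic estimates matter.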

Construct validity asks whether the test truly captures the abstract trait it’s designed to measure. Intelligence, anxiety, leadership potential: these are constructs you can’t observe directly. Establishing construct validity requires accumulating evidence from multiple sources, not just a single correlation.

How Construct Validity Is Established

Two key pieces of evidence support construct validity: convergent and discriminant validity. Convergent validity means your test agrees with other tests that measure the same trait. If your new anxiety scale produces similar scores to three established anxiety measures, that’s convergent evidence. Discriminant validity means your test does not correlate strongly with tests measuring different traits. If your anxiety scale produces scores nearly identical to a depression scale, you can’t be sure it’s really measuring anxiety specifically.

The classic method for evaluating both comes from a 1959 framework by Campbell and Fiske called the multitrait-multimethod matrix. The idea is to measure several traits using several different methods (self-report, observer ratings, peer evaluations) and then examine the pattern of correlations. A trait should show high agreement across methods (convergent) and low overlap with other traits (discriminant). Modern researchers typically use a statistical technique called confirmatory factor analysis to do this more precisely, because it can separate the influence of the method from the influence of the trait itself.
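The Campbell–Fiske pattern can be illustrated with a toy set of correlations (all values invented) among trait–method pairs:

```python
# Toy multitrait-multimethod check in the spirit of Campbell & Fiske
# (1959). Keys are pairs of (trait, method); values are correlations.
corr = {
    (("anxiety", "self"), ("anxiety", "observer")): 0.68,        # same trait, different method
    (("depression", "self"), ("depression", "observer")): 0.71,  # same trait, different method
    (("anxiety", "self"), ("depression", "self")): 0.35,         # different traits, same method
    (("anxiety", "observer"), ("depression", "observer")): 0.30, # different traits, same method
}

def mean(vals):
    return sum(vals) / len(vals)

# Convergent evidence: same trait measured by different methods.
convergent = [r for (a, b), r in corr.items() if a[0] == b[0] and a[1] != b[1]]
# Discriminant evidence: correlations that cross traits.
discriminant = [r for (a, b), r in corr.items() if a[0] != b[0]]

# Basic required pattern: convergent correlations should clearly
# exceed the trait-crossing ones.
print(mean(convergent) > mean(discriminant))  # True for this matrix
```

A real MTMM analysis examines the full matrix (and, as noted, is usually done with confirmatory factor analysis), but the logic is exactly this comparison.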

Face Validity: Useful but Limited

Face validity is the simplest form: does the test look like it measures what it’s supposed to measure? If you hand someone a math test and it clearly contains math problems, it has high face validity. This matters because test-takers who feel a test is irrelevant may not engage with it seriously, which can skew results.

However, the research community considers face validity the weakest form of validity evidence because it’s based on subjective judgment rather than objective data. A test can appear valid on the surface while measuring something entirely different. Whether an assessment is new or long established, face validity alone is not sufficient; empirical testing is required. That said, structured expert panels using systematic evaluation processes can produce meaningful face validity judgments, particularly for new measures where empirical data doesn’t yet exist.

Validity in Medical and Diagnostic Tests

For medical tests, validity takes a more concrete form through two metrics: sensitivity and specificity. Sensitivity is the proportion of people with a condition who correctly test positive. A test with 95% sensitivity catches 95 out of every 100 people who actually have the disease. Specificity is the proportion of people without the condition who correctly test negative. A test with 90% specificity correctly clears 90 out of every 100 healthy people.
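Both metrics fall straight out of a 2×2 table of test results against true condition status. The counts below are hypothetical: 1,000 people screened, 100 of whom actually have the disease:

```python
# Sensitivity and specificity from a 2x2 confusion table.
# Hypothetical counts: 100 truly ill, 900 truly healthy.
tp, fn = 95, 5     # ill people: correctly flagged vs. missed
tn, fp = 810, 90   # healthy people: correctly cleared vs. falsely flagged

sensitivity = tp / (tp + fn)   # fraction of actual cases caught
specificity = tn / (tn + fp)   # fraction of healthy people cleared

print(sensitivity)  # 0.95
print(specificity)  # 0.9
```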

Neither number alone tells the full story. A test can be highly sensitive (it rarely misses a case) but have low specificity (it frequently flags healthy people as sick). The reverse is also possible. Positive predictive value fills in the gap by answering what percentage of positive results are actually true positives, which depends heavily on how common the condition is in the population being tested. A positive result on a highly specific test means much more when the disease is common than when it’s rare.
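Bayes’ rule makes the prevalence dependence concrete. A sketch using the same hypothetical 95%-sensitive, 90%-specific test at two different prevalence levels:

```python
# Positive predictive value (PPV) via Bayes' rule: the fraction of
# positive results that are true positives, as a function of how
# common the condition is in the tested population.
def ppv(sensitivity, specificity, prevalence):
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Same test, very different meaning of a positive result:
print(round(ppv(0.95, 0.90, 0.20), 2))   # common disease (20% prevalence): 0.7
print(round(ppv(0.95, 0.90, 0.001), 3))  # rare disease (0.1% prevalence): 0.009
```

With the rare disease, fewer than 1 in 100 positives is a true case, even though the test itself hasn’t changed.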

Why Reliability Is Necessary but Not Enough

Reliability and validity are related but not the same thing. Reliability means consistency: if someone takes the test twice under similar conditions, they get similar scores. Validity means accuracy: the test measures the right thing. A useful analogy is a target. A marksman who shoots a tight cluster far from the bullseye is reliable (consistent) but not valid (not hitting the target). Only the shooter who consistently hits the bullseye is both reliable and valid.

This distinction has a critical implication. Reliability is a prerequisite for validity. A test that gives wildly different scores each time cannot be measuring the intended trait accurately. But reliability alone does not guarantee validity. Researchers sometimes claim a test is valid simply because it produces reproducible scores, and this is a well-documented error. Consistency is the floor, not the ceiling.
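The target analogy can be put in numbers with two imaginary measuring devices: one consistent but biased, one noisy but centered on the truth. Readings below are invented:

```python
# Reliable but not valid: a scale that reads about 5 kg heavy every
# time. "Reliability" here is proxied by low spread across readings;
# "validity" by how close the average reading is to the true value.
from statistics import mean, stdev

true_weight = 70.0
biased_scale = [75.1, 74.9, 75.0, 75.2, 74.8]  # tight cluster, wrong target
noisy_scale = [68.0, 73.5, 66.9, 72.1, 69.4]   # scattered, roughly centered

print(stdev(biased_scale), mean(biased_scale) - true_weight)  # consistent, biased
print(stdev(noisy_scale), mean(noisy_scale) - true_weight)    # inconsistent, ~unbiased
```

The biased scale is the tight cluster far from the bullseye: perfectly reproducible scores, systematically wrong answer.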

Common Threats to Validity

Researchers have cataloged at least 37 distinct threats to validity in empirical studies. Several of the most common ones apply directly to testing situations.

  • Testing effects occur when the act of being tested itself changes the outcome. Weighing someone at the start of a weight-loss study might motivate them to lose weight regardless of the intervention, making it impossible to separate the treatment effect from the measurement effect.
  • Instrumentation changes happen when what the test measures shifts over time. If diagnostic criteria for a condition change between two measurement points, apparent differences in outcomes might reflect changes in the definition rather than real changes in the population.
  • Maturation refers to natural changes in the person being tested. A child’s reading ability might improve between two tests simply because they’re six months older, not because of a reading program.
  • Attrition occurs when people drop out of a study or testing process in non-random ways. If struggling students stop showing up for assessments, the remaining scores will look artificially high.
  • Ambiguous temporal precedence is essentially reverse causality. If you’re testing whether violence exposure causes mental health problems, but mental health problems also increase exposure to violence, the direction of the relationship is unclear.

Professional Standards for Test Validity

The gold standard for evaluating test validity in the United States and internationally is the Standards for Educational and Psychological Testing, jointly published by three major organizations: the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education. These organizations have collaborated on the Standards since 1966, and each one’s governing body formally approves the document as representing best practice.

The most recent edition, published in 2014 and released as open access in 2021, emphasizes that validity is not a property of the test itself but of the interpretations drawn from test scores. A test might be valid for one purpose (screening job applicants for basic competency) but invalid for another (predicting long-term career success). The 2014 revision placed particular emphasis on fairness and accessibility, the role of testing in educational accountability, workplace credentialing, and the expanding role of technology in test delivery. These aren’t abstract concerns. A test that is valid for one population but systematically disadvantages another is not producing valid score interpretations for everyone.