Predictive validity is a measure of how well a test or assessment forecasts a future outcome. If a college entrance exam actually predicts freshman grades, or a hiring test predicts job performance six months later, those tools have predictive validity. The key ingredient is time: you measure something now and check whether it correlates with a real-world result later.
The American Psychological Association classifies predictive validity as a form of criterion validity, meaning it’s always measured against some concrete, observable outcome (the “criterion”) rather than against abstract concepts. That criterion could be college GPA, a clinical diagnosis, job performance ratings, or recidivism rates, depending on the context.
How Predictive Validity Works
Establishing predictive validity follows a straightforward logic. First, you administer a test or collect a measurement from a group of people. Then you wait. After a defined period (weeks, months, or even years), you measure the outcome you care about. Finally, you calculate the statistical correlation between the original scores and the later outcome. A strong positive correlation means the test does a good job of forecasting; a weak correlation means it doesn’t.
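A minimal sketch of that workflow in Python, using simulated scores and outcomes (the variable names, numbers, and the linear relationship are illustrative assumptions, not from any real study):

```python
# Predictive-validity check: correlate scores collected now with an
# outcome measured later. All data here is simulated for illustration.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
n = 200

# Time 1: test scores (e.g., an entrance exam on a 200-800-style scale).
test_scores = rng.normal(loc=500, scale=100, size=n)

# Time 2, months later: the outcome, partly driven by whatever the test
# measures plus everything the test does not capture.
later_outcome = 0.004 * test_scores + rng.normal(loc=1.0, scale=0.5, size=n)

r, p = pearsonr(test_scores, later_outcome)
print(f"Predictive validity coefficient: r = {r:.2f} (p = {p:.3g})")
```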
The waiting period is what distinguishes predictive validity from its close cousin, concurrent validity. With concurrent validity, the test and the outcome are measured at roughly the same time. A depression questionnaire that correlates with a psychiatrist’s diagnosis given on the same day shows concurrent validity. That same questionnaire predicting a depression diagnosis six months down the road shows predictive validity. Both are useful, but they answer different questions. Concurrent validity tells you whether the test captures what’s happening right now. Predictive validity tells you whether it can see into the future.
Reading the Numbers
Predictive validity is typically expressed as a correlation coefficient, a number between -1 and +1. The closer the value is to +1 (or -1 for inverse relationships), the stronger the prediction. In practice, the benchmarks vary by field, but a commonly used framework in psychology considers correlations below 0.3 weak, those between 0.3 and 0.7 moderate, and those above 0.7 strong. In medicine, the thresholds tend to be stricter: a correlation of 0.5 might only be considered “fair.”
These numbers are rarely as high as people expect. A correlation of 0.5 means the test explains about 25% of the variation in the outcome, which sounds unimpressive until you consider how many factors influence something like college grades or job performance. In complex real-world predictions, even moderate correlations can be practically valuable when no single factor dominates.
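As a quick worked example, here is one way to read a coefficient in code: square it to get the share of outcome variance explained, and label its strength using the rough psychology benchmarks above (the 0.3 and 0.7 cutoffs are the assumed thresholds from this framework, and they vary by field):

```python
# Interpret a validity coefficient: square it for the share of outcome
# variance explained, and label its strength with rough benchmarks.
def interpret_validity(r: float) -> str:
    variance_explained = r ** 2
    strength = "weak" if abs(r) < 0.3 else "moderate" if abs(r) < 0.7 else "strong"
    return f"r = {r:.2f}: {strength}, explains {variance_explained:.0%} of outcome variance"

print(interpret_validity(0.5))   # moderate, about 25% of the variance
print(interpret_validity(0.53))  # the SAT-to-GPA figure discussed below
```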
SAT Scores and College Grades
One of the most studied examples of predictive validity involves the SAT. College Board research reports a correlation of 0.53 between SAT scores and first-year college GPA across a broad sample. That falls in the moderate range. The correlation holds fairly steady across demographic groups: 0.50 for underrepresented minority students, 0.52 for non-underrepresented minority students, 0.49 for first-generation college students, and 0.53 for non-first-generation students.
Institutional selectivity nudges the numbers slightly. At more selective private colleges, the SAT-to-GPA correlation reaches 0.60. For STEM majors specifically, it climbs to 0.63, the strongest subgroup correlation reported. These figures suggest the SAT has meaningful but limited predictive power. It captures something real about academic preparation, but at the overall correlation of 0.53 it leaves roughly 70% of the variation in first-year grades unexplained (0.53 squared is about 0.28), which is why most admissions offices combine test scores with high school GPA, essays, and other factors.
Depression Screening Tools
In healthcare, predictive validity determines whether a screening tool catches people who will go on to develop a condition. The PHQ-9, a widely used depression questionnaire, has been tested for its ability to predict post-stroke depression. A meta-analysis across studies found a pooled sensitivity of 0.84 and specificity of 0.90. In plain terms, the tool correctly identified 84% of people who later developed depression and correctly cleared 90% of those who didn’t.
The specific cutoff score matters. At a cutoff of 5 (a lower threshold that casts a wider net), sensitivity rose to 0.90 and specificity to 0.91. At a cutoff of 10 (a stricter threshold), sensitivity dropped to 0.77, with specificity at 0.85. Clinicians choose the cutoff based on whether they’d rather catch more true cases at the cost of some false alarms, or reduce false alarms at the risk of missing some cases.
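The trade-off is easy to see in a small simulation. The sketch below uses made-up screening scores and base rates rather than real PHQ-9 data, but it shows the mechanics: a lower cutoff flags more true cases, while a stricter cutoff gives some of that sensitivity back in exchange for fewer false alarms.

```python
# Simulated screening scores: how a cutoff trades sensitivity for specificity.
# The score distributions and 20% base rate are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# True later outcome: about 20% of this simulated cohort develops depression.
depressed = rng.random(n) < 0.20
# Screening scores tend to run higher for those who later develop depression.
scores = np.where(depressed, rng.normal(11, 4, n), rng.normal(4, 3, n)).clip(0, 27)

for cutoff in (5, 10):
    flagged = scores >= cutoff
    sensitivity = (flagged & depressed).sum() / depressed.sum()       # true cases caught
    specificity = (~flagged & ~depressed).sum() / (~depressed).sum()  # non-cases cleared
    print(f"cutoff {cutoff}: sensitivity {sensitivity:.2f}, specificity {specificity:.2f}")
```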
Hiring and Job Performance
Employers rely on predictive validity when choosing which assessments to use in hiring. Cognitive ability tests have decades of research behind them, and structured interviews and work-sample tests consistently add predictive power beyond what a single cognitive test captures. This concept is called incremental validity: how much additional prediction a new measure provides on top of what you’re already using.
The most common way to test incremental validity is hierarchical regression, where researchers add one predictor at a time and check whether the overall prediction improves meaningfully. A personality inventory might correlate only modestly with job performance on its own, but if it captures something a cognitive test misses, like conscientiousness or emotional stability, it adds incremental validity to the hiring model. The practical takeaway for organizations is that combining different types of assessments almost always predicts better than relying on any single tool.
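A sketch of that kind of check with simulated data and scikit-learn (the predictor names and effect sizes are invented for illustration, not estimates from any real hiring study):

```python
# Incremental validity via a two-step (hierarchical) regression:
# fit the cognitive test alone, then add a personality measure and
# compare R^2. All data is simulated.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500

# Simulated predictors and outcome: performance depends on both, but the
# personality trait carries signal the cognitive score misses.
cognitive = rng.normal(size=n)
conscientiousness = rng.normal(size=n)
performance = 0.5 * cognitive + 0.3 * conscientiousness + rng.normal(scale=0.8, size=n)

# Step 1: cognitive test only.
X1 = cognitive.reshape(-1, 1)
r2_step1 = LinearRegression().fit(X1, performance).score(X1, performance)

# Step 2: add the personality measure and see whether prediction improves.
X2 = np.column_stack([cognitive, conscientiousness])
r2_step2 = LinearRegression().fit(X2, performance).score(X2, performance)

print(f"R^2 with cognitive test alone: {r2_step1:.3f}")
print(f"R^2 after adding personality:  {r2_step2:.3f}")
print(f"Incremental validity (delta R^2): {r2_step2 - r2_step1:.3f}")
```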
Why Predictive Validity Gets Underestimated
One of the most common problems in predictive validity research is range restriction. This happens when you can only measure outcomes for the people who were selected, not the ones who were screened out. If a medical school admits students based partly on an entrance exam, researchers can only track the grades of admitted students. The rejected applicants, who presumably scored lower, never generate outcome data. This artificially shrinks the range of scores in the study and pushes the correlation coefficient downward.
The effect is especially pronounced in highly selective settings. When only the top 10% of applicants are admitted, the admitted group looks relatively similar on the predictor, so the test appears to have weak predictive power even if it would show strong validity across the full applicant pool. Researchers have known about this problem for over half a century, yet many studies still report uncorrected correlations. Statistical corrections for range restriction exist, but they require assumptions about the distribution of the full applicant pool, which introduces its own uncertainties.
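The shrinkage itself is easy to demonstrate with simulated data: generate a predictor with a known relationship to the outcome, then recompute the correlation after keeping only the top 10% of applicants. The numbers below are invented purely to show the mechanism.

```python
# Range restriction: selecting on the predictor shrinks the observed
# validity coefficient even though the underlying relationship is unchanged.
import numpy as np

rng = np.random.default_rng(7)
n = 10_000

predictor = rng.normal(size=n)
outcome = 0.6 * predictor + rng.normal(scale=0.8, size=n)  # true r is about 0.6

full_r = np.corrcoef(predictor, outcome)[0, 1]

# Admit only the top 10% on the predictor, as a highly selective program would.
admitted = predictor >= np.quantile(predictor, 0.90)
restricted_r = np.corrcoef(predictor[admitted], outcome[admitted])[0, 1]

print(f"Correlation in full applicant pool: {full_r:.2f}")
print(f"Correlation among admitted only:    {restricted_r:.2f}")  # noticeably smaller
```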
A second challenge is criterion contamination, where the outcome measure itself is biased. If job performance is rated by supervisors who already know an employee’s test scores, their ratings may be unconsciously influenced by that knowledge. Ratings can also be distorted by personal likability, stereotype-based assumptions, or a tendency to rate everyone in the middle of the scale. All of these errors add noise that weakens the apparent link between predictor and outcome, making the test look less valid than it actually is.
Predictive vs. Concurrent vs. Content Validity
Predictive validity is one piece of a larger validity framework. Concurrent validity checks whether a new test agrees with an established measure given at the same time. If you develop a new anxiety questionnaire and it correlates highly with an existing gold-standard anxiety scale administered the same day, it shows concurrent validity. This is faster and cheaper to establish than predictive validity because there’s no waiting period, but it can’t tell you anything about future outcomes.
Content validity, commonly discussed alongside the criterion-based types, asks whether the test covers the right material. A final exam in biology has content validity if it tests the topics actually taught in the course. This is typically evaluated through expert judgment rather than statistical correlation.
In practice, a well-designed test needs multiple types of validity evidence. The SAT could have strong content validity (testing relevant academic skills) and moderate predictive validity (correlating with college GPA) but weak concurrent validity with, say, a completely different kind of aptitude measure. Each type of validity answers a distinct question about what the test does and how well it does it.

