An empirically derived test is a psychological assessment built by selecting questions based on how well they actually distinguish between groups of people, rather than on any theory about what the questions should measure. If a question reliably separates, say, people diagnosed with depression from those without it, the question stays on the test. If it doesn’t, it gets cut. The logic is straightforward: keep what works, discard what doesn’t, and worry about explaining why later.
How Empirical Tests Are Built
The construction process, often called empirical criterion keying, starts with two groups. One is the “criterion group,” people who share a known characteristic the test is trying to measure. The other is the “control group,” people who don’t share that characteristic. Both groups answer a large pool of questions, and researchers then compare how the two groups responded to each item.
Questions where the two groups answer differently get kept. Questions where both groups answer the same way get thrown out. In practice, researchers typically correlate each item with the outcome they care about and retain items that hit a minimum threshold (often a correlation of 0.10 or higher in either direction). The surviving items form the final test scale. This entire process is driven by data, not by a psychologist’s intuition about which questions “should” be relevant.
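The selection step can be sketched in a few lines of Python. This is a toy illustration, not any published scoring procedure: the data, the `select_items` helper, and the 0.10 cutoff applied to a point-biserial correlation (a Pearson correlation where one variable is 0/1 group membership) are all assumptions for the sake of the example.

```python
from statistics import mean, pstdev

def point_biserial(xs, ys):
    """Pearson correlation between two equal-length numeric lists.
    With ys coded 0/1 for group membership, this is the point-biserial r."""
    mx, my = mean(xs), mean(ys)
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx, sy = pstdev(xs), pstdev(ys)
    return cov / (sx * sy) if sx and sy else 0.0

def select_items(responses, group_labels, threshold=0.10):
    """responses: one list of item answers (1=True, 0=False) per respondent.
    group_labels: 1 = criterion group, 0 = control group.
    Returns indices of items whose answers correlate with group membership."""
    kept = []
    for i in range(len(responses[0])):
        item_answers = [r[i] for r in responses]
        if abs(point_biserial(item_answers, group_labels)) >= threshold:
            kept.append(i)
    return kept

# Two items, six respondents: item 0 separates the groups, item 1 does not.
responses = [[1, 1], [1, 1], [1, 0], [0, 1], [0, 1], [0, 0]]
labels = [1, 1, 1, 0, 0, 0]
print(select_items(responses, labels))  # -> [0]: item 0 survives, item 1 is cut
```

Note that nothing in the loop asks what an item is about; an item about reading habits and an item about sleep would be judged by exactly the same statistic.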
This stands in contrast to the rational or theoretical approach, where test developers start with a psychological construct (like extraversion or anxiety), identify the traits that define it, and then write questions specifically designed to tap into those traits. In that model, every question has a logical reason for being on the test. With empirical derivation, the reason a question works can be completely opaque. What matters is that it predicts the outcome.
The MMPI: The Classic Example
The most famous empirically derived test is the Minnesota Multiphasic Personality Inventory, or MMPI, introduced by Starke Hathaway and J.C. McKinley in 1940 at the University of Minnesota. They built it to assess clinical symptoms by differentiating people with mental health conditions from those without them.
The developers started with a massive pool of true/false statements and administered them to psychiatric patients with specific diagnoses (the criterion groups) and to visitors at the University of Minnesota hospital who had no psychiatric diagnosis (the control group). For each clinical scale, only items that actually discriminated between the relevant patient group and the control sample made the cut. The result was a test where some items seem, on the surface, to have nothing to do with the condition they help identify. An item about reading habits might end up on a scale for social introversion, not because reading is theoretically linked to introversion, but because the data showed introverted people answered it differently.
The intellectual roots of this approach go back even further. As early as 1938, researchers at Minnesota argued against the rational method of scale development because they believed some items could predict outcomes without any obvious content connection. Their position was simple: items needed proven utility before earning a spot on a scale.
The Strong Interest Inventory
Vocational testing offers another clear illustration. E.K. Strong Jr. developed the Strong Vocational Interest Blank in 1927 using what he called the “method of contrast groups.” He gave a large set of questions about likes and dislikes to satisfied members of a particular occupation and compared their responses to those of a general sample of employed adults across many fields.
Only items where the occupational group responded differently from the general group were kept for that occupation’s scale. Your score on, say, the engineering scale reflected how closely your pattern of interests matched the pattern of people who were happily working as engineers. The underlying assumption was practical: people in the same occupation tend to share interests, and if your interests match theirs, you’d probably find satisfaction in that career too.
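A minimal sketch of how such a contrast-group scale might be scored, under simplifying assumptions (the `build_key` and `scale_score` helpers and the tiny dataset are invented; Strong's actual keys used weighted items, not a simple majority match):

```python
def build_key(occupation_responses):
    """Record the answer (1=like, 0=dislike) the occupational group
    favored on each retained item."""
    n = len(occupation_responses)
    key = []
    for i in range(len(occupation_responses[0])):
        likes = sum(r[i] for r in occupation_responses)
        key.append(1 if likes * 2 >= n else 0)
    return key

def scale_score(answers, key):
    """A test-taker's scale score: how many answers match the key."""
    return sum(a == k for a, k in zip(answers, key))

# Invented like/dislike responses from three satisfied engineers.
engineers = [[1, 0, 1], [1, 0, 1], [1, 1, 1]]
key = build_key(engineers)          # majority pattern: [1, 0, 1]
print(scale_score([1, 0, 1], key))  # -> 3, interests match the group
print(scale_score([0, 1, 0], key))  # -> 0, opposite pattern of interests
```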
Why Some Questions Seem Unrelated
One of the most distinctive features of empirically derived tests is that individual questions often lack face validity. A question about your favorite season might appear on a scale measuring a personality trait it has no obvious connection to. This happens because the method doesn’t care about surface logic. It cares about statistical relationships.
This can actually be useful. When test questions don’t have an obvious “right” answer, it becomes harder for someone to fake their way to a desired result. If you’re taking a hiring assessment and you can’t tell which answer makes you look better, you’re more likely to answer honestly. That said, research has shown this protection has limits. In one study, 94 people completed an empirically derived selection test under both honest and fake instructions. When told to simulate a highly motivated job applicant, participants produced significantly higher scores on 7 of 10 scales, enough to change their predicted job performance profiles.
Strengths of the Empirical Approach
The biggest advantage is predictive power. Because items are selected specifically for their ability to distinguish between groups, empirically derived tests tend to maximize prediction of the outcomes they target. One comparison of test construction methods found that selecting items to maximize prediction yielded a predictive validity of 0.40, compared to 0.33 when items were selected using a measurement-focused approach. That’s roughly an 18% drop in the validity coefficient when you optimize for internal consistency rather than real-world prediction.
Interestingly, the same comparison showed that high internal reliability (the standard measure of whether test items hang together as a coherent scale) isn’t a prerequisite for strong prediction. The prediction-optimized scale had notably lower reliability than the measurement-optimized one, yet it predicted the outcome just as well as the full original questionnaire. This is a core insight of the empirical philosophy: coherence and prediction are different goals, and sometimes they pull in opposite directions.
Key Limitations
The most persistent criticism is that empirically derived tests are atheoretical. It’s often difficult to understand why certain items ended up on a scale. This isn’t just an intellectual inconvenience. Without a theoretical framework explaining why items work, it’s harder to know whether they’ll continue working when you move to a different population, a different time period, or a different cultural context.
Related to this is the problem of overfitting. Because the method capitalizes on every statistical relationship between items and the criterion in the development sample, some of those relationships are due to chance. Initial validity numbers are inflated as a result. When the test is applied to a new group of people, performance typically drops. Research on cross-validation has consistently shown that models evaluated only on the data they were built from produce “highly overoptimistic” estimates of accuracy. This is why any well-constructed empirical test needs to be cross-validated on an independent sample before it can be trusted.
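The overfitting problem is easy to demonstrate with synthetic data. In the sketch below (all data and helper names invented), every item is pure noise with no real relationship to the criterion. Yet if we key the items that happen to correlate with the criterion in the development sample, the resulting scale looks impressively valid there, while predicting essentially nothing in an independent sample:

```python
import random
from statistics import mean, pstdev

def corr(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    mx, my = mean(xs), mean(ys)
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx, sy = pstdev(xs), pstdev(ys)
    return cov / (sx * sy) if sx and sy else 0.0

random.seed(1)
n_people, n_items = 60, 200

def draw_sample():
    """Random true/false answers and a random criterion: no real signal."""
    items = [[random.randint(0, 1) for _ in range(n_items)]
             for _ in range(n_people)]
    criterion = [random.randint(0, 1) for _ in range(n_people)]
    return items, criterion

dev_items, dev_crit = draw_sample()      # development sample
new_items, new_crit = draw_sample()      # independent sample

# Key every item that "works" in the development sample, with its direction.
key = []
for i in range(n_items):
    r = corr([p[i] for p in dev_items], dev_crit)
    if abs(r) >= 0.15:
        key.append((i, 1 if r > 0 else -1))

def scale(items):
    """Sum of keyed items, each counted in its keyed direction."""
    return [sum(sign * p[i] for i, sign in key) for p in items]

dev_validity = corr(scale(dev_items), dev_crit)
new_validity = corr(scale(new_items), new_crit)
print(round(dev_validity, 2))  # looks substantial, but is pure chance
print(round(new_validity, 2))  # collapses on the independent sample
```

The gap between the two printed values is exactly what cross-validation is designed to expose.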
The sample size requirement is also significant. You need enough people in both your criterion and control groups to produce stable statistical relationships. Small or unrepresentative samples lead to keys that don’t generalize. And because the method is entirely dependent on the quality of the criterion groups, any errors in how those groups are defined (misdiagnoses in a clinical sample, for instance) flow directly into the test.
Empirical Methods in Modern Testing
The logic behind empirical test derivation has found a natural home in machine learning. Modern algorithms do essentially the same thing at massive scale: they identify patterns in data that predict outcomes, without requiring a theory about why those patterns exist. The machine learning approach treats the underlying data-generating mechanism as unknown and focuses on predictive accuracy, relying on general-purpose learning algorithms to find patterns in large, complex datasets.
The parallel is striking. Just as a 1940s MMPI item might predict depression without any obvious content link, a modern algorithm might identify combinations of variables that predict a clinical outcome in ways no human would have hypothesized. The same trade-off applies too. These models are powerful predictors but can be difficult to interpret, and they require careful validation to ensure they work beyond the data they were trained on. Leave-one-source-out cross-validation, where models are tested on entirely new data sources rather than random splits of the original data, has emerged as the more reliable way to estimate real-world performance.
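The splitting logic behind leave-one-source-out cross-validation is simple enough to sketch directly (the `leave_one_source_out` helper and the clinic-labeled records are invented for illustration; libraries such as scikit-learn provide an equivalent `LeaveOneGroupOut` splitter):

```python
def leave_one_source_out(records):
    """records: list of (source_id, features, label) tuples.
    Yields one fold per source, holding that entire source out as the
    test set, so the model is always scored on a source it never saw."""
    sources = sorted({source for source, _, _ in records})
    for held_out in sources:
        train = [r for r in records if r[0] != held_out]
        test = [r for r in records if r[0] == held_out]
        yield held_out, train, test

# Invented example: data pooled from three clinics.
records = [
    ("clinic_A", [1, 0], 1), ("clinic_A", [0, 1], 0),
    ("clinic_B", [1, 1], 1), ("clinic_C", [0, 0], 0),
]
for source, train, test in leave_one_source_out(records):
    print(source, len(train), len(test))
# -> clinic_A 2 2
# -> clinic_B 3 1
# -> clinic_C 3 1
```

Contrast this with a random split, where answers from the same clinic can land in both training and test sets, letting source-specific quirks inflate the apparent accuracy.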
Whether the tool is a mid-century personality inventory or a contemporary neural network, the core principle is the same: let the data decide what predicts, and hold the result to strict standards of replication before trusting it.