Concurrent validity measures how well a new test or tool agrees with an established one when both are administered at the same time. It’s a subtype of criterion validity, which broadly asks whether a test’s scores match up with some meaningful outside measure. The “concurrent” part refers specifically to timing: both measurements happen in the same session or time period, and the results are compared to see how closely they align.
How Concurrent Validity Works
The core idea is straightforward. You have a new measurement tool, and you want to know if it actually measures what it claims to measure. To find out, you give it alongside an existing tool that’s already trusted for measuring that same thing. Then you compare the two sets of scores.
That trusted tool is often called the “gold standard” or reference standard. For example, if you’ve designed a short questionnaire to assess skin pigmentation on the face, you’d compare its results against an objective device like a colorimeter that’s already validated for that purpose. If both tools produce similar results across a group of people, your new questionnaire has good concurrent validity.
Another example: researchers developed a set of quick physical tests to screen for early mobility decline in older adults. To check concurrent validity, they gave those tests alongside well-established reference measures like the 6-Minute Walk Test, the Stair Climbing Test, and the 30-Second Chair Rise Test. The correlations between the new screening tools and the reference tests ranged from 0.38 to 0.61, indicating moderate agreement.
The Statistical Side
Concurrent validity is expressed as a correlation coefficient, typically labeled “r” or “rho.” This number falls between -1 and +1 and tells you how closely two sets of scores move together. The closer to 1 (or -1, for inverse relationships), the stronger the agreement.
The specific statistical method depends on the type of data. When both measures produce continuous data (like height or reaction time), Pearson’s correlation coefficient is the standard choice. When one or both measures use ranked or ordinal scales, such as a 0-to-10 pain rating, Spearman’s rank correlation is more appropriate because equal-looking gaps on those scales don’t necessarily represent equal changes in the real experience.
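As a minimal sketch of how that choice looks in practice (the scores below are made up for illustration, not data from any study), SciPy's `pearsonr` and `spearmanr` cover both cases:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical scores: a new tool and an established reference,
# both administered to the same eight people in one session.
new_tool = np.array([12, 15, 9, 20, 17, 11, 14, 18])
reference = np.array([34, 41, 28, 55, 47, 30, 39, 50])

# Continuous data on both sides: Pearson's r.
r, p_r = pearsonr(new_tool, reference)

# Ordinal or ranked data: Spearman's rho compares ranks
# rather than raw values.
rho, p_rho = spearmanr(new_tool, reference)

print(f"Pearson r = {r:.2f} (p = {p_r:.3f})")
print(f"Spearman rho = {rho:.2f} (p = {p_rho:.3f})")
```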
General benchmarks for interpreting the result (a small helper encoding these cutoffs follows the list):
- Strong (r above 0.7): High concurrent validity. The new measure is capturing the same construct as the established one.
- Moderate (r between 0.4 and 0.7): Acceptable concurrent validity. The tools overlap meaningfully but aren’t interchangeable.
- Weak (r below 0.4): Poor concurrent validity. The new measure may not be assessing the same thing.
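As a rough sketch, those cutoffs translate directly into code. The labels are shorthand for the benchmarks above, not a universal standard:

```python
def interpret_r(r: float) -> str:
    """Map a correlation to the rough benchmarks listed above."""
    strength = abs(r)  # direction doesn't affect strength
    if strength > 0.7:
        return "strong: high concurrent validity"
    if strength >= 0.4:
        return "moderate: overlaps meaningfully, not interchangeable"
    return "weak: may not be assessing the same construct"

print(interpret_r(0.61))  # moderate: overlaps meaningfully, not interchangeable
```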
Concurrent vs. Predictive Validity
Both concurrent and predictive validity fall under the umbrella of criterion validity, and both use correlations to compare a test against an outside criterion. The difference is timing. In concurrent validity, you collect both sets of scores at the same point in time. In predictive validity, you collect the test score now and the criterion score later, sometimes months or years down the road.
A study on adolescent marijuana use illustrates the distinction clearly. Researchers measured readiness to change using several tools. For concurrent validity, they compared those tools against marijuana involvement metrics collected at the same assessment. For predictive validity, they used the same readiness scores to predict marijuana use 6 and 12 months later. The concurrent analysis asks “does this tool agree with reality right now?” while the predictive analysis asks “does this tool forecast what happens next?”
Concurrent validity is generally faster and cheaper to establish because you don’t need to follow participants over time. That’s a practical advantage, but it also means you can’t draw conclusions about whether the tool predicts future outcomes.
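A short sketch makes the timing difference concrete. Everything here is hypothetical, and the only thing that changes between the two analyses is when the criterion was collected:

```python
from scipy.stats import spearmanr

# Hypothetical baseline readiness scores, plus a criterion
# (involvement) measured at two time points.
readiness_baseline = [3, 7, 5, 8, 2, 6, 4, 9]
involvement_baseline = [40, 22, 30, 18, 45, 25, 33, 15]  # same session
involvement_6mo = [38, 20, 32, 15, 47, 27, 30, 12]       # six months later

# Concurrent validity: test and criterion from the same session.
concurrent, _ = spearmanr(readiness_baseline, involvement_baseline)

# Predictive validity: same test scores, criterion collected later.
predictive, _ = spearmanr(readiness_baseline, involvement_6mo)

print(f"concurrent rho = {concurrent:.2f}, predictive rho = {predictive:.2f}")
```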
What Counts as a Gold Standard
Choosing the right comparison measure is one of the trickiest parts of establishing concurrent validity. The whole analysis depends on the assumption that your reference tool is a trustworthy measure of the construct. If it isn’t, the correlation is meaningless regardless of how high it is.
An international panel of measurement experts (the COSMIN group) reached an important consensus on this point: for self-reported health outcomes, true gold standards essentially don’t exist. The one exception they identified is when a shortened version of a questionnaire is compared to its original long version. In that case, the long version can reasonably serve as the gold standard.
Outside that narrow exception, what researchers often call a “gold standard” is really just another well-established instrument. Comparing a new tool to a widely used questionnaire is more accurately described as construct validation, where you’re testing whether the two tools behave the way you’d expect if they measured the same thing. The COSMIN guidelines recommend that researchers state specific hypotheses in advance about the expected direction and strength of the correlation, rather than simply reporting whatever number they get.
Common Pitfalls
One issue is criterion contamination, where the criterion measure is influenced by factors unrelated to what you’re trying to measure. In education research, for example, using college GPA as a criterion for SAT validity is complicated by the fact that students choose different courses. Differences in course difficulty can inflate or deflate the apparent validity of the test. The criterion itself has noise baked in.
Range restriction is another problem. If your study sample doesn't include the full range of ability or severity that the tool is designed to measure, the observed correlation will typically come out weaker than it would be in an unrestricted sample. Testing a depression screening tool only on people with moderate symptoms, for instance, would obscure how well the tool distinguishes mild from severe cases.
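A quick simulation illustrates the attenuation. The true correlation is set to 0.7 here by assumption, and the exact numbers will vary with the seed:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)

# Simulate test and criterion scores with a true correlation of 0.7.
cov = [[1.0, 0.7], [0.7, 1.0]]
test, criterion = rng.multivariate_normal([0, 0], cov, size=5000).T

# Full range: the observed r lands near the true 0.7.
r_full, _ = pearsonr(test, criterion)

# Restricted range: keep only the middle band of test scores,
# as if mild and severe cases were excluded from the sample.
middle = (test > -0.5) & (test < 0.5)
r_restricted, _ = pearsonr(test[middle], criterion[middle])

print(f"full range r = {r_full:.2f}")        # close to 0.70
print(f"restricted r = {r_restricted:.2f}")  # noticeably weaker
```

With that band, the restricted correlation comes out somewhere around 0.3, even though the underlying relationship never changed.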
There’s also the temptation to skip the hypothesis step. Running a correlation and then interpreting it after the fact makes it easy to rationalize a mediocre result. The COSMIN framework specifically calls for researchers to decide ahead of time what direction the correlation should go (positive or negative) and roughly how strong it should be. This keeps the validation process honest.
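One lightweight way to honor that recommendation, sketched here with made-up bounds, is to write the hypothesis down as data before running the analysis:

```python
# Pre-specified before seeing the data: expected direction and
# rough strength (these bounds are made up for illustration).
hypothesis = {"direction": "positive", "min_r": 0.4, "max_r": 0.8}

def hypothesis_supported(observed_r: float, h: dict) -> bool:
    """Check an observed correlation against pre-registered bounds."""
    if h["direction"] == "positive" and observed_r <= 0:
        return False
    if h["direction"] == "negative" and observed_r >= 0:
        return False
    return h["min_r"] <= abs(observed_r) <= h["max_r"]

print(hypothesis_supported(0.55, hypothesis))  # True
print(hypothesis_supported(0.12, hypothesis))  # False: weaker than expected
```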
How Researchers Establish It in Practice
The typical process follows a logical sequence. First, you design or select the new instrument you want to validate. Then you identify an appropriate reference measure that’s already accepted for the same construct. You pilot test the new instrument and refine it based on feedback. Once it’s finalized, you administer both instruments to the same group of people at the same time, then calculate the correlation between the two sets of scores.
Sample size matters. Small samples produce unstable correlation estimates that can look misleadingly strong or weak. The mobility decline study mentioned earlier used 114 to 115 participants, which is fairly typical for this kind of validation work. Larger samples give more reliable estimates, particularly when you expect moderate rather than strong correlations.
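A small simulation makes that instability visible. The true correlation is fixed at 0.5 by assumption, and exact numbers depend on the seed:

```python
import numpy as np

rng = np.random.default_rng(7)
cov = [[1.0, 0.5], [0.5, 1.0]]  # true correlation fixed at 0.5

def observed_r_spread(n: int, trials: int = 2000) -> tuple[float, float]:
    """Range covering 95% of observed correlations at sample size n."""
    rs = []
    for _ in range(trials):
        x, y = rng.multivariate_normal([0, 0], cov, size=n).T
        rs.append(np.corrcoef(x, y)[0, 1])
    return float(np.percentile(rs, 2.5)), float(np.percentile(rs, 97.5))

for n in (20, 115, 500):
    low, high = observed_r_spread(n)
    print(f"n = {n:3d}: 95% of observed r fall in [{low:.2f}, {high:.2f}]")
```

At n = 20 the observed coefficient can swing from near zero to well above the true value; by n = 115 the spread tightens considerably.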
The results are often presented as a correlation matrix showing how each subscale or component of the new tool relates to each component of the reference. In the mobility study, individual test-by-test correlations ranged from non-significant (the Global Physical Health questionnaire vs. some physical performance tests) to strong (the Lower Extremity Functional Scale vs. the 25-question Geriatric Locomotive Function Scale at rho = 0.68). That kind of granularity helps identify which parts of a new tool work well and which need improvement.
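Once the scores sit in a table, producing that matrix is a one-liner; the column names below are hypothetical stand-ins for a new tool's subscales and the reference tests:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical scores from the same 30 participants. With purely
# random data the correlations will hover near zero; the point
# here is the shape of the matrix, not the values.
df = pd.DataFrame({
    "new_subscale_A": rng.normal(size=30),
    "new_subscale_B": rng.normal(size=30),
    "ref_walk_test": rng.normal(size=30),
    "ref_chair_rise": rng.normal(size=30),
})

matrix = df.corr(method="spearman")
print(matrix.loc[["new_subscale_A", "new_subscale_B"],
                 ["ref_walk_test", "ref_chair_rise"]])
```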