Do Selection Tests Work Better Than We Think? A Numbers Debate

The Context

This post covers two connected pieces from the same issue of Industrial and Organizational Psychology: a focal article by Foster and colleagues (2024) and a commentary by Oh and Le (2024). Together they illustrate a recurring pattern in psychometric research: a methodologically clever reframing of familiar numbers that promises to change how we interpret selection test validity, followed by a measured but firm pushback from researchers who think the reframing creates more problems than it solves.

What Foster and Colleagues Propose

Foster and colleagues (2024) argue that the conventional way of evaluating selection test effectiveness systematically underestimates how well tests actually work. Their proposed correction is specific: rather than using observed correlations or corrected validities as the primary index of effectiveness, researchers and practitioners should square the observed correlation and then multiply the result by 4, which is the same as dividing it by one quarter.

The logic behind this is that job performance ratings contain multiple sources of variance, and only about one quarter of the total variance in performance ratings reflects genuine differences between ratees, the people being assessed. The remaining three quarters is noise: variance attributable to raters, occasions, and measurement error. If the validity coefficient is being evaluated against total performance variance rather than just the ratee-relevant portion, the argument goes, we are dividing by the wrong denominator and systematically undervaluing the test (Foster et al., 2024).

Applying the correction produces striking numbers. A selection procedure with an observed validity of .20 would conventionally be said to explain 4 percent of variance in performance. Under the Foster correction, it explains 16 percent of ratee-relevant variance. A validity of .40 would jump from explaining 16 percent to 64 percent. The conclusion Foster and colleagues draw is captured in their title: selection tests work better than we think they do, and have for years.
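The arithmetic in the two examples above can be reproduced with a short sketch. This is only an illustration of the calculation as described in this post, not code from Foster and colleagues; the function names and the constant RATEE_VARIANCE_SHARE are hypothetical labels introduced here.

```python
# Assumed share of performance-rating variance that reflects genuine ratee
# differences (the "one quarter" figure described in the post).
RATEE_VARIANCE_SHARE = 0.25


def variance_explained(observed_r: float) -> float:
    """Conventional index: proportion of total rating variance explained."""
    return observed_r ** 2


def ratee_variance_explained(
    observed_r: float, ratee_share: float = RATEE_VARIANCE_SHARE
) -> float:
    """Reframed index: r squared divided by the ratee share of variance."""
    if not 0 < ratee_share <= 1:
        raise ValueError("ratee_share must be in the interval (0, 1]")
    return observed_r ** 2 / ratee_share


for r in (0.20, 0.40):
    total = variance_explained(r)
    ratee = ratee_variance_explained(r)
    print(f"r = {r:.2f}: {total:.0%} of total variance, "
          f"{ratee:.0%} of ratee-relevant variance")
```

With the default share of one quarter, dividing by 0.25 is the same as multiplying by 4, so the loop reproduces the 4 to 16 percent and 16 to 64 percent jumps quoted above.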

What Oh and Le Argue

Oh and Le (2024) agree with the broad conclusion that selection tests are undervalued, but they raise significant conceptual and methodological objections to how Foster and colleagues get there.

Their core argument is that operational validity coefficients, the standard correlation-based indices currently used to evaluate selection procedures, remain the appropriate tool for the job. The proposed correction, while mathematically coherent on its own terms, involves redefining what is being predicted in a way that creates practical and interpretive problems (Oh & Le, 2024).

The practical issue is this: when organisations use selection tests, they are trying to predict actual job performance as it is observed and rated in the real world, total variance included. A statistic that tells you how well a test predicts the ratee-relevant portion of performance variance is answering a different question than the one practitioners need answered. It is technically interesting but operationally misleading, because the noise that Foster and colleagues propose to exclude is still present in the performance ratings that managers give and that organisations act on (Oh & Le, 2024).

There are also methodological concerns. The correction assumes that the one quarter figure for ratee main effects is stable and generalisable across jobs, organisations, and rating contexts. If that assumption does not hold, and there are reasons to think it will vary considerably, then multiplying by 4 in some contexts will produce figures that are not just reframed but genuinely inaccurate (Oh & Le, 2024).
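To see why the stability of the one quarter figure matters, the sketch below compares the fixed multiply-by-4 rule with the figure you would get if the ratee share of variance were different. The alternative shares (0.15 and 0.40) are hypothetical values chosen purely for illustration; they are not estimates from either article, and the function name is an assumption of this sketch.

```python
def compare_fixed_vs_actual(observed_r: float, actual_ratee_share: float) -> dict:
    """Compare the fixed x4 rule with a correction using a different ratee share.

    actual_ratee_share is a hypothetical "true" proportion of rating variance
    due to ratees in some particular context.
    """
    if not 0 < actual_ratee_share <= 1:
        raise ValueError("actual_ratee_share must be in the interval (0, 1]")
    r_squared = observed_r ** 2
    fixed_rule = r_squared * 4                     # assumes a share of exactly 0.25
    context_specific = r_squared / actual_ratee_share
    return {
        "fixed_rule": fixed_rule,
        "context_specific": context_specific,
        "overstatement": fixed_rule - context_specific,
    }


# Hypothetical shares, for illustration only.
for share in (0.15, 0.25, 0.40):
    result = compare_fixed_vs_actual(0.20, share)
    print(f"share = {share:.2f}: fixed rule {result['fixed_rule']:.1%}, "
          f"context-specific {result['context_specific']:.1%}, "
          f"difference {result['overstatement']:+.1%}")
```

Under these assumed values, the fixed rule matches the context-specific figure only when the share really is 0.25. It overstates the result when the true share is larger and understates it when the true share is smaller, which is the kind of inaccuracy the commentary warns about.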

Why This Matters Beyond the Technicalities

At one level this is an argument about denominators, and Paul’s characterisation of it as numbers play is not entirely unfair. But there is something substantive underneath the arithmetic.

The validity of selection tests has real consequences. It determines whether organisations use them, how much weight they are given relative to other selection methods, and how they are defended in legal and policy contexts. The Sackett and colleagues (2024) meta-analysis reviewed earlier in this series found that the corrected validity of general cognitive ability for job performance in contemporary data is around .22, considerably lower than the .51 figure that dominated the field for decades. If Foster and colleagues’ reframing were applied to that figure, .22 would be squared to give roughly .05 and then multiplied by 4 to give roughly .19. The test would then be described as explaining about 19 percent of ratee-relevant variance rather than about 5 percent of total variance, a figure that looks considerably more impressive but may not be answering the question organisations actually need answered.

Oh and Le (2024) are essentially arguing that clarity and practical utility matter more than finding a framing that makes the numbers look better. Selection tests are useful. The existing validity coefficients, understood correctly, already demonstrate that. Adjusting the denominator to inflate the apparent effect size does not make the tests more useful; it risks making the evidence harder to interpret and easier to misuse.

References

Foster, J., Steel, P., Harms, P., O’Neill, T. O., & Wood, D. (2024). Selection tests work better than we think they do, and have for years. Industrial and Organizational Psychology. Advance online publication. https://doi.org/10.1017/iop.2024.10

Oh, I.-S., & Le, H. (2024). Operational validity/correlation coefficients are still valid for evaluating selection procedure effectiveness. Industrial and Organizational Psychology. Advance online publication. https://doi.org/10.1017/iop.2024.13