The Question

Job performance ratings by supervisors are the most common criterion measure in personnel selection research. The validity coefficients used to evaluate selection tools, including all those discussed in this series, are typically correlations against supervisor ratings. If those ratings are unreliable, the validity estimates built on them are distorted. Zhou and colleagues (2024) conducted an updated meta-analysis of the interrater reliability of supervisory performance ratings to establish how much confidence the field should place in its primary outcome measure.

What They Found

Using an updated meta-analytic procedure that prevents large-sample studies from dominating results, the meta-analysis found an interrater reliability estimate of r = .65 across 132 independent samples (Zhou et al., 2024). This is higher than estimates from previous meta-analyses on the same question.

However, the headline figure conceals meaningful variation by job type. Interrater reliability was r = .57 for managerial positions and r = .68 for non-managerial positions, a difference substantial enough to matter practically (Zhou et al., 2024). The lower reliability for managerial roles likely reflects the greater complexity, discretion, and multidimensionality of managerial performance, which makes it harder for two supervisors rating the same manager to agree on how well that manager is performing.

The Practical Implication

The authors argue directly against the use of a single grand mean reliability figure for all validity corrections (Zhou et al., 2024). The correction for attenuation divides the observed validity by the square root of the criterion reliability, so a lower reliability produces a larger upward adjustment. Using an overall average of .65 to correct validity estimates for jobs whose true reliability is .57 will therefore systematically undercorrect and understate validity. Using it for jobs whose reliability is .68 will overcorrect and inflate apparent validity. The recommendation is to use job-specific or local reliabilities when making corrections for attenuation, a more demanding standard but one that produces more accurate estimates.
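
To see which way each mismatch pushes the estimate, here is a minimal sketch of the standard correction for criterion unreliability: corrected validity = observed validity / sqrt(criterion reliability). The three reliabilities are the figures reported above. The observed validity of .30 and the function name correct_for_attenuation are assumptions chosen for illustration and do not come from Zhou et al. (2024). The sketch corrects for criterion unreliability only and ignores range restriction.

```python
import math


def correct_for_attenuation(observed_r, criterion_reliability):
    """Correct an observed validity for unreliability in the criterion only.

    Classical formula: corrected = observed / sqrt(criterion reliability).
    """
    if not 0 < criterion_reliability <= 1:
        raise ValueError("criterion_reliability must be in (0, 1]")
    return observed_r / math.sqrt(criterion_reliability)


# Hypothetical observed validity; an illustrative value, not from the study.
OBSERVED_R = 0.30

# Interrater reliabilities as reported in the summary above.
RELIABILITIES = {
    "grand mean": 0.65,
    "managerial": 0.57,
    "non-managerial": 0.68,
}

corrected = {
    label: correct_for_attenuation(OBSERVED_R, r_yy)
    for label, r_yy in RELIABILITIES.items()
}

for label, value in corrected.items():
    print(f"{label:>15}: {value:.3f}")
```

With these inputs the corrected validity is about .397 using the managerial reliability, .372 using the grand mean, and .364 using the non-managerial reliability. Applying the grand mean to a managerial job therefore yields an estimate that is too low, and applying it to a non-managerial job yields one that is too high.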

Why It Matters in Context

This finding matters particularly in light of the ongoing debate about the validity of general cognitive ability and other selection predictors reviewed elsewhere in this series. If the criterion measure against which all predictors are validated is itself unreliable, and if that unreliability varies systematically by job type in ways that corrections do not adequately account for, then the corrected validity estimates the field has been relying on may be too high for some jobs and too low for others (Zhou et al., 2024).

Reference

Zhou, Y., Sackett, P., Shen, W., & Beatty, A. (2024). An updated meta-analysis of the interrater reliability of supervisory performance ratings. Journal of Applied Psychology. Advance online publication. https://doi.org/10.1037/apl0001174