Computing inter-rater reliability and its variance in the presence of high agreement
- 1 May 2008
- research article
- Published by Wiley in British Journal of Mathematical and Statistical Psychology
- Vol. 61 (1), 29-48
- https://doi.org/10.1348/000711006x126600
Abstract
Pi (π) and kappa (κ) statistics are widely used in psychiatry and psychological testing to quantify the extent of agreement between raters on nominally scaled data. These coefficients, however, occasionally yield unexpected results in situations known as the paradoxes of kappa. This paper explores the origin of these limitations and introduces an alternative, more stable agreement coefficient referred to as the AC1 coefficient. Also proposed are new variance estimators for the multiple-rater generalized π and AC1 statistics, whose validity does not depend on the assumption of independence between raters; this is an improvement over existing alternatives, which require that assumption. A Monte Carlo simulation study demonstrates the validity of these variance estimators for confidence interval construction, and confirms the value of AC1 as an improved alternative to existing inter-rater reliability statistics.
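The abstract contrasts three chance-corrected agreement coefficients without stating their formulas. As an illustration of the two-rater case only, the Python sketch below computes percent agreement alongside κ, π, and AC1; all names are illustrative, the input format (two equal-length lists of nominal labels) is an assumption, and the AC1 chance-agreement term uses the form p_e = Σ_k π_k(1 − π_k)/(q − 1), with π_k the mean of the two raters' marginal probabilities for category k and q the number of categories.

```python
from collections import Counter

def agreement_coefficients(ratings_a, ratings_b):
    """Percent agreement, Cohen's kappa, Scott's pi, and Gwet's AC1
    for two raters classifying the same subjects on a nominal scale
    (assumes at least two categories appear in the data)."""
    n = len(ratings_a)
    categories = sorted(set(ratings_a) | set(ratings_b))
    q = len(categories)

    # Observed agreement: share of subjects both raters classify identically.
    p_a = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Marginal classification probabilities for each rater, and their means.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    p_row = {k: count_a[k] / n for k in categories}
    p_col = {k: count_b[k] / n for k in categories}
    pi_k = {k: (p_row[k] + p_col[k]) / 2 for k in categories}

    # The three coefficients differ only in their chance-agreement term.
    pe_kappa = sum(p_row[k] * p_col[k] for k in categories)   # Cohen (1960)
    pe_pi = sum(pi_k[k] ** 2 for k in categories)             # Scott (1955)
    pe_ac1 = sum(pi_k[k] * (1.0 - pi_k[k]) for k in categories) / (q - 1)

    def corrected(p_e):
        # Generic chance-corrected agreement: (p_a - p_e) / (1 - p_e).
        return (p_a - p_e) / (1.0 - p_e)

    return {
        "percent_agreement": p_a,
        "kappa": corrected(pe_kappa),
        "pi": corrected(pe_pi),
        "AC1": corrected(pe_ac1),
    }

# 100 subjects with highly skewed prevalence: the raters agree on 90 of them.
rater_a = ["+"] * 95 + ["-"] * 5
rater_b = ["+"] * 90 + ["-"] * 5 + ["+"] * 5
print(agreement_coefficients(rater_a, rater_b))
# {'percent_agreement': 0.9, 'kappa': -0.0526..., 'pi': -0.0526..., 'AC1': 0.8895...}
```

The example at the end reproduces the kind of paradox the abstract refers to: observed agreement of 90% with strongly skewed marginals drives κ and π slightly below zero, while AC1 remains near 0.89.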
References
- Beyond kappa: A review of interrater agreement measures. The Canadian Journal of Statistics / La Revue Canadienne de Statistique, 1999.
- Integration and generalization of kappas for multiple raters. Psychological Bulletin, 1980.
- Kappa revisited. Psychological Bulletin, 1977.
- Measuring nominal scale agreement among many raters. Psychological Bulletin, 1971.
- Measures of response agreement for qualitative data: Some generalizations and alternatives. Psychological Bulletin, 1971.
- Large sample standard errors of kappa and weighted kappa. Psychological Bulletin, 1969.
- Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin, 1968.
- A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement, 1960.
- Reliability of Content Analysis: The Case of Nominal Scale Coding. Public Opinion Quarterly, 1955.