Methods of assessing categorical agreement between correlated screening tests in clinical studies

9 June 2020

journal article
review article
Published by Taylor & Francis Ltd in Journal of Applied Statistics

Vol. 48 (10), 1861-1881
https://doi.org/10.1080/02664763.2020.1777394

Abstract

Advances in breast imaging and other screening tests have prompted studies that evaluate and compare the consistency of experts' ratings of existing and new screening tests. In clinical settings, medical experts make subjective assessments of screening test results such as mammograms. Consistency between experts' ratings is evaluated by measures of inter-rater agreement or association. However, conventional measures, such as Cohen's and Fleiss' kappas, cannot be applied or may perform poorly when studies involve many experts or unbalanced data, or when dependencies exist between experts' ratings. Here we assess the performance of existing approaches, including recently developed summary measures, for assessing the agreement between experts' binary and ordinal ratings when patients undergo two screening procedures. Methods to assess consistency between repeated measurements by the same experts are also described. We present applications to three large-scale clinical screening studies. Properties of these agreement measures are illustrated via simulation studies. Generally, a model-based approach provides several advantages over alternative methods, including the ability to flexibly accommodate various measurement scales (i.e. binary or ordinal), large numbers of experts and patients, and sparse data, as well as robustness to the prevalence of the underlying disease.
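As a rough illustration of why prevalence matters for the conventional measures named in the abstract, the sketch below computes the standard two-rater Cohen's kappa, (observed agreement minus chance agreement) divided by (1 minus chance agreement), on two sets of binary ratings. It is not the model-based approach evaluated in the article. The counts and the helper names (`cohen_kappa`, `expand_counts`) are hypothetical and chosen only so that both data sets have the same raw agreement of 90%.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters who classified the same subjects."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    # Proportion of subjects on which the two raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / n**2
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1 - expected)


def expand_counts(both_pos, a_only, b_only, both_neg):
    """Turn 2x2 cell counts into paired rating lists (1 = positive)."""
    a = [1] * both_pos + [1] * a_only + [0] * b_only + [0] * both_neg
    b = [1] * both_pos + [0] * a_only + [1] * b_only + [0] * both_neg
    return a, b


# Hypothetical counts: both data sets have 90 agreements out of 100 subjects.
balanced = expand_counts(both_pos=45, a_only=5, b_only=5, both_neg=45)
rare = expand_counts(both_pos=5, a_only=5, b_only=5, both_neg=85)

kappa_balanced = cohen_kappa(*balanced)  # positives rated about half the time
kappa_rare = cohen_kappa(*rare)          # positives rated about 10% of the time

print(f"balanced prevalence: kappa = {kappa_balanced:.3f}")
print(f"low prevalence:      kappa = {kappa_rare:.3f}")
```

With these made-up counts, kappa is about 0.80 in the balanced case and about 0.44 in the low-prevalence case, even though the raters agree on 90% of subjects in both. This is the kind of prevalence sensitivity that the abstract contrasts with the model-based approach.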

Funding Information

United States National Institutes of Health (R01-CA226805, R01-CA172463)
