Methods of assessing categorical agreement between correlated screening tests in clinical studies
- 9 June 2020
- journal article
- review article
- Published by Taylor & Francis Ltd in Journal of Applied Statistics
- Vol. 48 (10), 1861-1881
- https://doi.org/10.1080/02664763.2020.1777394
Abstract
Advances in breast imaging and other screening tests have prompted studies that evaluate and compare the consistency between experts' ratings of existing and new screening tests. In clinical settings, medical experts make subjective assessments of screening test results such as mammograms. Consistency between experts' ratings is evaluated with measures of inter-rater agreement or association. However, conventional measures, such as Cohen's and Fleiss' kappas, cannot be applied or may perform poorly when studies involve many experts, unbalanced data, or dependencies between experts' ratings. Here we assess the performance of existing approaches, including recently developed summary measures, for evaluating the agreement between experts' binary and ordinal ratings when patients undergo two screening procedures. Methods to assess consistency between repeated measurements by the same experts are also described. We present applications to three large-scale clinical screening studies. Properties of these agreement measures are illustrated via simulation studies. Generally, a model-based approach provides several advantages over alternative methods, including the ability to flexibly handle various measurement scales (i.e., binary or ordinal), large numbers of experts and patients, and sparse data, as well as robustness to the prevalence of the underlying disease.
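To make the baseline concrete, here is a minimal Python sketch of Cohen's kappa for two raters, the conventional measure the abstract critiques. The rating vectors and variable names are hypothetical, chosen only for illustration; the sketch does not implement the article's model-based approach.

```python
# Minimal sketch of Cohen's kappa for two raters' binary ratings.
# The rating vectors are hypothetical and for illustration only; they
# are not drawn from the screening studies analyzed in the article.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rater_a = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])  # 1 = positive call
rater_b = np.array([1, 0, 1, 0, 0, 1, 0, 1, 1, 1])

# Observed proportion of agreement (not corrected for chance).
p_o = np.mean(rater_a == rater_b)

# Chance-expected agreement from each rater's marginal positive rate.
p1_a, p1_b = rater_a.mean(), rater_b.mean()
p_e = p1_a * p1_b + (1 - p1_a) * (1 - p1_b)

# Cohen's kappa: agreement beyond chance, kappa = (p_o - p_e) / (1 - p_e).
kappa_manual = (p_o - p_e) / (1 - p_e)
kappa = cohen_kappa_score(rater_a, rater_b)  # matches the manual value

print(f"p_o = {p_o:.2f}, p_e = {p_e:.2f}, kappa = {kappa:.3f}")
```

Note that with highly unbalanced marginals, as under low disease prevalence, the chance-agreement term p_e approaches 1, so kappa can be small even when raw agreement is high; this sensitivity to prevalence is one of the limitations that motivates the model-based approach favored in the abstract.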
Funding Information
- United States National Institutes of Health (R01-CA226805, R01-CA172463)