Evidence base of clinical diagnosis: Designing studies to ensure that estimates of test accuracy are transferable

Abstract
True variability in test accuracy

Interpreting a test's results in different settings requires an understanding of whether and why the test's accuracy varies. Broadly speaking, measures of accuracy fall into two categories: measures of discrimination between people who are and who are not diseased, and measures of prediction used to estimate the post-test probability of disease.

Measures of discrimination

Global measures of test accuracy assess only the ability of the test to discriminate between people with and without a disease. Common examples are the area under the receiver operating characteristic (ROC) curve and the odds ratio, sometimes also referred to as the diagnostic odds ratio. Such results may suffice for some broad health policy decisions, for example to decide whether a new test is in general better than an existing test for the target condition.

Measures for prediction

The measures used to estimate the probability of the target condition in people who have a particular test result require both discrimination and calibration. The predictive value, that is, the proportion of people with a particular test result who have the disease of interest, is an example. Estimating disease rates directly for categories of patients who may have different pretest probabilities of disease is clumsy and difficult. The estimation is therefore often done indirectly using Bayes's theorem, based on the pretest probability and measures of test characteristics such as sensitivity and specificity or likelihood ratios. These measures of test performance require more than discrimination: they require tests to be calibrated.

Transferability of test results

The transferability of measures of test performance from one setting to another depends on which indicator of test performance is used. The figure shows the assumptions involved in transferability.
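As an illustration of how a pretest probability is combined with test characteristics via Bayes's theorem, the calculation can be sketched as follows (the function names and numbers are illustrative, not taken from the article):

```python
def likelihood_ratio_positive(sensitivity, specificity):
    """LR+ = sensitivity / (1 - specificity)."""
    return sensitivity / (1 - specificity)

def post_test_probability(pretest_prob, lr):
    """Bayes's theorem on the odds scale:
    post-test odds = pretest odds * likelihood ratio."""
    pretest_odds = pretest_prob / (1 - pretest_prob)
    post_odds = pretest_odds * lr
    return post_odds / (1 + post_odds)

# Example: sensitivity 0.9, specificity 0.8, pretest probability 0.10
lr_pos = likelihood_ratio_positive(0.9, 0.8)          # 4.5
print(round(post_test_probability(0.10, lr_pos), 3))  # 0.333
```

The odds form makes the "adjustment" explicit: the same likelihood ratio can be applied to whatever pretest probability prevails in a given setting.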
[Figure: Distribution of test results in patients with and without the target disease. The numbers refer to assumptions for the transferability of test results (see text and table)]

[Table: Assumptions for transferring different test performance characteristics (X=important; x=less important)]

The table indicates the relation between these assumptions and the transferability of the different measures of test performance. The main assumptions in transferring tests across settings fall into six categories.

1. The definition of disease is constant. Many diseases have ambiguous definitions. For example, there is no single reference standard for heart failure, Alzheimer's disease, or diabetes. Reference standards may differ because individual investigators' conceptual frameworks differ, or because it is difficult to apply the same framework in a standardised way.

2. The same test is used. Although based on the same principle, tests may differ, for example over time or if made by different manufacturers.

3. The thresholds between categories of test result (for example, positive and negative) are constant. This is possible with a well standardised test that can be calibrated for different settings. However, there may be no accepted means of calibration; for example, different observers of imaging tests may have different thresholds for calling an image "positive." The effect of different cut-off points is classically studied by use of a receiver operating characteristic curve. In some cases calibration may be improved by using category specific likelihood ratios rather than a single cut-off point.
4. The distribution of test results in the disease group is constant in average (location) and spread (shape). This assumption is not fulfilled if the spectrum of disease changes; for example, a screening setting is likely to include earlier disease, for which test results will be closer to those of a group without the disease (hence reducing sensitivity).

5. The distribution of test results in the group without disease is constant in average (location) and spread (shape). This assumption is not fulfilled if the spectrum of non-disease changes; for example, a secondary care setting may involve additional causes of false positives, due to comorbidity, that are not seen in primary care.

6. The ratio of disease to non-disease (pretest probability) is constant. If this were the case, we could use the post-test probabilities ("predictive" values) directly. However, this assumption is often not fulfilled; for example, the pretest probability is likely to be lowest with screening tests and greatest with tests in referred patients. This likely inconstancy is the reason for using Bayes's theorem to "adjust" the post-test probability for the pretest probability of each different setting.

All measures of test performance need the first two assumptions to be fulfilled. The importance of the last four assumptions is shown in the table, although they may not be necessary in every instance; occasionally the assumptions may not be fulfilled but, because of compensating differences, transferability is still reasonable.
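The category specific likelihood ratios mentioned under assumption 3 can be computed from a cross tabulation of test result categories by disease status. A minimal sketch, using made-up counts purely for illustration:

```python
# Hypothetical counts of test results by category in diseased
# and non-diseased groups (illustrative numbers, not real data).
diseased     = {"negative": 10, "intermediate": 30, "positive": 60}
non_diseased = {"negative": 70, "intermediate": 20, "positive": 10}

n_dis = sum(diseased.values())       # 100
n_non = sum(non_diseased.values())   # 100

# Category specific likelihood ratio:
# P(result in category | disease) / P(result in category | no disease)
for category in diseased:
    lr = (diseased[category] / n_dis) / (non_diseased[category] / n_non)
    print(category, round(lr, 2))
```

Keeping the categories, rather than collapsing to a single positive/negative cut-off, preserves information: an "intermediate" result shifts the probability of disease far less than a frankly "positive" one.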
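Assumption 6 is the reason predictive values do not transfer directly between settings: even with sensitivity and specificity held constant, the positive predictive value changes sharply with the pretest probability. A small sketch with illustrative numbers:

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """PPV = true positives / all positives, via Bayes's theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# The same test (sensitivity 0.9, specificity 0.9) in two settings:
ppv_screening = positive_predictive_value(0.9, 0.9, 0.01)  # low pretest probability
ppv_referral  = positive_predictive_value(0.9, 0.9, 0.30)  # high pretest probability
print(round(ppv_screening, 2), round(ppv_referral, 2))     # 0.08 0.79
```

This is why a predictive value measured in referred patients cannot be carried over to a screening population, whereas sensitivity, specificity, or likelihood ratios can be, provided the other assumptions hold.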