Abstract
The long case

Although there are numerous variations on the long case, traditionally a student is given unobserved time with a real patient in a clinical setting. During that time the student conducts an interview and performs a physical examination as appropriate. The student then presents his or her findings and plans to the examiners, who ask about the patient and related topics, enabling them to judge the quality of the student's performance.

Strengths

The primary strength of the long case is that it evaluates the student's performance with a real patient. In the early stages of training, the objective structured clinical examination is often used, and it assesses clinical skills well.1 However, the standardised or simulated patients that are part of that examination are limited in the number and complexity of the medical problems they can portray. Consequently, as students approach entry to practice, assessment also needs to be based on performance with real patients, who can exhibit the range of conditions seen in the clinical setting.

The long case also presents students with a complete and realistic clinical challenge. They are required to obtain all relevant information from the patient, structure the problem, synthesise their findings, and formulate a management plan. This contrasts with the typical objective structured clinical examination, in which each station focuses on one aspect of the doctor-patient encounter.

Weaknesses

Implicit in the use of the long case is the assumption that if the student were examined again with another patient and different examiners, the results would be the same. Otherwise the scores could not be trusted to predict performance in practice, and it would make no sense to use them for assessment. The concept that test results should generalise, or be repeatable, is referred to as reliability or reproducibility.2

Over the past 30 years it has become increasingly apparent that the long case does not yield results that achieve reasonable levels of reproducibility. For example, in the early 1970s the American Board of Internal Medicine's oral examination for cardiovascular disease consisted of two long cases, each with two examiners.3 Putting both cases together yielded a score with a reproducibility coefficient of 0.39, meaning that 39% of the variability in scores was due to students' ability (signal) and 61% to errors of measurement (noise). When adjusted to predict what would happen with only one long case, the coefficient drops to 0.24, indicating that scores contain more than three times as much noise as signal. Other studies have obtained similar results, in stark contrast to clinical examinations such as the objective structured clinical examination, which often achieve reproducibility coefficients of 0.80 or better.4 5 The table presents some typical findings for the reproducibility of various formats.
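The drop from 0.39 to 0.24 is consistent with the Spearman-Brown prophecy formula, the standard psychometric adjustment for a change in test length. Assuming that is the adjustment applied, and writing r for the original coefficient and k for the factor by which testing time changes (symbols introduced here purely for illustration), halving the examination gives:

\[
r_{\text{new}} \;=\; \frac{k\,r}{1 + (k-1)\,r},
\qquad
r_{1\ \text{case}} \;=\; \frac{0.5 \times 0.39}{1 - 0.5 \times 0.39} \;=\; \frac{0.195}{0.805} \;\approx\; 0.24 .
\]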
Three major factors explain why the long case has problems with reproducibility.8 In decreasing order of importance, they are the case specificity of problem solving, differences between examiners, and variability in the aspects of an encounter evaluated.

Case specificity

For the scores from the long case to be reproducible, students must perform at the same level regardless of the patient they examine, yet physician performance varies from case to case. The case specificity of problem solving was identified by Elstein and colleagues.9 It has been replicated in many studies.

These findings should not be surprising. Physicians know that they do not perform uniformly across all patient problems, or even across different patients with the same problem. They have areas of relative strength and weakness, they respond differently to patients depending on their personal and professional experiences, and patients respond differently to them depending on a variety of factors. Therefore an assessment device must sample broadly across patients to generate scores that will generalise to typical performance.

Examiner effects

For scores to be reproducible, examiners must apply the same standards. Research shows that even experienced educators differ when assessing the same event.10 This should not be surprising either. Physicians have legitimate and desirable differences in knowledge, standards, emphasis, and values. Likewise, they occasionally respond out of their own deficits of knowledge or the inappropriate influence of the patient's or student's style, appearance, race, sex, ethnicity, and so on. Further, patients' conditions naturally vary in difficulty, and examiners must compensate precisely for the differences among them. These issues, and others, ensure that examiners differ when evaluating the same thing, undermining the reproducibility of scores for the long case. An assessment device must sample across examiners to generate reproducible results.

Table: Reproducibility of assessment formats studied by the American Board of Internal Medicine (estimates, based on three hours' testing time, will vary in other settings depending on quality of test material and heterogeneity of examinees)

Aspects of a competence assessed

For scores to be reproducible, it is important to assess several aspects of the competence being elicited by the student-patient encounter. Specifically, a variety of studies show that the information obtained from measurements increases when examiners are instructed to evaluate a standardised list of different features of a competence, or when they observe the student-patient encounter rather than making a single global assessment or basing their judgments on interrogation alone.11–13 Again, these empirical results should not be surprising. Without specific instruction, examiners will naturally attend to different aspects of an encounter, and this will be reflected in their evaluations. An assessment device must sample systematically across aspects of a competence to generate reproducible results.
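These sampling requirements can be made concrete with a rough calculation. Assuming the single-case coefficient of 0.24 quoted above, and treating additional long cases (each with a different patient and different examiners) as parallel measurements, an idealisation that real encounters only approximate, the same Spearman-Brown adjustment suggests that about 13 cases would be needed before the combined score reached a coefficient of 0.80. This is a sketch under those assumptions, not a figure reported in the studies cited:

\[
\frac{n \times 0.24}{1 + (n-1) \times 0.24} \;\geq\; 0.80
\quad\Longrightarrow\quad
n \;\geq\; \frac{0.80\,(1 - 0.24)}{0.24\,(1 - 0.80)} \;\approx\; 12.7 .
\]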