Auxiliary outcome data and the mean score method

Abstract
In medical research, outcomes of interest, Y, are often difficult to ascertain on a sufficiently large number of study subjects; cost, for example, is frequently an issue. A more feasible approach might be to ascertain an easily measured but less accurate surrogate outcome variable, A, and to supplement the study with a validation sample of observations for which both Y and A have been measured. In the context of a regression model P_β(Y∣X), with X a covariate vector, we propose a method called mean score to make inference about β using such data. This method does not require specification of the association between Y and A and is semiparametric in this sense. Moreover, in contrast to previous work by Espeland and Odoroff (J. Amer. Statist. Assoc. 80 (1985), 663–670) and Buonaccorsi (J. Amer. Statist. Assoc. 85 (1990), 1075–1082), sampling of the true outcome can depend on both covariate and auxiliary data. Two illustrations in real medical contexts demonstrate that auxiliary data can substantially improve efficiency over standard statistical designs. Designs which incorporate auxiliary data may become increasingly useful as budgetary restrictions and health care management play a larger role in medical research.

A third illustration demonstrates that the mean score method can be useful in the classical setting when observational datasets contain missing outcome data. Data need not be missing at random in the usual sense (Rubin, Multiple Imputation for Non-Response in Surveys (1987), Wiley, New York). Indeed, the mean score method can adjust for biases induced by violation of the missing at random assumption in certain settings. We contend that the mean score method will be particularly useful in observational studies where it is possible, although perhaps inconvenient, to retrieve missing data.