Abstract
Imputation, the practice of 'filling in' missing data with plausible values, has long been recognized as an attractive approach to analysing incomplete data. For decades, survey statisticians have been imputing large databases by often elaborate means.1 From an operational standpoint, imputation solves the missing-data problem at the outset, enabling the analyst to proceed without further hindrance. From a statistical standpoint, however, a naive or unprincipled imputation method may create more problems than it solves, distorting estimates, standard errors and hypothesis tests, as documented by Little and Rubin2 and others. The question of how to obtain valid inferences from imputed data was addressed by Rubin3 in his book on multiple imputation (MI). MI is a Monte Carlo technique in which the missing values are replaced by m > 1 simulated versions, where m is typically small (say, 3-10). In Rubin's method for 'repeated imputation' inference, each of the simulated complete datasets is analysed by standard methods, and the results are later combined to produce estimates and confidence intervals that incorporate missing-data uncertainty. Rubin's text addresses potential uses of MI primarily for large public-use data files from sample surveys and censuses. With the advent of new computational methods and software for creating MI's, however, the technique has become increasingly attractive for researchers in the biomedical, behavioural, and social sciences whose investigations are hindered by missing data.
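To make the combining step concrete (the abstract itself does not spell it out), the following is a sketch of the standard repeated-imputation pooling rules from Rubin's book, written for a scalar quantity of interest Q. Let \(\hat{Q}_j\) and \(U_j\) denote the complete-data estimate and its estimated variance from the j-th imputed dataset, \(j = 1, \dots, m\). The pooled estimate and its total variance are

\[
\bar{Q} = \frac{1}{m}\sum_{j=1}^{m}\hat{Q}_j, \qquad
\bar{U} = \frac{1}{m}\sum_{j=1}^{m}U_j, \qquad
B = \frac{1}{m-1}\sum_{j=1}^{m}\bigl(\hat{Q}_j - \bar{Q}\bigr)^{2},
\]
\[
T = \bar{U} + \Bigl(1 + \frac{1}{m}\Bigr)B,
\]

where \(\bar{U}\) is the average within-imputation variance and B is the between-imputation variance. Interval estimates take the form \(\bar{Q} \pm t_{\nu}\sqrt{T}\), with degrees of freedom \(\nu = (m-1)\bigl[1 + \bar{U}/\{(1 + m^{-1})B\}\bigr]^{2}\). The term \((1 + 1/m)B\) is what carries the missing-data uncertainty into the standard error and confidence interval, which is why a naive single imputation understates variability.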