A Nonparametric Method for Dealing With Mismeasured Covariate Data

Abstract
Mismeasurement of covariate data is a frequent problem in statistical data analysis. However, when true and mismeasured data are obtained for a subsample of the observations, it is possible to estimate the parameters relating the outcome to the covariate of interest. Maximum likelihood methods that rely on parametric models for the mismeasurement have not met with much success. Realistic models for the mismeasurement process are difficult to construct; the form of the likelihood is often intractable and, more important, such methods are not robust to model misspecification. We propose an easily implemented method that is nonparametric with respect to the mismeasurement process and that is applicable when mismeasurement is due to the problem of missing data, errors in variables, or use of imperfect surrogate covariates. Specifically, denote the outcome variable by Y, the covariate data subject to mismeasurement by X, and the remaining covariates, including perhaps surrogates or mismeasured values of X, by Z. We consider a general regression model of the form $P_\beta(Y \mid X, Z)$. Suppose data regarding Y, X, and Z are available for a validation sample $V$, a random subsample of the total sample, whereas data regarding only Y and Z are available for the remainder, the nonvalidation sample $\bar{V}$. We propose to base inference on the estimated likelihood for β, $\hat{L}(\beta) = \prod_{i \in V} P_\beta(Y_i \mid X_i, Z_i) \prod_{j \in \bar{V}} \hat{P}_\beta(Y_j \mid Z_j)$, where $\hat{P}_\beta(Y_j \mid Z_j)$ is estimated empirically using the validation sample covariate data. Asymptotic results are derived for the case in which the surrogate or mismeasured covariates are categorical. The asymptotic variance of the estimated score involves not only the second derivative of the log estimated likelihood but also a term that captures the variability induced by estimating the nonvalidation sample likelihood.
An example and a small simulation study demonstrate that this method may be of value for the missing covariate data and covariate measurement error problems.
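
To make the form of the estimated likelihood concrete, the following is a minimal sketch for the case of a categorical Z. It rests on assumptions that are not stated in the abstract: a binary outcome Y with a logistic model for $P_\beta(Y \mid X, Z)$, Z coded numerically, and an empirical estimate of $P_\beta(Y_j \mid Z_j)$ formed by averaging $P_\beta(Y_j \mid X_i, Z_j)$ over the validation subjects i with $Z_i = Z_j$. The function names `model_prob` and `estimated_log_likelihood` are hypothetical and chosen only for illustration.

```python
import numpy as np


def model_prob(beta, y, x, z):
    """P_beta(Y = y | X = x, Z = z) under an ASSUMED logistic model.

    The logistic form and the three-parameter layout of beta
    (intercept, X effect, Z effect) are illustrative assumptions.
    """
    eta = beta[0] + beta[1] * x + beta[2] * z
    p1 = 1.0 / (1.0 + np.exp(-eta))
    return np.where(y == 1, p1, 1.0 - p1)


def estimated_log_likelihood(beta, y_val, x_val, z_val, y_non, z_non):
    """Log of the estimated likelihood for categorical Z.

    Validation subjects contribute log P_beta(Y_i | X_i, Z_i).
    Nonvalidation subjects contribute log P-hat_beta(Y_j | Z_j), where
    P-hat is the average of P_beta(Y_j | X_i, Z_j) over validation
    subjects i sharing the same Z category (an assumed estimator).
    """
    beta = np.asarray(beta, dtype=float)
    y_val, x_val, z_val = map(np.asarray, (y_val, x_val, z_val))

    # Contribution of the validation sample: fully observed (Y, X, Z).
    loglik = np.sum(np.log(model_prob(beta, y_val, x_val, z_val)))

    # Contribution of the nonvalidation sample: X is unobserved, so
    # average the model over validation X values with matching Z.
    for y_j, z_j in zip(y_non, z_non):
        x_match = x_val[z_val == z_j]
        if x_match.size == 0:
            raise ValueError(f"no validation subjects with Z = {z_j}")
        p_hat = np.mean(model_prob(beta, y_j, x_match, z_j))
        loglik += np.log(p_hat)

    return loglik
```

A point estimate could then be obtained by passing the negative of this function to a general-purpose optimizer such as `scipy.optimize.minimize`. Standard errors should not be taken from the Hessian alone, since the variance also includes the term induced by estimating the nonvalidation sample likelihood.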