Statistical Development and Evaluation of Microarray Gene Expression Data Filters

1 May 2005

journal article
research article
Published by Mary Ann Liebert Inc in Journal of Computational Biology

Vol. 12 (4), 482-495
https://doi.org/10.1089/cmb.2005.12.482

Abstract

Filtering is a common practice used to simplify the analysis of microarray data by removing from subsequent consideration probe sets believed to be unexpressed. The m/n filter, which is widely used in the analysis of Affymetrix data, removes all probe sets having fewer than m present calls among a set of n chips. The m/n filter has been widely used without considering its statistical properties. The level and power of the m/n filter are derived. Two alternative filters, the pooled p-value filter and the error-minimizing pooled p-value filter, are proposed. The pooled p-value filter combines information from the present-absent p-values into a single summary p-value which is subsequently compared to a selected significance threshold. We show that the pooled p-value filter is the uniformly most powerful statistical test under a reasonable beta model and that it exhibits greater power than the m/n filter in all scenarios considered in a simulation study. The error-minimizing pooled p-value filter compares the summary p-value with a threshold determined to minimize a total-error criterion based on a partition of the distribution of all probes' summary p-values. The pooled p-value and error-minimizing pooled p-value filters clearly perform better than the m/n filter in a case-study analysis. The case-study analysis also demonstrates a proposed method for estimating the number of differentially expressed probe sets excluded by filtering and subsequent impact on the final analysis. The filter impact analysis shows that the use of even the best filter may hinder, rather than enhance, the ability to discover interesting probe sets or genes. S-plus and R routines to implement the pooled p-value and error-minimizing pooled p-value filters have been developed and are available from www.stjuderesearch.org/depts/biostats/index.html.
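
To make the contrast between the two filters concrete, the following Python sketch may help. It is an illustration under stated assumptions, not the authors' S-plus or R routines. The abstract does not give the combining statistic, so Fisher's method (the sum of log p-values) is used here as a stand-in for the pooled p-value. The present-call cutoff of 0.04, the significance threshold of 0.05, the independence of chips, and all function names are assumptions made for this example.

```python
import math

def mn_filter_keep(p_values, m, call_threshold=0.04):
    """m/n filter: keep a probe set if at least m of its n chips have a
    present call. A call is 'present' when the detection p-value falls
    below call_threshold (0.04 is an assumed cutoff for this sketch)."""
    present_calls = sum(1 for p in p_values if p < call_threshold)
    return present_calls >= m

def mn_filter_level(m, n, call_threshold=0.04):
    """Probability that an unexpressed probe set passes the m/n filter,
    assuming its n p-values are independent Uniform(0, 1). The number of
    present calls is then Binomial(n, call_threshold)."""
    q = call_threshold
    return sum(math.comb(n, k) * q**k * (1 - q)**(n - k)
               for k in range(m, n + 1))

def fisher_pooled_p(p_values):
    """Summary p-value by Fisher's method (assumed stand-in statistic).
    Under the null, -2 * sum(log p) is chi-square with 2n degrees of
    freedom, whose upper tail has a closed form for even df."""
    n = len(p_values)
    # Guard against log(0); half_stat equals (-2 * sum(log p)) / 2.
    half_stat = -sum(math.log(max(p, 1e-300)) for p in p_values)
    tail_sum = sum(half_stat**k / math.factorial(k) for k in range(n))
    return math.exp(-half_stat) * tail_sum

def pooled_filter_keep(p_values, alpha=0.05):
    """Pooled filter: keep the probe set if the summary p-value is at or
    below the chosen significance threshold (alpha is an assumption)."""
    return fisher_pooled_p(p_values) <= alpha

# A probe set with moderate evidence of expression on every chip.
borderline = [0.05, 0.05, 0.05, 0.05]
print(mn_filter_keep(borderline, m=2))   # no single chip crosses 0.04
print(pooled_filter_keep(borderline))    # evidence is combined across chips
print(mn_filter_level(m=2, n=4))         # level of a 2/4 filter
```

In this example no chip yields a present call, so the 2/4 filter discards the probe set, while the pooled statistic accumulates the consistent moderate evidence across chips and retains it. This is one intuition for why a pooled test can have greater power than a count of thresholded calls.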
