Detecting hidden batch factors through data-adaptive adjustment for biological effects
Open Access
- 9 October 2017
- journal article
- research article
- Published by Oxford University Press (OUP) in Bioinformatics
- Vol. 34 (7), 1141-1147
- https://doi.org/10.1093/bioinformatics/btx635
Abstract
Batch effects are one of the major sources of technical variation affecting measurements in high-throughput studies such as RNA sequencing. It is well established that batch effects can be caused by different experimental platforms, laboratory conditions, sample sources and personnel. These differences can confound the outcomes of interest and lead to spurious results. A critical input for batch correction algorithms is knowledge of the batch factors, which in many cases are unknown or inaccurate. Hence, the primary motivation of our paper is to detect hidden batch factors that can be used in standard techniques to accurately capture the relationship between gene expression and other modeled variables of interest. We introduce a new algorithm based on data-adaptive shrinkage and semi-Non-negative Matrix Factorization for the detection of unknown batch effects. We test our algorithm on three different datasets: (i) Sequencing Quality Control, (ii) Topotecan RNA-Seq and (iii) Single-cell RNA sequencing (scRNA-Seq) on Glioblastoma Multiforme. In all three datasets, our algorithm identified hidden batch effects more accurately than existing batch detection algorithms. In the Topotecan study, we identified a new batch factor that was missed by the original study, leading to under-representation of differentially expressed genes. For scRNA-Seq, we demonstrated the power of our method in detecting subtle batch effects. The DASC R package is available via Bioconductor or at https://github.com/zhanglabNKU/DASC. Supplementary data are available at Bioinformatics online.
Funding Information
- Natural Science Foundation of Tianjin (15JCYBJC18900)
- National Natural Science Foundation of China (31728013)
- National Science Foundation (DMS-1263932)
- Cancer Prevention and Research Institute of Texas (RP-170387)
- Houston Endowment