A joint deep learning model enables simultaneous batch effect correction, denoising, and clustering in single-cell transcriptomics
Open Access
- 25 May 2021
- journal article
- research article
- Published by Cold Spring Harbor Laboratory in Genome Research
- Vol. 31 (10), 1753-1766
- https://doi.org/10.1101/gr.271874.120
Abstract
Recent developments in single-cell RNA-seq (scRNA-seq) technologies have led to enormous biological discoveries. As the scale of scRNA-seq studies increases, a major challenge in analysis is batch effects, which are inevitable in studies involving human tissues. Most existing methods remove batch effects in a low-dimensional embedding space. Although this is useful for clustering, batch effects remain in the gene expression space, leaving downstream gene-level analysis susceptible to them. Recent studies have shown that batch effect correction in the gene expression space is much harder than in the embedding space. Popular methods such as Seurat 3.0 rely on the mutual nearest neighbor (MNN) approach to remove batch effects in gene expression, but MNN can only analyze two batches at a time, and it becomes computationally infeasible when the number of batches is large. Here we present CarDEC, a joint deep learning model that simultaneously clusters and denoises scRNA-seq data while correcting batch effects in both the embedding space and the gene expression space. Comprehensive evaluations spanning different species and tissues showed that CarDEC outperforms Scanorama, DCA + Combat, scVI, and MNN. With CarDEC denoising, non-highly variable genes offer as much signal for clustering as the highly variable genes (HVGs), suggesting that CarDEC substantially boosts the information content of scRNA-seq data. We also showed that trajectory analysis using CarDEC's denoised and batch-corrected expression as input revealed marker genes and transcription factors that are otherwise obscured in the presence of batch effects. CarDEC is computationally fast, making it a desirable tool for large-scale scRNA-seq studies.
This publication has 30 references indexed in Scilit:
- Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nature Biotechnology, 2018
- SCANPY: large-scale single-cell gene expression data analysis. Genome Biology, 2018
- Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics, 2017
- Improved Deep Embedded Clustering with Local Structure Preservation. International Joint Conferences on Artificial Intelligence, 2017
- Single-cell transcriptomes identify human islet cell signatures and reveal cell-type–specific expression changes in type 2 diabetes. Genome Research, 2016
- A Single-Cell Transcriptome Atlas of the Human Pancreas. Cell Systems, 2016
- Single-Cell Transcriptome Profiling of Human Pancreatic Islets in Health and Type 2 Diabetes. Cell Metabolism, 2016
- De Novo Prediction of Stem Cell Identity using Single-Cell Transcriptome Data. Cell Stem Cell, 2016
- Gene expression profiling reveals the defining features of the classical, intermediate, and nonclassical human monocyte subsets. Blood, 2011
- Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics, 2006