Recurrent functional misinterpretation of RNA-seq data caused by sample-specific gene length bias

Open Access

12 November 2019

journal article
research article
Published by Public Library of Science (PLoS) in PLoS Biology

Vol. 17 (11), e3000481
https://doi.org/10.1371/journal.pbio.3000481

Abstract

Data normalization is a critical step in RNA sequencing (RNA-seq) analysis, aiming to remove systematic effects from the data to ensure that technical biases have minimal impact on the results. Analyzing numerous RNA-seq datasets, we detected a prevalent sample-specific length effect that leads to a strong association between gene length and fold-change estimates between samples. This stochastic sample-specific effect is not corrected by common normalization methods, including reads per kilobase of transcript length per million reads (RPKM), Trimmed Mean of M values (TMM), relative log expression (RLE), and quantile and upper-quartile normalization. Importantly, we demonstrate that this bias causes recurrent false positive calls by gene-set enrichment analysis (GSEA) methods, thereby leading to frequent functional misinterpretation of the data. Gene sets characterized by markedly short genes (e.g., ribosomal protein genes) or long genes (e.g., extracellular matrix genes) are particularly prone to such false calls. This sample-specific length bias is effectively removed by the conditional quantile normalization (cqn) and EDASeq methods, which allow the integration of gene length as a sample-specific covariate. Consequently, using these normalization methods led to a substantial reduction in false GSEA results while retaining true ones. In addition, we found that applying gene-set tests that take gene–gene correlations into account attenuates the false positive rates caused by the length bias, but statistical power is reduced as well. Our results advocate the inspection and correction of sample-specific length biases as default steps in RNA-seq analysis pipelines and reiterate the need to account for intergene correlations when performing gene-set enrichment tests to reduce misinterpretation of transcriptomic data.
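
The inspection step advocated in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline, and it does not implement cqn or EDASeq (both are R/Bioconductor packages); it only checks whether fold changes between two samples track gene length. The function names `ordinal_ranks` and `length_bias_correlation`, the pseudocount, the counts-per-million scaling, and the simulated data (including the 0.3 exponent used to inject a length effect) are all assumptions made for illustration.

```python
import numpy as np


def ordinal_ranks(x):
    """Rank values 0..n-1 (ties broken by position; adequate for a rough check)."""
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(len(x))
    return ranks


def length_bias_correlation(counts_a, counts_b, gene_lengths, pseudocount=1.0):
    """Rank correlation between log2 gene length and log2 fold change (b vs. a).

    Hypothetical helper: a value far from zero suggests a sample-specific
    length effect that library-size scaling alone has not removed.
    """
    counts_a = np.asarray(counts_a, dtype=float)
    counts_b = np.asarray(counts_b, dtype=float)
    lengths = np.asarray(gene_lengths, dtype=float)
    # Scale each sample to counts per million to remove library-size differences.
    cpm_a = counts_a / counts_a.sum() * 1e6
    cpm_b = counts_b / counts_b.sum() * 1e6
    log_fc = np.log2(cpm_b + pseudocount) - np.log2(cpm_a + pseudocount)
    # Pearson correlation of ranks, i.e. a Spearman-style correlation.
    return float(np.corrcoef(ordinal_ranks(np.log2(lengths)), ordinal_ranks(log_fc))[0, 1])


# Simulated data (assumption): 2000 genes with log-uniform lengths of 300-30000 bp.
rng = np.random.default_rng(0)
n_genes = 2000
lengths = np.exp(rng.uniform(np.log(300), np.log(30000), n_genes))
expr_per_kb = rng.lognormal(mean=3.0, sigma=1.0, size=n_genes)
expected = expr_per_kb * lengths / 1000.0  # expected read count per gene

reference = rng.poisson(expected)
control = rng.poisson(expected)  # same biology, no length effect
skew = (lengths / np.median(lengths)) ** 0.3  # injected sample-specific length effect
biased = rng.poisson(expected * skew)

r_control = length_bias_correlation(reference, control, lengths)
r_biased = length_bias_correlation(reference, biased, lengths)
print(f"control vs reference: r = {r_control:.3f}")
print(f"biased vs reference:  r = {r_biased:.3f}")
```

In this simulation the control pair should show a correlation near zero, while the pair with the injected effect should show a clearly positive one. On real data, a comparable check per sample pair only flags the bias; removing it would still require a method that models gene length as a sample-specific covariate, as the abstract describes.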

Funding Information

Israel Science Foundation (2118/19)
DIP German-Israeli project cooperation
Koret-UC Berkeley-Tel Aviv University Initiative in Computational Biology and Bioinformatics
VWM Saxby project
Edmond J. Safra Center for Bioinformatics at Tel Aviv University
Sagol School of Neuroscience

Cited by 51 articles