RNA-Seq gene expression estimation with read mapping uncertainty

Open Access

18 December 2009

journal article
research article
Published by Oxford University Press (OUP) in Bioinformatics

Vol. 26 (4), 493-500
https://doi.org/10.1093/bioinformatics/btp692

Abstract

Motivation: RNA-Seq is a promising new technology for accurately measuring gene expression levels. Expression estimation with RNA-Seq requires the mapping of relatively short sequencing reads to a reference genome or transcript set. Because reads are generally shorter than transcripts from which they are derived, a single read may map to multiple genes and isoforms, complicating expression analyses. Previous computational methods either discard reads that map to multiple locations or allocate them to genes heuristically.

Results: We present a generative statistical model and associated inference methods that handle read mapping uncertainty in a principled manner. Through simulations parameterized by real RNA-Seq data, we show that our method is more accurate than previous methods. Our improved accuracy is the result of handling read mapping uncertainty with a statistical model and the estimation of gene expression levels as the sum of isoform expression levels. Unlike previous methods, our method is capable of modeling non-uniform read distributions. Simulations with our method indicate that a read length of 20–25 bases is optimal for gene-level expression estimation from mouse and maize RNA-Seq data when sequencing throughput is fixed.

Availability: An initial C++ implementation of our method that was used for the results presented in this article is available at http://deweylab.biostat.wisc.edu/rsem.

Contact: cdewey@biostat.wisc.edu

Supplementary information: Supplementary data are available at Bioinformatics on
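
The idea stated in the abstract, of resolving multi-mapped reads with a statistical model and then summing isoform levels into gene levels, can be illustrated with a small expectation-maximization (EM) loop. This is a minimal sketch under strong simplifying assumptions, not the authors' C++ implementation: the function names, the toy reads and the isoform-to-gene mapping are all hypothetical, and the sketch omits factors a full model would need, such as transcript length, sequencing error and non-uniform read start positions.

```python
from collections import defaultdict


def em_isoform_abundance(read_compat, n_isoforms, n_iter=200):
    """Estimate the fraction of reads originating from each isoform.

    read_compat: list of lists; each inner list holds the indices of the
    isoforms a read maps to (hypothetical input format for this sketch).
    """
    # Start from a uniform guess over isoforms.
    theta = [1.0 / n_isoforms] * n_isoforms
    for _ in range(n_iter):
        counts = [0.0] * n_isoforms
        # E-step: split each read among its candidate isoforms in
        # proportion to the current abundance estimates.
        for isoforms in read_compat:
            total = sum(theta[i] for i in isoforms)
            for i in isoforms:
                counts[i] += theta[i] / total
        # M-step: renormalize the expected counts into fractions.
        n = sum(counts)
        theta = [c / n for c in counts]
    return theta


def gene_levels(theta, isoform_to_gene):
    """Gene-level expression as the sum of its isoform-level estimates."""
    genes = defaultdict(float)
    for i, gene in enumerate(isoform_to_gene):
        genes[gene] += theta[i]
    return dict(genes)


# Toy data (made up): isoforms 0 and 1 belong to geneA, isoform 2 to geneB.
isoform_to_gene = ["geneA", "geneA", "geneB"]
reads = (
    [[0]] * 6        # six reads mapping uniquely to isoform 0
    + [[2]] * 2      # two reads mapping uniquely to isoform 2
    + [[0, 2]] * 4   # four reads ambiguous between isoforms 0 and 2
)

theta = em_isoform_abundance(reads, n_isoforms=3)
genes = gene_levels(theta, isoform_to_gene)
print(theta)
print(genes)
```

In this toy case the ambiguous reads are shared out according to the evidence from uniquely mapped reads, so the estimates approach 0.75 for isoform 0 and 0.25 for isoform 2, rather than the ambiguous reads being discarded or split by a fixed rule.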

Cited by 1026 articles