Intra-Exon Motif Correlations as a Proxy Measure for Mean Per-Tile Sequence Quality Data in RNA-Seq

Abstract
Given the wide variability in the quality of next-generation sequencing data submitted to public repositories, it is essential to identify methods that can perform quality control on these data sets when additional quality control data, such as mean tile data, are missing from public repositories. In this study, we present evidence that correlating counts of reads corresponding to pairs of motifs separated over specific distances on individual exons can be used as a proxy mean tile data in the data sets we analyzed and hence could be used when mean tile data are not available. As test data sets we use the Homo sapiens in vitro transcribed (IVT) data set, and a Drosophila melanogaster data set comprising wild and mutant types. We find that a FastQC analysis of the available parts of these data sets demonstrates that the per-tile sequencing quality is good for all the data sets apart from the mutant-type data where the mutant-r3 data are worse than the mutant-r2 data. Correspondingly, intra-exon motif correlations are reasonably large for all data sets except this latter case where the mutant-r2 correlations are low and the mutant-r3 correlations close to zero. We propose that these extremely low correlations are indicative of bias of technical origin, such as flowcell errors. In addition to this, the intra-exon motif correlations as a function of both guanosine-cytosine (GC) content parameters are somewhat higher and less dependent on the GC content parameters in the IVT-Plasmids messenger RNA (mRNA) selection free RNA-Seq sample (control) than in the other RNA-Seq samples that did undergo mRNA selection: both ribosomal depletion (IVT-Only) and PolyA selection (IVT-PolyA, wild type, and mutant).