Analysis and recognition of 5' UTR intron splice sites in human pre-mRNA
Open Access
- 13 February 2004
- journal article
- research article
- Published by Oxford University Press (OUP) in Nucleic Acids Research
- Vol. 32 (3), 1131-1142
- https://doi.org/10.1093/nar/gkh273
Abstract
Prediction of splice sites in non‐coding regions of genes is one of the most challenging aspects of gene structure recognition. We perform a rigorous analysis of such splice sites embedded in human 5′ untranslated regions (UTRs), and investigate correlations between this class of splice sites and other features found in the adjacent exons and introns. By restricting the training of neural network algorithms to ‘pure’ UTRs (not extending partially into protein coding regions), we for the first time investigate the predictive power of the splicing signal proper, in contrast to conventional splice site prediction, which typically relies on the change in sequence at the transition from protein coding to non‐coding. By doing so, the algorithms were able to pick up subtler splicing signals that were otherwise masked by ‘coding’ noise, thus enhancing significantly the prediction of 5′ UTR splice sites. For example, the non‐coding splice site predicting networks pick up compositional and positional bias in the 3′ ends of non‐coding exons and 5′ non‐coding intron ends, where cytosine and guanine are over‐represented. This compositional bias at the true UTR donor sites is also visible in the synaptic weights of the neural networks trained to identify UTR donor sites. Conventional splice site prediction methods perform poorly in UTRs because the reading frame pattern is absent. The NetUTR method presented here performs 2–3‐fold better compared with NetGene2 and GenScan in 5′ UTRs. We also tested the 5′ UTR trained method on protein coding regions, and discovered, surprisingly, that it works quite well (although it cannot compete with NetGene2). This indicates that the local splicing pattern in UTRs and coding regions is largely the same. The NetUTR method is made publicly available at www.cbs.dtu.dk/services/NetUTR.
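The compositional bias described in the abstract (cytosine and guanine over-represented around UTR donor sites) can be illustrated with a minimal Python sketch. This is not the NetUTR implementation: the function names, the window sizes, the use of every GT dinucleotide as a candidate donor site, and the four-bit sparse encoding are all assumptions made for illustration only.

```python
# Illustrative sketch only; names and window sizes are hypothetical,
# not taken from the NetUTR method.

def donor_windows(seq, upstream=3, downstream=3):
    """Yield (index, window) for each GT dinucleotide that has enough
    flanking sequence. The window covers `upstream` bases before the GT,
    the GT itself, and `downstream` bases after it."""
    seq = seq.upper()
    for i in range(upstream, len(seq) - downstream - 1):
        if seq[i:i + 2] == "GT":
            yield i, seq[i - upstream:i + 2 + downstream]


def positional_gc(windows):
    """Return the fraction of C or G at each position across windows
    of equal length (an empty list if no windows are given)."""
    windows = list(windows)
    if not windows:
        return []
    length = len(windows[0])
    return [
        sum(w[k] in "CG" for w in windows) / len(windows)
        for k in range(length)
    ]


def one_hot(window, alphabet="ACGT"):
    """Encode a window as a flat list of 0/1 values, four per base.
    Bases outside the alphabet (e.g. N) become four zeros.
    This encoding is an assumed input format for a neural network."""
    encoded = []
    for base in window.upper():
        encoded.extend(1 if base == letter else 0 for letter in alphabet)
    return encoded


if __name__ == "__main__":
    example = "ACGCGGTAAGTTTT"  # made-up sequence, not real data
    found = list(donor_windows(example))
    print(found)
    print(positional_gc(w for _, w in found))
```

In practice the per-position C+G fractions would be computed over annotated true donor sites and compared with those over non-site GT dinucleotides; the sketch only shows the bookkeeping.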