Analysis of canonical and non-canonical splice sites in mammalian genomes
Open Access
- 1 November 2000
- journal article
- research article
- Published by Oxford University Press (OUP) in Nucleic Acids Research
- Vol. 28 (21), 4364-4375
- https://doi.org/10.1093/nar/28.21.4364
Abstract
A set of 43 337 splice junction pairs was extracted from mammalian GenBank annotated genes. Expressed sequence tag (EST) sequences support 22 489 of them. Of these, 98.71% contain the canonical dinucleotides GT and AG at the donor and acceptor sites, respectively; 0.56% hold non-canonical GC-AG splice site pairs; and the remaining 0.73% fall into many small groups (each no larger than 0.05%). Studying these groups, we observe that many of them contain splicing dinucleotides shifted from the annotated splice junction by one position. After close examination of such cases, we present a new classification consisting of only eight observed types of splice site pairs (out of 256 a priori possible combinations). EST alignments allow us to verify the exonic part of the splice sites, but many non-canonical cases may be due to intron sequencing errors. This idea is given substantial support when we compare the sequences of human genes having non-canonical splice sites deposited in GenBank by high throughput genome sequencing projects (HTG). A high proportion (156 out of 171) of the human non-canonical and EST-supported splice site sequences had a clear match in the human HTG. After corrections, they can be classified as: 79 GC-AG pairs (of which one was an error that corrected to GC-AG); 61 errors that were corrected to canonical GT-AG pairs; six AT-AC pairs (of which two were errors that corrected to AT-AC); one case produced from a non-existent intron; seven cases found in HTG that were deposited to GenBank; and finally only two remaining cases of supported non-canonical splice sites. If we assume that approximately the same situation holds for the whole set of annotated mammalian non-canonical splice sites, then 99.24% of splice site pairs should be GT-AG, 0.69% GC-AG, 0.05% AT-AC and only 0.02% could consist of other types of non-canonical splice sites.
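The abstract does not spell out how the extrapolated percentages were obtained. The following is a minimal sketch of one reading that happens to reproduce them. It assumes (and this is an assumption, not a statement from the abstract) that the eight cases not assigned to a splice site type (the non-existent intron and the seven HTG cases) are set aside, and that the remaining 148 HTG-checked outcomes are scaled to all 290 EST-supported non-canonical pairs. All variable names are illustrative.

```python
# Counts quoted in the abstract.
TOTAL_EST_SUPPORTED = 22489   # EST-supported splice site pairs
CANONICAL = 22199             # EST-verified GT-AG pairs
NON_CANONICAL = 290           # EST-supported non-canonical pairs

# Outcomes of the 156 human non-canonical cases checked against HTG.
htg_outcomes = {"GT-AG": 61, "GC-AG": 79, "AT-AC": 6, "other": 2}
set_aside = 1 + 7  # non-existent intron + seven HTG cases (ASSUMED excluded)

# Sanity check: the categories account for all 156 matched cases.
assert sum(htg_outcomes.values()) + set_aside == 156

classified = sum(htg_outcomes.values())  # 148


def extrapolated_percentages():
    """Scale the HTG-checked proportions to all non-canonical pairs."""
    result = {}
    for pair_type, count in htg_outcomes.items():
        estimated = NON_CANONICAL * count / classified
        if pair_type == "GT-AG":
            # Corrected errors are added back to the canonical pool.
            estimated += CANONICAL
        result[pair_type] = 100.0 * estimated / TOTAL_EST_SUPPORTED
    return result


percentages = extrapolated_percentages()
for pair_type, pct in percentages.items():
    print(f"{pair_type}: {pct:.2f}%")
```

Under these assumptions the four values round to the figures quoted above and sum to 100%. Other ways of treating the eight set-aside cases would shift the second decimal place.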
We analyze several characteristics of EST-verified splice sites and build weight matrices for the major groups, which can be incorporated into gene prediction programs. We also present a set of EST-verified canonical splice sites larger by two orders of magnitude than the current one (22 199 entries versus ~600) and, finally, a set of 290 EST-supported non-canonical splice sites. Both sets should be significant for future investigations of the splicing mechanism.
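To illustrate what a splice site weight matrix is, here is a minimal sketch of a generic log-odds position weight matrix built from aligned, equal-length site windows. It is not the authors' procedure: the pseudocount, the uniform background of 0.25 and the toy donor sequences are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def build_weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Return one dict per position mapping A/C/G/T to a log2-odds score."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + 4 * pseudocount
    matrix = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        matrix.append({
            base: math.log2(((counts[base] + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return matrix


def score(matrix, seq):
    """Sum the per-position scores of a candidate site."""
    return sum(column[base] for column, base in zip(matrix, seq))


# Hypothetical donor-site windows (toy data, not taken from the paper).
toy_donors = ["AGGTAAGT", "AGGTGAGT", "CGGTAAGA", "AGGTAAGG"]
pwm = build_weight_matrix(toy_donors)

# A window carrying the GT dinucleotide scores higher than one without it.
print(score(pwm, "AGGTAAGT") > score(pwm, "AGCCAAGT"))
```

A gene prediction program could slide such a matrix along a genomic sequence and treat high-scoring windows as candidate splice sites. Separate matrices would be needed for each major group (for example GT-AG and GC-AG donors).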