A Decision Tree System for Finding Genes in DNA
- 1 January 1998
- journal article
- research article
- Published by Mary Ann Liebert Inc in Journal of Computational Biology
- Vol. 5 (4), 667-680
- https://doi.org/10.1089/cmb.1998.5.667
Abstract
Morgan is an integrated system for finding genes in vertebrate DNA sequences. Morgan uses a variety of techniques to accomplish this task, the most distinctive of which is a decision tree classifier. The decision tree system is combined with new methods for identifying start codons, donor sites, and acceptor sites, and these are brought together in a frame-sensitive dynamic programming algorithm that finds the optimal segmentation of a DNA sequence into coding and noncoding regions (exons and introns). The optimal segmentation is dependent on a separate scoring function that takes a subsequence and assigns to it a score reflecting the probability that the sequence is an exon. The scoring functions in Morgan are sets of decision trees that are combined to give a probability estimate. Experimental results on a database of 570 vertebrate DNA sequences show that Morgan has excellent performance by many different measures. On a separate test set, it achieves an overall accuracy of 95%, with a correlation coefficient of 0.78, and a sensitivity and specificity for coding bases of 83% and 79%. In addition, Morgan identifies 58% of coding exons exactly; i.e., both the beginning and end of the coding regions are predicted correctly. This paper describes the Morgan system, including its decision tree routines and the algorithms for site recognition, and its performance on a benchmark database of vertebrate DNA.
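The base-level measures quoted in the abstract (accuracy, correlation coefficient, sensitivity, specificity) can be illustrated with a minimal sketch. The abstract does not spell out its definitions, so the following are assumptions: sensitivity is taken as TP/(TP+FN), specificity as TP/(TP+FP) (the convention common in gene-prediction evaluations, which differs from the TN-based definition used elsewhere), and the correlation coefficient as the Matthews correlation over per-base coding/noncoding labels. The function name `base_level_metrics` and the `C`/`N` label encoding are hypothetical and not part of Morgan.

```python
import math


def base_level_metrics(actual, predicted):
    """Compare two equal-length label strings, 'C' = coding, 'N' = noncoding.

    Assumed definitions (not taken from the abstract):
      sensitivity = TP / (TP + FN)
      specificity = TP / (TP + FP)
      correlation = Matthews correlation coefficient
    """
    if len(actual) != len(predicted):
        raise ValueError("label strings must have the same length")

    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a not in "CN" or p not in "CN":
            raise ValueError("labels must be 'C' or 'N'")
        if a == "C" and p == "C":
            tp += 1  # coding base predicted as coding
        elif a == "N" and p == "C":
            fp += 1  # noncoding base predicted as coding
        elif a == "N" and p == "N":
            tn += 1  # noncoding base predicted as noncoding
        else:
            fn += 1  # coding base predicted as noncoding

    def ratio(num, den):
        # Return NaN instead of raising when a measure is undefined.
        return num / den if den else float("nan")

    cc_den = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tp, tp + fp),
        "correlation": ratio(tp * tn - fp * fn, cc_den),
    }


if __name__ == "__main__":
    # Toy example: the predicted exon is shifted one base to the left.
    print(base_level_metrics("NNCCCCNNNN", "NCCCCNNNNN"))
```

Under these assumed definitions, a high overall accuracy can coexist with a lower correlation coefficient when most bases are noncoding, which is one reason several measures are usually reported together.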