DNA sequence classification via an expectation maximization algorithm and neural networks: a case study
- 1 November 2001
- journal article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Systems, Man and Cybernetics, Part C (Applications and Reviews)
- Vol. 31 (4), 468-475
- https://doi.org/10.1109/5326.983930
Abstract
This paper presents new techniques for biosequence classification, with a focus on recognizing E. coli promoters in DNA. Specifically, given an unlabeled DNA sequence S, the goal is to determine whether or not S is an E. coli promoter. We use an expectation maximization (EM) algorithm to locate the -35 and -10 binding sites in an E. coli promoter sequence. The algorithm differs from previously published EM algorithms in one respect: instead of assuming a uniform distribution for the lengths of the two spacers (between the -35 and -10 binding sites, and between the -10 binding site and the transcriptional start site), it deduces the probability distribution of these lengths. Based on the located binding sites, we select features in each E. coli promoter sequence according to their information content and represent them using an orthogonal encoding method. We then feed the features to a neural network for promoter recognition. Empirical studies show that the proposed approach achieves good performance on different data sets.
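To make the feature-selection and encoding steps concrete, here is a minimal Python sketch. It is not the paper's implementation: the abstract does not give the exact encoding, the information-content formula, or the number of positions kept. The sketch assumes the common reading of "orthogonal encoding" as one 4-bit one-hot vector per base, a per-column information content in bits against a uniform background of 0.25, and a hypothetical choice to keep the three most informative columns. The function names and the toy sites are made up for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def orthogonal_encode(seq):
    """One-hot encode a DNA string: 4 bits per base, concatenated.

    Bases outside ACGT (for example N) become four zeros (an assumption).
    """
    vec = []
    for base in seq.upper():
        vec.extend(1 if base == b else 0 for b in ALPHABET)
    return vec

def information_content(aligned, background=0.25):
    """Per-column information in bits: sum over bases of p * log2(p / background)."""
    n = len(aligned)
    info = []
    for col in range(len(aligned[0])):
        counts = Counter(s[col] for s in aligned)
        bits = 0.0
        for b in ALPHABET:
            p = counts[b] / n
            if p > 0:
                bits += p * math.log2(p / background)
        info.append(bits)
    return info

# Made-up aligned sites standing in for located -10 binding sites.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
info = information_content(sites)

# Hypothetical selection rule: keep the 3 most informative columns.
top = sorted(range(len(info)), key=lambda i: info[i], reverse=True)[:3]
keep = sorted(top)

# Each site becomes a fixed-length 0/1 vector suitable as network input.
features = [orthogonal_encode("".join(s[i] for i in keep)) for s in sites]
```

Fully conserved columns score 2 bits under the uniform background, so they are selected first; the resulting vectors have 4 entries per kept position and could be passed to any feed-forward classifier.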