Use of Runs Statistics for Pattern Recognition in Genomic DNA Sequences
- 1 January 2004
- journal article
- research article
- Published by Mary Ann Liebert Inc in Journal of Computational Biology
- Vol. 11 (1), 107-124
- https://doi.org/10.1089/106652704773416911
Abstract
In this article, the use of the finite Markov chain imbedding (FMCI) technique to study patterns in DNA under a hidden Markov model (HMM) is introduced. With a vision of studying multiple runs-related statistics simultaneously under an HMM through the FMCI technique, this work establishes an investigation of a bivariate runs statistic under a binary HMM for DNA pattern recognition. An FMCI-based recursive algorithm is derived and implemented for the determination of the exact distribution of this bivariate runs statistic under an independent identically distributed (IID) framework, a Markov chain (MC) framework, and a binary HMM framework. With this algorithm, we have studied the distributions of the bivariate runs statistic under different binary HMM parameter sets; probabilistic profiles of runs are created and shown to be useful for trapping HMM maximum likelihood estimates (MLEs). This MLE-trapping scheme offers good initial estimates to jump-start the expectation-maximization (EM) algorithm in HMM parameter estimation and helps prevent the EM estimates from landing on a local maximum or a saddle point. Applications of the bivariate runs statistic and the probabilistic profiles in conjunction with binary HMMs for pattern recognition in genomic DNA sequences are illustrated via case studies on DNA bendability signals using human DNA data.Keywords
This publication has 21 references indexed in Scilit:
- Parameter expansion to accelerate EM: the PX-EM algorithmBiometrika, 1998
- DNA structure in human RNA polymerase II promotersJournal of Molecular Biology, 1998
- Prediction of complete gene structures in human genomic DNAJournal of Molecular Biology, 1997
- On the conditional and unconditional distributions of the number of runs in a sample from a multisymbol alphabetCommunications in Statistics - Simulation and Computation, 1997
- On Runs and Longest Run Tests: A Method of Finite Markov Chain ImbeddingJournal of the American Statistical Association, 1996
- Over- and Underrepresentation of Short DNA Words in Herpesvirus GenomesJournal of Computational Biology, 1996
- The ECME algorithm: A simple extension of EM and ECM with faster monotone convergenceBiometrika, 1994
- Distribution Theory of Runs: A Markov Chain ApproachJournal of the American Statistical Association, 1994
- A POWER FUNCTION FOR TESTS OF RANDOMNESS IN A SEQUENCE OF ALTERNATIVESBiometrika, 1947
- The Distribution Theory of RunsThe Annals of Mathematical Statistics, 1940