Optimal structure for automatic processing of DNA sequences

Abstract
The faithful recovery of the base sequence in automatic deoxyribonucleic acid (DNA) sequencing depends fundamentally on the underlying statistics of the DNA electrophoresis time series. Current DNA sequencing algorithms are heuristic in nature and make only modest use of statistical information. In this paper, a formal statistical model of the DNA time series is presented and then used to construct the optimal maximum-likelihood (ML) processor. The resulting DNA-ML algorithm features Kalman prediction of peak locations, peak parameter estimation, whitened waveform comparison, and multiple-hypothesis processing using the M-algorithm. Properties of the algorithm are examined using both simulated and real data. Model parameters of critical importance are identified, along with their impact on different types of error mechanisms, such as insertions and deletions. The statistical model of the DNA time series and the structure of the DNA-ML algorithm provide a basis for future investigation and refinement of DNA sequencing techniques.
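
The multiple-hypothesis step named in the abstract can be illustrated with a minimal sketch of the M-algorithm, a breadth-first tree search that extends every surviving hypothesis by each possible base and keeps only the M lowest-cost candidates. This is not the paper's implementation: the function names (`m_algorithm`, `squared_error_cost`), the four-channel observation format, and the one-hot squared-error cost are all assumptions made for illustration, with the cost standing in for the whitened waveform comparison the paper describes.

```python
from typing import Callable, List, Sequence, Tuple

BASES = "ACGT"  # assumed channel order for the toy observations

Hypothesis = Tuple[float, str]  # (cumulative cost, base sequence so far)


def m_algorithm(
    observations: Sequence[Sequence[float]],
    branch_cost: Callable[[str, str, Sequence[float]], float],
    m: int = 4,
    symbols: str = BASES,
) -> List[Hypothesis]:
    """Keep the m lowest-cost sequence hypotheses after each observation."""
    survivors: List[Hypothesis] = [(0.0, "")]
    for obs in observations:
        # Extend every surviving hypothesis by every possible symbol.
        candidates = [
            (cost + branch_cost(prefix, s, obs), prefix + s)
            for cost, prefix in survivors
            for s in symbols
        ]
        # Prune: retain only the m best candidates (lowest cumulative cost).
        candidates.sort(key=lambda c: c[0])
        survivors = candidates[:m]
    return survivors


def squared_error_cost(prefix: str, symbol: str, obs: Sequence[float]) -> float:
    """Hypothetical stand-in metric: squared error against a one-hot template.

    The prefix is unused here; a model with inter-peak dependence would use it.
    """
    return sum(
        (obs[i] - (1.0 if base == symbol else 0.0)) ** 2
        for i, base in enumerate(BASES)
    )


# Toy four-channel peak amplitudes, one tuple per called position (made up).
toy_peaks = [
    (0.9, 0.1, 0.0, 0.0),
    (0.1, 0.2, 0.8, 0.0),
    (0.0, 0.0, 0.1, 0.7),
]
best_cost, best_sequence = m_algorithm(toy_peaks, squared_error_cost, m=2)[0]
print(best_sequence, round(best_cost, 2))
```

Because the toy cost ignores the prefix, pruning cannot discard the eventual winner here; the value of keeping M survivors shows up only when the branch cost depends on earlier decisions, as it would with predicted peak locations.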
