Aligning Sequences by Minimum Description Length

Open Access

1 January 2007

journal article
research article
Published by Springer Science and Business Media LLC in EURASIP Journal on Bioinformatics and Systems Biology

Vol. 2007 (1), 1-14
https://doi.org/10.1155/2007/72936

Abstract

This paper presents a new information theoretic framework for aligning sequences in bioinformatics. A transmitter compresses a set of sequences by constructing a regular expression that describes the regions of similarity in the sequences. To retrieve the original set of sequences, a receiver generates all strings that match the expression. An alignment algorithm uses minimum description length to encode and explore alternative expressions; the expression with the shortest encoding provides the best overall alignment. When two substrings contain letters that are similar according to a substitution matrix, a code length function based on conditional probabilities defined by the matrix will encode the substrings with fewer bits. In one experiment, alignments produced with this new method were found to be comparable to alignments from CLUSTALW. A second experiment measured the accuracy of the new method on pairwise alignments of sequences from the BAliBASE alignment benchmark.

Keywords

This publication has 34 references indexed in Scilit:

ApiDB: integrated resources for the apicomplexan bioinformatics resource center
Nucleic Acids Research, 2006
Homology assessment and molecular sequence alignment
Journal of Biomedical Informatics, 2006
The fragment assembly string graph
Bioinformatics, 2005
Statistical evaluation and comparison of a pairwise alignment algorithm that a priori assigns the number of gaps rather than employing gap penalties
Bioinformatics, 2004
Empirical determination of effective gap penalties for sequence comparison
Bioinformatics, 2002
Information Content of Individual Genetic Sequences
Journal of Theoretical Biology, 1997
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
Nucleic Acids Research, 1997
Comparison of methods for searching protein sequence databases
Protein Science, 1995
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice
Nucleic Acids Research, 1994
Sequence alignment and penalty choice: Review of concepts, case studies and implications
Journal of Molecular Biology, 1994

Cited by 2 articles