Combinatorial pattern discovery for scientific data

24 May 1994

journal article
Published by Association for Computing Machinery (ACM) in ACM SIGMOD Record

Vol. 23 (2), 115-125
https://doi.org/10.1145/191843.191863

Abstract

Suppose you are given a set of natural entities (e.g., proteins, organisms, weather patterns, etc.) that possess some important common externally observable properties. You also have a structural description of the entities (e.g., sequence, topological, or geometrical data) and a distance metric. Combinatorial pattern discovery is the activity of finding patterns in the structural data that might explain these common properties based on the metric. This paper presents an example of combinatorial pattern discovery: the discovery of patterns in protein databases. The structural representation we consider are strings and the distance metric is string edit distance permitting variable length don't cares. Our techniques incorporate string matching algorithms and novel heuristics for discovery and optimization, most of which generalize to other combinatorial structures. Experimental results of applying the techniques to both generated data and functionally related protein families obtained from the Cold Spring Harbor Laboratory show the effectiveness of the proposed techniques. When we apply the discovered patterns to perform protein classification, they give information that is complementary to the best protein classifier available today.

Keywords

This publication has 21 references indexed in Scilit:

Approximate Tree Matching in the Presence of Variable Length Don′t Cares
Journal of Algorithms, 1994
A system for approximate tree matching
IEEE Transactions on Knowledge and Data Engineering, 1994
Fast text searching
Communications of the ACM, 1992
Statistical estimators for aggregate relational algebra queries
ACM Transactions on Database Systems, 1991
The human genome project and informatics
Communications of the ACM, 1991
Database systems: achievements and opportunities
Communications of the ACM, 1991
New techniques for best-match retrieval
ACM Transactions on Information Systems, 1990
Computational approaches to discovering semantics in molecular biology
Proceedings of the IEEE, 1989
Fast parallel and serial approximate string matching
Journal of Algorithms, 1989
Multiple sequence alignment
Journal of Molecular Biology, 1986

Cited by 38 articles