On Distinguishing Unique Combinations in Biological Sequences

Abstract
The problem of defining combinations of variants unique to a sequence is efficiently addressed as a set covering computation. The unique-combinations method is introduced, which identifies patterns in biological sequence data that distinguish a sequence from a group of other sequences. The method is then extended to describe features that are consistently present in one group of sequences but absent from a second group. The approach is incorporated into a novel analytical tool designed for studies of polymorphic sequence data, such as mitochondrial, human leukocyte antigen (HLA), or viral pathogen sequences. The unique-combinations method is well suited to applications in medical genetics and evolutionary genetics. An example implementation of the unique-combinations method yields greatly improved risk assessment for insulin-dependent diabetes mellitus (IDDM) from amino acid patterns isolated in an analysis of HLA class II DQA1-DQB1 patient and control genotypes.
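
To illustrate the set covering formulation, the following is a minimal sketch and not the implementation described in this paper. It assumes aligned sequences of equal length and treats each position as covering the set of other sequences that differ from the target at that position; a unique combination is then a smallest set of positions that together cover every other sequence. The function name unique_combinations and the exhaustive search over position subsets are assumptions made for illustration. The search is exponential in the worst case, so it is practical only for small inputs.

```python
from itertools import combinations


def unique_combinations(target, others):
    """Return every smallest set of (position, residue) pairs that occurs
    in `target` but in none of `others`.

    Hypothetical helper: assumes all sequences are aligned and of equal length.
    Positions are 0-based.
    """
    # Each position "covers" the other sequences that differ from the target there.
    covers = {
        pos: {i for i, seq in enumerate(others) if seq[pos] != target[pos]}
        for pos in range(len(target))
    }
    universe = set(range(len(others)))

    # Positions where no other sequence differs can never help.
    informative = [pos for pos, covered in covers.items() if covered]

    # Try subsets of increasing size; the first size with a full cover is minimal.
    for size in range(1, len(informative) + 1):
        hits = [
            combo
            for combo in combinations(informative, size)
            if set().union(*(covers[pos] for pos in combo)) == universe
        ]
        if hits:
            return [[(pos, target[pos]) for pos in combo] for combo in hits]

    # No cover exists, e.g. when another sequence is identical to the target.
    return []
```

For example, with target "ACGT" and others "ACGA", "TCGA", and "AGCT", no single position distinguishes the target from all three, but two pairs of positions do: C at position 1 with T at position 3, and G at position 2 with T at position 3.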