Lossless Compression of Chemical Fingerprints Using Integer Entropy Codes Improves Storage and Retrieval

30 October 2007

journal article
research article
Published by American Chemical Society (ACS) in Journal of Chemical Information and Modeling

Vol. 47 (6), 2098-2109
https://doi.org/10.1021/ci700200n

Abstract

Many modern chemoinformatics systems for small molecules rely on large fingerprint vector representations, where the components of the vector record the presence or number of occurrences in the molecular graphs of particular combinatorial features, such as labeled paths or labeled trees. These large fingerprint vectors are often compressed to much shorter fingerprint vectors using a lossy compression scheme based on a simple modulo procedure. Here, we combine statistical models of fingerprints with integer entropy codes, such as Golomb and Elias codes, to encode the indices or the run lengths of the fingerprints. After reordering the fingerprint components by decreasing frequency order, the indices are monotone-increasing and the run lengths are quasi-monotone-increasing, and both exhibit power-law distribution trends. We take advantage of these statistical properties to derive new efficient, lossless, compression algorithms for monotone integer sequences: monotone value (MOV) coding and monotone length (MOL) coding. In contrast to lossy systems that use 1024 or more bits of storage per molecule, we can achieve lossless compression of long chemical fingerprints based on circular substructures in slightly over 300 bits per molecule, close to the Shannon entropy limit, using a MOL Elias Gamma code for run lengths. The improvement in storage comes at a modest computational cost. Furthermore, because the compression is lossless, uncompressed similarity (e.g., Tanimoto) between molecules can be computed exactly from their compressed representations, leading to significant improvements in retrieval performance, as shown on six benchmark data sets of druglike molecules.
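The ideas in the abstract can be illustrated with a minimal sketch. It is not the MOV or MOL coding of the paper, which this record does not specify; it applies a plain Elias Gamma code to the gaps (run lengths) between consecutive set-bit indices of a sparse binary fingerprint. All function names are hypothetical, and the compressed form is held as a string of '0' and '1' characters for readability, where a real implementation would pack bits. The `fold` helper is one assumed reading of the "simple modulo procedure" and is included only for contrast.

```python
def elias_gamma_encode(n: int) -> str:
    """Elias Gamma code of a positive integer, returned as a bit string."""
    if n < 1:
        raise ValueError("Elias Gamma codes are defined for integers >= 1")
    binary = bin(n)[2:]  # binary form; the leading bit is always 1
    # Prefix with (length - 1) zeros so the decoder knows how many bits to read.
    return "0" * (len(binary) - 1) + binary


def compress(indices) -> str:
    """Encode distinct 0-based set-bit indices as Gamma-coded gaps."""
    bits, prev = [], -1
    for i in sorted(set(indices)):
        bits.append(elias_gamma_encode(i - prev))  # every gap is >= 1
        prev = i
    return "".join(bits)


def decompress(bitstring: str) -> list:
    """Recover the sorted set-bit indices from a Gamma-coded gap string."""
    indices, prev, pos = [], -1, 0
    while pos < len(bitstring):
        zeros = 0
        while bitstring[pos] == "0":  # count the unary length prefix
            zeros += 1
            pos += 1
        gap = int(bitstring[pos:pos + zeros + 1], 2)
        pos += zeros + 1
        prev += gap
        indices.append(prev)
    return indices


def tanimoto(a, b) -> float:
    """Tanimoto (Jaccard) similarity of two collections of set-bit indices."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def fold(indices, size: int = 1024) -> list:
    """Lossy modulo folding: distinct indices may collide on the same bit."""
    return sorted({i % size for i in indices})
```

For example, with the made-up fingerprints `[0, 3, 4, 1030]` and `[3, 4, 6, 2054]`, decompressing the Gamma-coded gaps returns the original indices, so the Tanimoto similarity computed from them is the exact value 2/6. After folding both to 1024 bits, indices 6, 1030 and 2054 collide and the similarity becomes 3/4, which shows how the lossy scheme can distort retrieval scores.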


Cited by 43 articles