Entropy predicts sensitivity of pseudorandom seeds

Open Access

22 May 2023

journal article
Published by Cold Spring Harbor Laboratory in Genome Research

Vol. 33 (7), 1162-1174
https://doi.org/10.1101/gr.277645.123

Abstract

Seed design is important for sequence similarity search applications such as read mapping and average nucleotide identity (ANI) estimation. Although k-mers and spaced k-mers are likely the best-known and most widely used seeds, their sensitivity suffers at high error rates, particularly when indels are present. Recently, we developed a pseudorandom seeding construct, strobemers, which was empirically shown to have high sensitivity even at high indel rates. However, that study did not explain why. In this study, we propose a model to estimate the entropy of a seed and find that seeds with high entropy, according to our model, in most cases have high match sensitivity. The seed randomness–sensitivity relationship we discovered explains why some seeds perform better than others and provides a framework for designing even more sensitive seeds. We also present three new strobemer seed constructs: mixedstrobes, altstrobes, and multistrobes. We use both simulated and biological data to show that our new seed constructs improve sequence-matching sensitivity over other strobemers. We show that the three new seed constructs are useful for read mapping and ANI estimation. For read mapping, we implement strobemers in minimap2 and observe 30% faster alignment time and 0.2% higher accuracy than with k-mers when mapping reads at high error rates. For ANI estimation, we find that higher-entropy seeds yield a higher rank correlation between estimated and true ANI.

Funding Information

Swedish Research Council (2018-05973)
Swedish Research Council
Vetenskapsrådet (2021-04000)

