GenStore: a high-performance in-storage processing system for genome sequence analysis

22 February 2022

conference paper
Published by Association for Computing Machinery (ACM)

https://doi.org/10.1145/3503222.3507702

Abstract

Read mapping is a fundamental step in many genomics applications. It is used to identify potential matches and differences between fragments (called reads) of a sequenced genome and an already known genome (called a reference genome). Read mapping is costly because it needs to perform approximate string matching (ASM) on large amounts of data. To address the computational challenges in genome analysis, many prior works propose various approaches such as accurate filters that select the reads within a dataset of genomic reads (called a read set) that must undergo expensive computation, efficient heuristics, and hardware acceleration. While effective at reducing the amount of expensive computation, all such approaches still require the costly movement of a large amount of data from storage to the rest of the system, which can significantly lower the end-to-end performance of read mapping in conventional and emerging genomics systems. We propose GenStore, the first in-storage processing system designed for genome sequence analysis that greatly reduces both data movement and computational overheads of genome sequence analysis by exploiting low-cost and accurate in-storage filters. GenStore leverages hardware/software co-design to address the challenges of in-storage processing, supporting reads with 1) different properties such as read lengths and error rates, which highly depend on the sequencing technology, and 2) different degrees of genetic variation compared to the reference genome, which highly depend on the genomes that are being compared. Through rigorous analysis of read mapping processes of reads with different properties and degrees of genetic variation, we meticulously design low-cost hardware accelerators and data/computation flows inside a NAND flash-based solid-state drive (SSD).
Our evaluation using a wide range of real genomic datasets shows that GenStore, when implemented in three modern NAND flash-based SSDs, significantly improves the read mapping performance of state-of-the-art software (hardware) baselines by 2.07-6.05× (1.52-3.32×) for read sets with high similarity to the reference genome and 1.45-33.63× (2.70-19.2×) for read sets with low similarity to the reference genome.
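
The general idea of filtering a read set before approximate string matching can be illustrated with a minimal sketch. It is not GenStore's in-storage design, which the abstract does not detail; the function names `filter_reads` and `best_match_distance` and the toy sequences are hypothetical, and the exact-substring check is only one simple filter chosen for illustration.

```python
from typing import List


def best_match_distance(read: str, reference: str) -> int:
    """Approximate string matching (illustrative): smallest edit distance
    between `read` and any substring of `reference`.

    The first DP row is all zeros so a match may start anywhere in the
    reference; the minimum of the last row lets it end anywhere.
    """
    prev = [0] * (len(reference) + 1)
    for i, read_base in enumerate(read, start=1):
        curr = [i]  # aligning i read bases against an empty reference prefix
        for j, ref_base in enumerate(reference, start=1):
            cost = 0 if read_base == ref_base else 1
            curr.append(min(prev[j] + 1,          # read base left unmatched
                            curr[j - 1] + 1,      # reference base skipped
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return min(prev)


def filter_reads(read_set: List[str], reference: str) -> List[str]:
    """Hypothetical filter: keep only reads that do NOT occur exactly in the
    reference, since only those need the expensive ASM step."""
    return [read for read in read_set if read not in reference]


if __name__ == "__main__":
    reference = "ACGTACGTTAGC"              # toy reference genome (assumption)
    read_set = ["ACGT", "GTTA", "ACCT"]     # toy read set (assumption)
    for read in filter_reads(read_set, reference):
        print(read, best_match_distance(read, reference))
```

In this toy setting, two of the three reads match the reference exactly and are dropped by the filter, so only one read reaches the quadratic-time matching step. A filter of this kind saves computation, but the whole read set must still be read from storage before it can be filtered, which is the data movement cost the abstract highlights.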

Cited by 21 articles