SparkSW: Scalable Distributed Computing System for Large-Scale Biological Sequence Alignment

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE) in 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing

p. 845-852
https://doi.org/10.1109/ccgrid.2015.55

Abstract

The Smith-Waterman (SW) algorithm is widely used for database searches owing to its high sensitivity. The widespread impact of the algorithm is reflected in the more than 8000 citations it has received over the past decades. However, its time and space complexity is prohibitively high, which poses significant computational challenges. Apache Spark is an increasingly popular, fast big data analytics engine that has been highly successful in implementing large-scale data-intensive applications on commercial hardware. This paper presents SparkSW, the first reported system that implements the SW algorithm on an Apache Spark based distributed computing framework, running on a couple of off-the-shelf workstations. The scalability and load-balancing efficiency of the system are investigated using a realistic, ultra-large database drawn from UniRef100. The experimental results indicate that 1) SparkSW balances parallel workloads adaptively and scales extremely well as computing resources increase, and 2) SparkSW provides a fast and universal option for highly sensitive biological sequence alignment. The success of SparkSW also shows that the Apache Spark framework offers an efficient way to cope with the ever-increasing sizes of biological sequence databases, especially those generated by second-generation sequencing technologies.
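
To make the cost mentioned in the abstract concrete, the following is a minimal sketch of Smith-Waterman local alignment scoring in plain Python. It is not the SparkSW implementation, which the abstract does not describe. The function names `sw_score` and `search`, the scoring values (match 2, mismatch -1, gap -2), and the linear gap penalty are all assumptions chosen for illustration.

```python
def sw_score(query, subject, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score (linear gap penalty).

    Scoring values are illustrative assumptions, not taken from the paper.
    Only the previous row is kept, so memory is O(len(subject)),
    while time remains O(len(query) * len(subject)).
    """
    prev = [0] * (len(subject) + 1)
    best = 0
    for q in query:
        curr = [0]  # column 0 of every row is 0
        for j, s in enumerate(subject, start=1):
            diag = prev[j - 1] + (match if q == s else mismatch)
            # Local alignment: a cell never drops below 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def search(query, database):
    """Score the query against every (id, sequence) record, best hit first.

    Hypothetical helper: each record is scored independently of the others.
    """
    hits = [(seq_id, sw_score(query, seq)) for seq_id, seq in database]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

Because each database record is scored independently in `search`, a database search of this kind is naturally data-parallel. In a Spark setting the per-record scoring could plausibly be expressed as a map over a distributed collection, although how SparkSW actually partitions and balances the work is not stated here.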
