SparkSW: Scalable Distributed Computing System for Large-Scale Biological Sequence Alignment
- 1 May 2015
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE) in 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing
- p. 845-852
- https://doi.org/10.1109/ccgrid.2015.55
Abstract
The Smith-Waterman (SW) algorithm is universally used for database searches owing to its high sensitivity. The widespread impact of the algorithm is reflected in the more than 8000 citations it has received over the past decades. However, the algorithm has prohibitively high time and space complexity, and so poses significant computational challenges. Apache Spark is an increasingly popular, fast big data analytics engine that has been highly successful in implementing large-scale data-intensive applications on commercial hardware. This paper presents the first reported system that implements the SW algorithm on an Apache Spark based distributed computing framework, using a couple of off-the-shelf workstations; the system is named SparkSW. The scalability and load-balancing efficiency of the system are investigated using a realistic ultra-large database from the state-of-the-art UniRef100. The experimental results indicate that 1) SparkSW balances load adaptively across parallel workloads and scales extremely well as computing resources increase, and 2) SparkSW provides a fast and universal option for highly sensitive biological sequence alignment. The success of SparkSW also shows that the Apache Spark framework provides an efficient way to cope with the ever-increasing sizes of biological sequence databases, especially those generated by second-generation sequencing technologies.
This publication has 21 references indexed in Scilit:
- CloudDOE: A User-Friendly Tool for Deploying Hadoop Clouds and Analyzing High-Throughput Sequencing Data with MapReduce. PLOS ONE, 2014
- SparkSeq: fast, scalable and cloud-ready tool for the interactive genomic data analysis with nucleotide precision. Bioinformatics, 2014
- SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop. Bioinformatics, 2013
- BioPig: a Hadoop-based analytic toolkit for large-scale sequence data. Bioinformatics, 2013
- The Hadoop Distributed File System. Published by Institute of Electrical and Electronics Engineers (IEEE), 2010
- Google’s MapReduce programming model — Revisited. Science of Computer Programming, 2008
- Streaming Algorithms for Biological Sequence Alignment on GPUs. IEEE Transactions on Parallel and Distributed Systems, 2007
- 160-fold acceleration of the Smith-Waterman algorithm using a field programmable gate array (FPGA). BMC Bioinformatics, 2007
- Biological Sequence Analysis. Published by Cambridge University Press (CUP), 1998
- Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences of the United States of America, 1988