SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop
Open Access
- 22 October 2013
- journal article
- research article
- Published by Oxford University Press (OUP) in Bioinformatics
- Vol. 30 (1), 119-120
- https://doi.org/10.1093/bioinformatics/btt601
Abstract
Summary: Hadoop MapReduce-based approaches have become increasingly popular due to their scalability in processing large sequencing datasets. However, as these methods typically require in-depth expertise in Hadoop and Java, they are still out of reach of many bioinformaticians. To solve this problem, we have created SeqPig, a library and a collection of tools to manipulate, analyze and query sequencing datasets in a scalable and simple manner. SeqPig scripts use the Hadoop-based distributed scripting engine Apache Pig, which automatically parallelizes and distributes data processing tasks. We demonstrate SeqPig’s scalability over many computing nodes and illustrate its use with example scripts.
Availability and Implementation: Available under the open source MIT license at http://sourceforge.net/projects/seqpig/
Contact: andre.schumacher@yahoo.com
Supplementary information: Supplementary data are available at Bioinformatics online.