SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop

Open Access

22 October 2013

journal article
research article
Published by Oxford University Press (OUP) in Bioinformatics

Vol. 30 (1), 119-120
https://doi.org/10.1093/bioinformatics/btt601

Abstract

Summary: Hadoop MapReduce-based approaches have become increasingly popular due to their scalability in processing large sequencing datasets. However, as these methods typically require in-depth expertise in Hadoop and Java, they are still out of reach of many bioinformaticians. To solve this problem, we have created SeqPig, a library and a collection of tools to manipulate, analyze and query sequencing datasets in a scalable and simple manner. SeqPig scripts use the Hadoop-based distributed scripting engine Apache Pig, which automatically parallelizes and distributes data processing tasks. We demonstrate SeqPig’s scalability over many computing nodes and illustrate its use with example scripts.

Availability and Implementation: Available under the open source MIT license at http://sourceforge.net/projects/seqpig/

Contact: andre.schumacher@yahoo.com

Supplementary information: Supplementary data are available at Bioinformatics online.
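The automatic parallelization described in the summary follows the general MapReduce pattern: the input is split into chunks, each chunk is processed independently, and the partial results are merged. The sketch below illustrates that pattern in plain Python for a simple base-frequency count. It is not SeqPig or Apache Pig code; the function names (`split_into_chunks`, `map_base_counts`, `reduce_counts`, `base_counts`) and the toy reads are hypothetical, and a local thread pool stands in for the cluster nodes that Hadoop would use.

```python
from collections import Counter
from functools import reduce
from multiprocessing.dummy import Pool  # thread pool; a stand-in for cluster nodes


def split_into_chunks(reads, n_chunks):
    """Partition the reads into n_chunks slices, one per worker."""
    return [reads[i::n_chunks] for i in range(n_chunks)]


def map_base_counts(chunk):
    """Map step: count bases in one chunk, independently of all other chunks."""
    counts = Counter()
    for read in chunk:
        counts.update(read)  # adds one count per character of the read
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial base counts."""
    return left + right


def base_counts(reads, n_workers=4):
    """Count bases over all reads using a map step followed by a reduce step."""
    chunks = split_into_chunks(reads, n_workers)
    with Pool(n_workers) as pool:
        partial_counts = pool.map(map_base_counts, chunks)
    return reduce(reduce_counts, partial_counts, Counter())


if __name__ == "__main__":
    # Hypothetical toy reads; real input would come from BAM or FASTQ files.
    toy_reads = ["ACGT", "AACC", "GGTT", "ACGN"]
    print(base_counts(toy_reads))
```

Because the map step touches each chunk in isolation, the result does not depend on how many workers are used. That independence is what allows an engine such as Pig to distribute the same logic over many nodes without changes to the script.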

Keywords

SOFTWARE DESIGN

