Kepler + Hadoop

16 November 2009

conference paper
Published by Association for Computing Machinery (ACM)

https://doi.org/10.1145/1645164.1645176

Abstract

MapReduce provides a parallel and scalable programming model for data-intensive business and scientific applications. MapReduce and its de facto open-source implementation, Hadoop, support parallel processing of large datasets, with capabilities including automatic data partitioning and distribution, load balancing, and fault-tolerance management. Meanwhile, scientific workflow management systems such as Kepler, Taverna, Triana, and Pegasus have demonstrated their ability to help domain scientists solve scientific problems by synthesizing different data and computing resources. By integrating Hadoop with Kepler, we provide an easy-to-use architecture that lets users compose and execute MapReduce applications in Kepler scientific workflows. Our implementation demonstrates that many characteristics of scientific workflow management systems, such as a graphical user interface and component reuse and sharing, complement those of MapReduce. Using the presented Hadoop components in Kepler, scientists can apply MapReduce to their domain-specific problems and connect the resulting MapReduce applications with other tasks in a workflow through the Kepler graphical user interface. We validate the feasibility of our approach with a word count use case.
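
For readers unfamiliar with the word count use case mentioned in the abstract, the following is a minimal sketch of the MapReduce pattern in plain Python. It is an illustration only: it uses neither the Hadoop API nor the Kepler components described in the paper, it runs in a single process, and the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key.

    In Hadoop the framework does this across machines; here it is
    simulated in memory.
    """
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce step: sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_words(line))
    grouped = shuffle(pairs)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    sample = ["Kepler runs Hadoop", "Hadoop runs MapReduce"]
    print(word_count(sample))
```

Only the map and reduce functions carry application logic; partitioning, grouping, and fault tolerance are the parts a framework such as Hadoop takes over in a real deployment.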

Funding Information

Division of Biological Infrastructure (DBI 0619060)
Office of Cyberinfrastructure (OCI-0722079)
U.S. Department of Energy (DE-FC02-07ER25811)

Cited by 70 articles