Kepler + Hadoop
- 16 November 2009
- conference paper
- Published by Association for Computing Machinery (ACM)
Abstract
MapReduce provides a parallel and scalable programming model for data-intensive business and scientific applications. MapReduce and its de facto open source implementation, Hadoop, support parallel processing on large datasets with capabilities including automatic data partitioning and distribution, load balancing, and fault tolerance management. Meanwhile, scientific workflow management systems, e.g., Kepler, Taverna, Triana, and Pegasus, have demonstrated their ability to help domain scientists solve scientific problems by synthesizing different data and computing resources. By integrating Hadoop with Kepler, we provide an easy-to-use architecture that helps users compose and execute MapReduce applications in Kepler scientific workflows. Our implementation demonstrates that many characteristics of scientific workflow management systems, e.g., graphical user interface and component reuse and sharing, are complementary to those of MapReduce. Using the presented Hadoop components in Kepler, scientists can easily utilize MapReduce in their domain-specific problems and connect them with other tasks in a workflow through the Kepler graphical user interface. We validate the feasibility of our approach via a word count use case.
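The word count use case mentioned in the abstract can be illustrated with a minimal, single-process sketch of the MapReduce pattern. This is not the paper's Kepler actors or Hadoop's API: it is plain Python, and the names `map_words`, `shuffle`, `reduce_counts`, and `word_count` are hypothetical, chosen only for illustration. In a real Hadoop job, the grouping step and the distribution of work across nodes are handled by the framework.

```python
from collections import defaultdict
from itertools import chain


def map_words(line):
    """Map step: emit a (word, 1) pair for each whitespace-separated token."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle phase."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce step: sum all counts emitted for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce sequentially over an iterable of lines."""
    pairs = chain.from_iterable(map_words(line) for line in lines)
    return dict(reduce_counts(word, counts) for word, counts in shuffle(pairs).items())
```

Calling `word_count` on a list of text lines returns a dictionary mapping each lowercased word to its number of occurrences.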
Funding Information
- Division of Biological Infrastructure (DBI 0619060)
- Office of Cyberinfrastructure (OCI-0722079)
- U.S. Department of Energy (DE-FC02-07ER25811)
This publication has 11 references indexed in Scilit:
- Accelerating Parameter Sweep Workflows by Utilizing Ad-hoc Network Computing Resources: An Ecological Example. Published by Institute of Electrical and Electronics Engineers (IEEE), 2009
- A MapReduce-Enabled Scientific Workflow Composition Framework. Published by Institute of Electrical and Electronics Engineers (IEEE), 2009
- CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics, 2009
- MRGIS: A MapReduce-Enabled High Performance Workflow System for GIS. Published by Institute of Electrical and Electronics Engineers (IEEE), 2008
- Data-Intensive Computing in the 21st Century. Computer, 2008
- Pegasus: Mapping Large-Scale Workflows to Distributed Resources. Published by Springer Science and Business Media LLC, 2007
- Advanced data flow support for scientific grid workflow applications. Published by Association for Computing Machinery (ACM), 2007
- Introduction and evaluation of Martlet. Published by Association for Computing Machinery (ACM), 2007
- Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics, 2004
- Triana Applications within Grid Computing and Peer to Peer Environments. Journal of Grid Computing, 2003