Towards Multi-site Metadata Management for Geographically Distributed Cloud Workflows

1 September 2015

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

p. 294-303
https://doi.org/10.1109/cluster.2015.49

Abstract

With their globally distributed datacenters, clouds now provide an opportunity to run complex large-scale applications on dynamically provisioned, networked and federated infrastructures. However, there is a lack of tools supporting data-intensive applications across geographically distributed sites. For instance, scientific workflows that handle many small files can easily saturate state-of-the-art distributed filesystems based on centralized metadata servers (e.g., HDFS, PVFS). In this paper, we explore several alternative design strategies to efficiently support the execution of existing workflow engines across multi-site clouds, by reducing the cost of metadata operations. These strategies leverage workflow semantics in a 2-level metadata partitioning hierarchy that combines distribution and replication. The system was validated on the Microsoft Azure cloud across 4 EU and US datacenters. The experiments were conducted on 128 nodes using synthetic benchmarks and real-life applications. Compared to a baseline centralized configuration, we observe gains in execution time of up to 28% for a parallel, geo-distributed real-world application (Montage) and up to 50% for a metadata-intensive synthetic benchmark.
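The abstract does not spell out how the 2-level hierarchy is built, so the following is only an illustrative sketch of the general idea, not the system described in the paper. It assumes a hash-based scheme: level 1 maps a file name to a home site, level 2 maps it to a node inside that site, and each entry is also copied to a configurable number of other sites. The class name, method names, site names and node names are all hypothetical.

```python
import hashlib


class TwoLevelMetadataRegistry:
    """Hypothetical sketch: level 1 picks a site, level 2 picks a node in that site."""

    def __init__(self, sites, replicas=1):
        # sites: dict mapping site name -> non-empty list of node names (illustrative)
        self.sites = {name: list(nodes) for name, nodes in sites.items()}
        self.site_names = sorted(self.sites)
        self.replicas = replicas
        self.store = {}  # (site, node) -> {file_name: metadata}

    @staticmethod
    def _hash(key):
        # Stable hash so every client computes the same placement.
        return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

    def placements(self, file_name):
        """Return the home (site, node) first, then replica placements on other sites."""
        h = self._hash(file_name)
        first = h % len(self.site_names)
        count = min(1 + self.replicas, len(self.site_names))
        result = []
        for i in range(count):
            site = self.site_names[(first + i) % len(self.site_names)]
            nodes = self.sites[site]
            result.append((site, nodes[h % len(nodes)]))
        return result

    def put(self, file_name, metadata):
        # Distribution (hashing) plus replication (writing to several sites).
        for place in self.placements(file_name):
            self.store.setdefault(place, {})[file_name] = metadata

    def get(self, file_name, local_site):
        """Prefer a copy held on the caller's site; otherwise use the home site."""
        places = self.placements(file_name)
        ordered = sorted(places, key=lambda p: p[0] != local_site)
        for place in ordered:
            entry = self.store.get(place, {}).get(file_name)
            if entry is not None:
                return place[0], entry
        return None, None
```

In this sketch, a read issued from a site that holds a replica is served locally and avoids a cross-datacenter round trip, while a read from any other site falls back to the home site. A real system would also have to handle consistency between replicas and node failures, which are outside what this example shows.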

Cited by 11 articles