Traffic-Aware Geo-Distributed Big Data Analytics with Predictable Job Completion Time
Open Access
- 8 November 2016
- journal article
- research article
- Published by Institute of Electrical and Electronics Engineers (IEEE) in IEEE Transactions on Parallel and Distributed Systems
- Vol. 28 (6), 1785-1796
- https://doi.org/10.1109/tpds.2016.2626285
Abstract
Big data analytics has attracted close attention from both industry and academia because of its benefits in cost reduction and better decision making. With the fast growth of various global services, there is an increasing need for big data analytics across multiple data centers (DCs) located in different countries or regions. This calls for a cross-DC data processing platform optimized for the geo-distributed computing environment. Although some recent efforts have been made toward geo-distributed big data analytics, they cannot guarantee predictable job completion time, and they would incur excessive traffic over the inter-DC network, a scarce resource shared by many applications. In this paper, we study how to minimize the inter-DC traffic generated by MapReduce jobs that target geo-distributed big data, while providing predictable job completion time. To achieve this goal, we formulate an optimization problem that jointly considers input data movement and task placement. Furthermore, we guarantee predictable job completion time by applying the chance-constrained optimization technique, such that a MapReduce job finishes within a predefined job completion time with high probability. To evaluate the performance of our proposal, we conduct extensive simulations using real traces generated by a set of queries on Hive. The results show that our proposal reduces inter-DC traffic by 55 percent compared with centralized processing, which aggregates all data in a single data center.
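The chance-constrained guarantee mentioned in the abstract can be written in a generic form. This is a sketch only and not the paper's actual formulation; all symbols are assumed here for illustration: x stands for the joint input data movement and task placement decisions, V(x) for the resulting inter-DC traffic, T(x) for the (random) job completion time, D for the predefined completion-time bound, and ε for the tolerated probability of missing it.

```latex
% Generic chance-constrained program (illustrative symbols, not taken from the paper)
\min_{x} \; V(x)
\quad \text{subject to} \quad
\Pr\bigl[\, T(x) \le D \,\bigr] \;\ge\; 1 - \epsilon
```

A smaller ε corresponds to a stricter guarantee: the job must finish within D with probability at least 1 − ε.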
Funding Information
- JSPS KAKENHI (16K16038)
- NSFC (61572262)
- NSF of Jiangsu Province (BK20141427)
- Australian Research Council Discovery (A7921)