Wrangler

3 November 2014

conference paper
Published by Association for Computing Machinery (ACM)

https://doi.org/10.1145/2670979.2671005

Abstract

Straggler tasks remain a major hurdle to faster completion of data-intensive applications running on modern data-processing frameworks. Existing straggler mitigation techniques are inefficient because they are reactive and replicative: they rely on a wait-speculate-re-execute mechanism, which delays straggler detection and wastes resources. Existing proactive techniques also over-utilize resources due to replication, and existing modeling-based approaches are hard to rely on for production-level adoption because of modeling errors. We present Wrangler, a system that proactively avoids situations that cause stragglers. Wrangler automatically learns to predict such situations using a statistical learning technique based on cluster resource utilization counters. Furthermore, Wrangler attaches a confidence measure to these predictions to overcome the modeling-error problem; this confidence measure is then exploited to achieve reliable task scheduling. In particular, by using these predictions to balance delay in task scheduling against the potential for idling of resources, Wrangler reduces overall job completion time. For production-level workloads from Facebook and Cloudera's customers, Wrangler improves the 99th percentile job completion time by up to 61% compared to speculative execution, a widely used straggler mitigation technique. Moreover, Wrangler achieves this speedup while significantly reducing resource consumption (by up to 55%).
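
The trade-off the abstract describes, delaying a task versus leaving resources idle, can be illustrated with a minimal Python sketch. This is not Wrangler's implementation: the `Prediction` type, the `pick_node` function, the confidence threshold, and the delay budget are all hypothetical names and parameters chosen for illustration, and the per-node predictions are assumed to come from some already-trained model.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Prediction:
    """Hypothetical per-node model output (illustrative, not from the paper)."""
    will_straggle: bool  # predicted label for a task placed on this node now
    confidence: float    # confidence in that label, in [0, 1]


def pick_node(predictions: Dict[str, Prediction],
              threshold: float,
              waited_s: float,
              max_delay_s: float) -> Optional[str]:
    """Return the node to run the task on, or None to delay the task."""
    if not predictions:
        return None  # no candidate nodes at all

    # Nodes confidently predicted NOT to cause a straggler.
    safe = {name: p for name, p in predictions.items()
            if not p.will_straggle and p.confidence >= threshold}
    if safe:
        return max(safe, key=lambda name: safe[name].confidence)

    # No confident "safe" node: delay, but only within the delay budget,
    # so that resources are not left idle indefinitely.
    if waited_s < max_delay_s:
        return None

    # Assumed fallback once the budget is spent: take the least risky node.
    def risk(p: Prediction) -> float:
        return p.confidence if p.will_straggle else 1.0 - p.confidence

    return min(predictions, key=lambda name: risk(predictions[name]))
```

In this sketch, raising `threshold` makes placement more conservative at the cost of more delayed tasks, while `max_delay_s` bounds how long a task can wait before it is placed anyway.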

Funding Information

Defense Advanced Research Projects Agency (FA8750-12-2-0331)
Division of Computing and Communication Foundations (CCF-1139158)
LBNL Award (7076018)

Cited by 68 articles