Wrangler
- 3 November 2014
- conference paper
- Published by Association for Computing Machinery (ACM)
Abstract
Straggler tasks continue to be a major hurdle in achieving faster completion of data-intensive applications running on modern data-processing frameworks. Existing straggler mitigation techniques are inefficient due to their reactive and replicative nature: they rely on a wait-speculate-re-execute mechanism, which leads to delayed straggler detection and inefficient resource utilization. Existing proactive techniques also over-utilize resources due to replication. Existing modeling-based approaches are hard to rely on for production-level adoption due to modeling errors. We present Wrangler, a system that proactively avoids situations that cause stragglers. Wrangler automatically learns to predict such situations using a statistical learning technique based on cluster resource utilization counters. Furthermore, Wrangler attaches a confidence measure to these predictions to overcome the modeling error problems; this confidence measure is then exploited to achieve reliable task scheduling. In particular, by using these predictions to balance delay in task scheduling against the potential for idling of resources, Wrangler achieves a speedup in the overall job completion time. For production-level workloads from Facebook and Cloudera's customers, Wrangler improves the 99th percentile job completion time by up to 61% as compared to speculative execution, a widely used straggler mitigation technique. Moreover, Wrangler achieves this speedup while significantly improving resource consumption (by up to 55%).
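The confidence-gated scheduling idea in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration, not the system described in the paper: the `NodeCounters` fields, the hand-set logistic weights standing in for the learned predictor, the 0.7 confidence threshold, and the `schedule` helper. The sketch only shows the decision rule: a task is withheld from a node when that node is predicted to cause a straggler and the prediction is confident enough; low-confidence predictions are ignored so that resources do not sit idle.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class NodeCounters:
    """Hypothetical snapshot of a node's utilization counters, each in [0, 1]."""
    cpu: float
    memory: float
    disk_io: float


# Assumed, hand-set weights; a real system would learn a model from task traces.
WEIGHTS = {"cpu": 2.0, "memory": 1.5, "disk_io": 2.5}
BIAS = -3.0


def predict_straggler(c: NodeCounters) -> Tuple[bool, float]:
    """Return (will_straggle, confidence) with confidence in [0.5, 1]."""
    score = (BIAS
             + WEIGHTS["cpu"] * c.cpu
             + WEIGHTS["memory"] * c.memory
             + WEIGHTS["disk_io"] * c.disk_io)
    p = 1.0 / (1.0 + math.exp(-score))  # probability-like straggler score
    will_straggle = p >= 0.5
    confidence = p if will_straggle else 1.0 - p
    return will_straggle, confidence


def should_assign(c: NodeCounters, threshold: float = 0.7) -> bool:
    """Withhold a task only when a straggler is predicted with enough confidence."""
    will_straggle, confidence = predict_straggler(c)
    return not (will_straggle and confidence >= threshold)


def schedule(tasks: List[str], nodes: Dict[str, NodeCounters],
             threshold: float = 0.7) -> Tuple[Dict[str, str], List[str]]:
    """Offer one task per node; return (task -> node assignments, delayed tasks)."""
    pending = list(tasks)
    assignments: Dict[str, str] = {}
    for name, counters in nodes.items():
        if not pending:
            break
        if should_assign(counters, threshold):
            assignments[pending.pop(0)] = name
    return assignments, pending


if __name__ == "__main__":
    cluster = {
        "idle-node": NodeCounters(cpu=0.1, memory=0.1, disk_io=0.1),
        "busy-node": NodeCounters(cpu=0.95, memory=0.9, disk_io=0.95),
    }
    assigned, delayed = schedule(["task-1", "task-2"], cluster)
    print("assigned:", assigned)
    print("delayed:", delayed)
```

Raising the threshold makes the scheduler delay fewer tasks (less risk of idle resources, more risk of stragglers); lowering it does the opposite, which is the trade-off the abstract describes.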
Funding Information
- Defense Advanced Research Projects Agency (FA8750-12-2-0331)
- Division of Computing and Communication Foundations (CCF-1139158)
- LBNL Award (7076018)