Sparrow

3 November 2013

conference paper
Published by Association for Computing Machinery (ACM)

https://doi.org/10.1145/2517349.2522716

Abstract

Large-scale data analytics frameworks are shifting towards shorter task durations and larger degrees of parallelism to provide low latency. Scheduling highly parallel jobs that complete in hundreds of milliseconds poses a major challenge for task schedulers, which will need to schedule millions of tasks per second on appropriate machines while offering millisecond-level latency and high availability. We demonstrate that a decentralized, randomized sampling approach provides near-optimal performance while avoiding the throughput and availability limitations of a centralized design. We implement and deploy our scheduler, Sparrow, on a 110-machine cluster and demonstrate that Sparrow performs within 12% of an ideal scheduler.
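The abstract names a randomized sampling approach but gives no algorithmic detail. As a rough illustration only, the sketch below shows the generic "power of d choices" placement rule: probe a few randomly chosen workers and enqueue the task on the one with the shortest queue. It is not Sparrow's implementation; the function names, the parameters (num_workers, num_tasks, probes), and the queue-length model are all assumptions made for this example.

```python
import random


def place_tasks(num_workers, num_tasks, probes, rng):
    """Place tasks one at a time using random probing.

    Hypothetical model: each worker is just a queue length. For every task,
    `probes` distinct workers are sampled uniformly at random and the task
    goes to the sampled worker with the shortest queue.
    """
    queues = [0] * num_workers
    for _ in range(num_tasks):
        candidates = rng.sample(range(num_workers), probes)
        target = min(candidates, key=lambda w: queues[w])
        queues[target] += 1
    return queues


def max_queue_length(num_workers, num_tasks, probes, seed=0):
    """Longest queue after placement; a crude proxy for worst-case wait."""
    rng = random.Random(seed)
    return max(place_tasks(num_workers, num_tasks, probes, rng))


if __name__ == "__main__":
    # probes=1 is purely random placement; probes=2 samples two workers.
    for d in (1, 2):
        print(d, max_queue_length(1000, 1000, d))
```

With probes set to 1 this reduces to purely random placement, and with probes equal to the number of workers it behaves like a scheduler with full knowledge of queue lengths. Small probe counts sit between the two, and no central component has to track every worker.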

Funding Information

Facebook
Amazon Web Services
Ericsson
Microsoft
Defense Advanced Research Projects Agency (FA8750-12-2-0331)
Intel Corporation
Cisco Systems
Huawei Technologies
Oracle
Cloudera
Hortonworks
Samsung
VMware
U.S. Department of Defense
WANdisco
Hertz Foundation
Division of Computing and Communication Foundations (CCF-1139158)
General Electric
NetApp
Yahoo!
Google
SAP America
Clearstory Data
FitWave
Splunk

This publication has 15 references indexed in Scilit:

The tail at scale
Communications of the ACM, 2013
An update on the scalability limits of the Condor batch system
Journal of Physics: Conference Series, 2011
A generalization of multiple choice balls-into-bins
Published by Association for Computing Machinery (ACM), 2011
Dremel
Proceedings of the VLDB Endowment, 2010
Quincy
Published by Association for Computing Machinery (ACM), 2009
The power of two choices in randomized load balancing
IEEE Transactions on Parallel and Distributed Systems, 2001
The Power of Two Random Choices: A Survey of Techniques and Results
Published by Springer Science and Business Media LLC, 2001
How useful is old information?
IEEE Transactions on Parallel and Distributed Systems, 2000
Analysis and simulation of a fair queueing algorithm
Published by Association for Computing Machinery (ACM), 1989
Adaptive load sharing in homogeneous distributed systems
IEEE Transactions on Software Engineering, 1986

Cited by 377 articles