Shard Manager
Open Access
- 26 October 2021
- conference paper
- Published by Association for Computing Machinery (ACM) in Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles CD-ROM
Abstract
Sharding is widely used to scale an application. Despite a decade of effort to build generic sharding frameworks that can be reused across different applications, the extent of their success remains unclear. We attempt to answer a fundamental question: what barriers prevent a sharding framework from getting adopted by the majority of sharded applications? We analyze hundreds of sharded applications at Facebook and identify two major barriers: 1) lack of support for geo-distributed applications, which account for most of Facebook's applications, and 2) inability to maintain application availability during planned events such as software upgrades, which happen ≈1000 times more frequently than unplanned failures. A sharding framework that does not help applications to address these fundamental challenges is not sufficiently attractive for most applications to adopt it. Other adoption barriers include the burden of supporting many complex applications in a one-size-fits-all sharding framework and the difficulty in supporting sophisticated shard-placement requirements. Theoretically, a constraint solver can handle complex placement requirements, but in practice it is not scalable enough to perform near-realtime shard placement at a global scale. We have overcome these adoption barriers in Facebook's sharding framework called Shard Manager. Currently, Shard Manager is used by hundreds of applications running on over one million machines, which account for about 54% of all sharded applications at Facebook.