Surviving switch failures in cloud datacenters
Published: 11 April 2021
ACM SIGCOMM Computer Communication Review , Volume 51, pp 2-9; https://doi.org/10.1145/3464994.3464996
Abstract: Switch failures can hamper access to client services, cause link congestion and blackhole network traffic. In this study, we examine the nature of switch failures in the datacenters of a large commercial cloud provider through the lens of survival theory. We study a cohort of over 180,000 switches with a variety of hardware and software configurations and find that datacenter switches have a 98% likelihood of functioning uninterrupted for over 3 months since deployment in production. However, there is significant heterogeneity in switch survival rates with respect to their hardware and software: the switches of one vendor are twice as likely to fail compared to the others. We attribute the majority of switch failures to hardware impairments and unplanned power losses. We find that the in-house switch operating system, SONiC, boosts the survival likelihood of switches in datacenters by 1% by eliminating switch failures caused by software bugs in vendor switch OSes.
Keywords: survival theory / router failures / datacenter networks
Scifeed alert for new publicationsNever miss any articles matching your research from any publisher
- Get alerts for new papers matching your research
- Find out the new papers from selected authors
- Updated daily for 49'000+ journals and 6000+ publishers
- Define your Scifeed now
Click here to see the statistics on "ACM SIGCOMM Computer Communication Review" .