Understanding and Detecting Software Upgrade Failures in Distributed Systems
Open Access
- 26 October 2021
- conference paper
- Published by Association for Computing Machinery (ACM) in Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles CD-ROM
Abstract
Upgrade is one of the most disruptive yet unavoidable maintenance tasks that undermine the availability of distributed systems. Any failure during an upgrade is catastrophic, as it further extends the service disruption caused by the upgrade. The increasing adoption of continuous deployment further increases the frequency and burden of the upgrade task. In practice, upgrade failures have caused many of today's high-profile cloud outages. Unfortunately, there has been little understanding of their characteristics. This paper presents an in-depth study of 123 real-world upgrade failures that were previously reported by users in 8 widely used distributed systems, shedding light on the severity, root causes, exposing conditions, and fix strategies of upgrade failures. Guided by our study, we have designed a testing framework, DUPTester, that revealed 20 previously unknown upgrade failures in 4 distributed systems, and applied a series of static checkers, DUPChecker, that discovered over 800 cross-version data-format incompatibilities that can lead to upgrade failures. DUPChecker has been requested by HBase developers to be integrated into their toolchain.