Understanding and Detecting Software Upgrade Failures in Distributed Systems

Open Access

Abstract

Upgrades are among the most disruptive yet unavoidable maintenance tasks that undermine the availability of distributed systems. Any failure during an upgrade is catastrophic, as it further extends the service disruption caused by the upgrade. The growing adoption of continuous deployment increases both the frequency and the burden of upgrades. In practice, upgrade failures have caused many of today's high-profile cloud outages. Unfortunately, there has been little understanding of their characteristics. This paper presents an in-depth study of 123 real-world upgrade failures that were previously reported by users in 8 widely used distributed systems, shedding light on the severity, root causes, exposing conditions, and fix strategies of upgrade failures. Guided by our study, we have designed a testing framework, DUPTester, that revealed 20 previously unknown upgrade failures in 4 distributed systems, and applied a series of static checkers, DUPChecker, that discovered over 800 cross-version data-format incompatibilities that can lead to upgrade failures. HBase developers have requested that DUPChecker be integrated into their toolchain.
