Automating the debugging of datacenter applications with ADDA
- 1 June 2013
- conference paper
- Published by Institute of Electrical and Electronics Engineers (IEEE)
Abstract
Debugging data-intensive distributed applications running in datacenters is complex and time-consuming because developers do not have practical ways of deterministically replaying failed executions. Building such tools is hard for two reasons: non-determinism that may be tolerable on a single node is exacerbated in large clusters of interacting nodes, and datacenter applications produce terabytes of intermediate data exchanged by nodes, which makes full input recording infeasible. We present ADDA, a replay-debugging system for datacenters that has lower recording and storage overhead than existing systems. ADDA is based on two techniques. First, ADDA provides control plane determinism, leveraging our observation that many typical datacenter applications consist of a separate “control plane” and “data plane”, and that most bugs reside in the former. Second, ADDA does not record “data plane” inputs; instead, it synthesizes them during replay, starting from the application's external inputs, which are typically persisted in append-only storage for reasons unrelated to debugging. We evaluate ADDA and show that it deterministically replays real-world failures in Hypertable and Memcached.
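The two techniques in the abstract can be illustrated with a toy sketch. This is not ADDA's implementation. It is a minimal Python model under stated assumptions: the data plane is a deterministic function of the external inputs, and the only non-determinism is a control-plane routing decision. All names (`transform`, `run`, `record_run`, `replay_run`) are hypothetical and chosen for illustration only.

```python
import random


def transform(record):
    # Data plane (assumed deterministic): stand-in for bulk data processing.
    return record.upper()


def run(external_inputs, choose_worker, log=None):
    """Route each transformed record to a worker.

    The routing choice is the control-plane decision; the transformed
    record is data-plane content that is recomputed, never logged.
    """
    assignments = []
    for record in external_inputs:
        data = transform(record)      # regenerated from external input
        worker = choose_worker()      # non-deterministic in production
        if log is not None:
            log.append(worker)        # record only the small decision
        assignments.append((worker, data))
    return assignments


def record_run(external_inputs, seed=None):
    # Original execution: decisions come from a random source and are logged.
    rng = random.Random(seed)
    log = []
    result = run(external_inputs, lambda: rng.randrange(3), log)
    return result, log


def replay_run(external_inputs, log):
    # Replay: decisions come from the log; data is synthesized again
    # from the persisted external inputs.
    decisions = iter(log)
    return run(external_inputs, lambda: next(decisions))


if __name__ == "__main__":
    external = ["put k1", "put k2", "get k1"]  # hypothetical persisted inputs
    original, control_log = record_run(external)
    assert replay_run(external, control_log) == original
```

In this sketch the log holds one small integer per input record regardless of how large the transformed data is, which is the storage saving the abstract describes.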