Unified debugging of distributed systems with Recon

1 June 2011

conference paper
Published by Institute of Electrical and Electronics Engineers (IEEE)

p. 85-96
https://doi.org/10.1109/dsn.2011.5958209

Abstract

To scale to today's complex distributed software systems, debugging and replay techniques mostly focus on a single facet of the software, e.g., local concurrency, distributed messaging, or data representation. This forces developers to tediously combine different technologies, such as instruction-level dynamic tracing, event log analysis, and global state reconstruction, to gradually explain non-trivial defects. This paper proposes Recon, a debugging system that provides iterative, interactive, and homogeneous debugging services. Like related systems, Recon promotes SQL-like queries for debugging distributed systems. Unlike other approaches, however, Recon allows all system artifacts, including nodes, communication channels, events, and instructions, to be described uniformly as relations. In addition, an application in Recon initially runs with a lightweight logger that collects only replay logs for individual nodes. Developers then debug the complete program by replaying the execution with fine-grained instrumentation capable of exposing instruction-level information. We illustrate the effectiveness of Recon on programs as diverse as BerkeleyDB, i3/Chord, RandTree, and Pastry. Our evaluation includes executions in local clusters as well as in Amazon EC2 and exposes a previously unreported bug in RandTree.
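The idea of describing system artifacts as relations and debugging them with SQL-like queries can be illustrated with a small sketch. It is not Recon's query language or schema, which the abstract does not specify. The tables `nodes` and `events`, their columns, the host names, and the sample trace are all hypothetical, and SQLite stands in for whatever query engine Recon actually uses. The query finds messages that were sent but never received.

```python
import sqlite3


def build_trace_db():
    """Create an in-memory database holding a tiny, made-up execution trace."""
    db = sqlite3.connect(":memory:")
    # Hypothetical schema: nodes and events are both plain relations.
    db.executescript(
        """
        CREATE TABLE nodes (
            node_id TEXT PRIMARY KEY,
            host    TEXT
        );
        CREATE TABLE events (
            event_id     INTEGER PRIMARY KEY,
            node_id      TEXT REFERENCES nodes(node_id),
            kind         TEXT,      -- 'send' or 'recv'
            msg_id       INTEGER,
            logical_time INTEGER
        );
        """
    )
    db.executemany(
        "INSERT INTO nodes VALUES (?, ?)",
        [("n1", "host-a"), ("n2", "host-b"), ("n3", "host-c")],
    )
    db.executemany(
        "INSERT INTO events (node_id, kind, msg_id, logical_time) "
        "VALUES (?, ?, ?, ?)",
        [
            ("n1", "send", 1, 1),
            ("n2", "recv", 1, 2),
            ("n2", "send", 2, 3),  # message 2 has no matching 'recv'
            ("n1", "send", 3, 4),
            ("n3", "recv", 3, 5),
        ],
    )
    return db


def lost_messages(db):
    """Return (msg_id, sender node, sender host) for sends with no receive."""
    return db.execute(
        """
        SELECT s.msg_id, s.node_id, n.host
        FROM events AS s
        JOIN nodes AS n ON n.node_id = s.node_id
        LEFT JOIN events AS r
               ON r.msg_id = s.msg_id AND r.kind = 'recv'
        WHERE s.kind = 'send' AND r.event_id IS NULL
        ORDER BY s.msg_id
        """
    ).fetchall()


if __name__ == "__main__":
    print(lost_messages(build_trace_db()))
```

Because nodes and events share one relational representation, a single join can connect a messaging anomaly to the node that produced it. Under the same assumption, further relations for channels or instructions could be joined in the same way.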

Cited by 9 articles