PlanAlyzer

Abstract
Online experiments are an integral part of the design and evaluation of software infrastructure at Internet firms. To handle the growing scale and complexity of these experiments, firms have developed software frameworks for their design and deployment. Ensuring that the results of experiments in these frameworks are trustworthy---referred to as internal validity---can be difficult. Currently, verifying internal validity requires manual inspection by someone with substantial expertise in experimental design. We present the first approach for checking the internal validity of online experiments statically, that is, from code alone. We identify well-known problems that arise in experimental design and causal inference, which can take on unusual forms when expressed as computer programs: failures of randomization and treatment assignment, and causal sufficiency errors. Our analyses target PLANOUT, a popular framework that features a domain-specific language (DSL) to specify and run complex experiments. We have built PLANALYZER, a tool that checks PLANOUT programs for threats to internal validity, before automatically generating important data for the statistical analyses of a large class of experimental designs. We demonstrate PLANALYZER'S utility on a corpus of PLANOUT scripts deployed in production at Facebook, and we evaluate its ability to identify threats on a mutated subset of this corpus. PLANALYZER has both precision and recall of 92% on the mutated corpus, and 82% of the contrasts it generates match hand-specified data.

This publication has 13 references indexed in Scilit: