Sign In

Communications of the ACM

Research highlights

PlanAlyzer: Assessing Threats to the Validity of Online Experiments

dog paws and human hands at laptop computer keyboard

Credit: Getty Images

Online experiments are an integral part of the design and evaluation of software infrastructure at Internet firms. To handle the growing scale and complexity of these experiments, firms have developed software frameworks for their design and deployment. Ensuring that the results of experiments in these frameworks are trustworthy—referred to as internal validity—can be difficult. Currently, verifying internal validity requires manual inspection by someone with substantial expertise in experimental design.

We present the first approach for checking the internal validity of online experiments statically, that is, from code alone. We identify well-known problems that arise in experimental design and causal inference, which can take on unusual forms when expressed as computer programs: failures of randomization and treatment assignment, and causal sufficiency errors. Our analyses target PLANOUT, a popular framework that features a domain-specific language (DSL) to specify and run complex experiments. We have built PLANALYZER, a tool that checks PLANOUT programs for threats to internal validity, before automatically generating important data for the statistical analyses of a large class of experimental designs. We demonstrate PLANALYZER'S utility on a corpus of PLANOUT scripts deployed in production at Facebook, and we evaluate its ability to identify threats on a mutated subset of this corpus. PLANALYZER has both precision and recall of 92% on the mutated corpus, and 82% of the contrasts it generates match hand-specified data.

Back to Top

1. Introduction

Many organizations conduct online experiments to assist decision-making.3,13,21,22 These organizations often develop software components that make designing experiments easier, or that automatically monitor experimental results. Such systems may integrate with existing infrastructure that perform such tasks as recording metrics of interest or specializing software configurations according to features of users, devices, or other experimental subjects. One popular example is Facebook's PLANOUT: a domain-specific language for experimental design.2

A script written in PLANOUT is a procedure for assigning a treatment (e.g., a piece of software under test) to a unit (e.g., users or devices whose behavior—or outcomes—is being assessed). Treatments could be anything from software-defined bit rates for data transmission to the layout of a Web page. Outcomes are typically metrics of interest to the firm, which may include click-through rates, time spent on a page, or the proportion of videos watched to completion. Critically, treatments and outcomes must be recorded in order to estimate the effect of treatment on an outcome. By abstracting over the details of how units are assigned treatment, PLANOUT has the potential to lower the barrier to entry for those without a background in experimental design to try their hand at experimentation-driven development.


No entries found

Log in to Read the Full Article

Sign In

Sign in using your ACM Web Account username and password to access premium content if you are an ACM member, Communications subscriber or Digital Library subscriber.

Need Access?

Please select one of the options below for access to premium content and features.

Create a Web Account

If you are already an ACM member, Communications subscriber, or Digital Library subscriber, please set up a web account to access premium content on this site.

Join the ACM

Become a member to take full advantage of ACM's outstanding computing information resources, networking opportunities, and other benefits.

Subscribe to Communications of the ACM Magazine

Get full access to 50+ years of CACM content and receive the print version of the magazine monthly.

Purchase the Article

Non-members can purchase this article or a copy of the magazine in which it appears.