
PlanAlyzer: Assessing Threats to the Validity of Online Experiments

By Emma Tosch, Eytan Bakshy, Emery D. Berger, David D. Jensen, and J. Eliot B. Moss

Posted Sep 1 2021

Abstract
1. Introduction
2. Language Characteristics
3. Validation of Statistical Conclusions
4. Validation of Experimental Designs
5. PlanAlyzer Static Analysis Tool
6. PlanOut Corpora
7. Evaluation
8. Conclusion
Acknowledgments
References
Authors
Footnotes


Online experiments are an integral part of the design and evaluation of software infrastructure at Internet firms. To handle the growing scale and complexity of these experiments, firms have developed software frameworks for their design and deployment. Ensuring that the results of experiments in these frameworks are trustworthy—referred to as internal validity—can be difficult. Currently, verifying internal validity requires manual inspection by someone with substantial expertise in experimental design.

We present the first approach for checking the internal validity of online experiments statically, that is, from code alone. We identify well-known problems that arise in experimental design and causal inference, which can take on unusual forms when expressed as computer programs: failures of randomization and treatment assignment, and causal sufficiency errors. Our analyses target PLANOUT, a popular framework that features a domain-specific language (DSL) to specify and run complex experiments. We have built PLANALYZER, a tool that checks PLANOUT programs for threats to internal validity, before automatically generating important data for the statistical analyses of a large class of experimental designs. We demonstrate PLANALYZER’S utility on a corpus of PLANOUT scripts deployed in production at Facebook, and we evaluate its ability to identify threats on a mutated subset of this corpus. PLANALYZER has both precision and recall of 92% on the mutated corpus, and 82% of the contrasts it generates match hand-specified data.

1. Introduction

Many organizations conduct online experiments to assist decision-making.^3,13,21,22 These organizations often develop software components that make designing experiments easier, or that automatically monitor experimental results. Such systems may integrate with existing infrastructure that performs such tasks as recording metrics of interest or specializing software configurations according to features of users, devices, or other experimental subjects. One popular example is Facebook’s PLANOUT: a domain-specific language for experimental design.²

A script written in PLANOUT is a procedure for assigning a treatment (e.g., a piece of software under test) to a unit (e.g., users or devices whose behavior—or outcomes—is being assessed). Treatments could be anything from software-defined bit rates for data transmission to the layout of a Web page. Outcomes are typically metrics of interest to the firm, which may include click-through rates, time spent on a page, or the proportion of videos watched to completion. Critically, treatments and outcomes must be recorded in order to estimate the effect of treatment on an outcome. By abstracting over the details of how units are assigned treatment, PLANOUT has the potential to lower the barrier to entry for those without a background in experimental design to try their hand at experimentation-driven development.

Unfortunately, the state of the art for validating experimental designs (i.e., the procedure for conducting an experiment, here encoded as a PLANOUT program) is a manual human review. The most common experimental design on the Web is the A/B test, which entails a fairly simple analysis to estimate the treatment effect. However, more complex experiments may require more sophisticated analyses, and in general there is a many-to-many relationship between design and analyses. Many experiments written in a domain-specific language (DSL) such as PLANOUT can be cumbersome to validate manually, and they cannot be analyzed using existing automated methods. This is because experiments expressed as programs can have errors that are unique to the intersection of experimentation and software.

We present the first tool, PLANALYZER, for statically identifying the sources of statistical bias in programmatically defined experiments. Additionally, PLANALYZER automatically generates contrasts and conditioning sets for a large class of experimental designs: between-subjects designs that can be analyzed using average treatment effect (ATE) or conditional average treatment effect (CATE). Because ATE is a special case of CATE, we refer to them collectively as (C)ATE when the distinction between the two is not necessary. We make the following contributions:

Software for the static analysis of experiments. PLANALYZER produces three key pieces of information: (1) a list of the variables in the environment that are actually being randomly assigned; (2) the variables that are recorded for analysis; and (3) the variables that may be legitimately compared when computing causal effects. These three pieces of information are required in order to determine whether there are any valid statistical analyses of the recorded results of an experiment, and, when possible, what those analyses are.

Characterizing errors and bad practices unique to programmatically defined experiments. Traditional errors in offline experimentation can take on unusual forms in programmatically defined experiments. Additionally, some coding practices can lead to faults during downstream statistical analysis, highlighting the potential utility of defining “code smells” for bad practices in experiments.⁸ We introduce errors and code smells that arise from the intersection of experiments and software.

Empirical analysis of real experiments. We report PLANALYZER’S performance on a corpus of real-world PLANOUT scripts from Facebook. Due to the vetting process at Facebook, few errors exist naturally in the corpus. Therefore, we perform mutation analysis to approximate a real-world distribution of errors. We also consider the set of author-generated contrasts (the set of variable values that are allowed to be compared, necessary for estimating causal effects) for each script. We demonstrate PLANALYZER’S effectiveness in finding major threats to validity and in automatically generating contrasts.

2. Language Characteristics

As a DSL built by domain experts, PLANOUT implements only functionality relevant to experimentation. Consequently, PLANOUT is not Turing complete: it lacks loops, recursion, and function definition. It has two control flow constructs (if/else and return) and a small core of built-in functions (e.g., weightedChoice, bernoulliTrial, and length).

Although not required for an experimentation language, PLANOUT also allows for runtime binding of external function calls and variables. This allows for easy integration with existing (but more constrained) systems for experimentation, data recording, and configuration. We expect PLANOUT scripts to be run inside another execution environment, such as a Web browser, and have access to the calling context in order to bind free variables and functions.

PLANOUT abstracts over the sampling mechanism, providing an interface that randomly selects from pre-populated partitions of unit identifiers, corresponding to samples from the population of interest. The PLANOUT framework provides a mechanism for extracting the application parameters manipulated by a PLANOUT script and hashes them, along with the current experiment name, to one or more samples. The mapping avoids clashes between concurrently running experiments, which is one of the primary challenges of online experimentation.^{12, 13} Readers interested in the specifics of PLANOUT’S hashing method for scaling concurrent experiments can refer to an earlier paper²; it is not relevant to PLANALYZER’S analyses.

On its surface, PLANOUT may appear to share features with probabilistic programming languages (PPLs).^{11, 16} PPLs completely describe the data generating process; by contrast, PLANOUT programs specify only one part of the data generating process—how to randomly assign treatments—and this code is used to control aspects of a product or service that is the focus of experimentation.

There are two critical features of PLANOUT that differentiate it from related DSLs, such as PPLs: (1) the requirement that all random functions have an explicit unit of randomization, and (2) built-in control of data recording via the truth value of PLANOUT’S return. Only named variables on paths that terminate in return true are recorded. This is similar to the discarded executions in the implementation of conditional probabilities in PPLs. A major semantic difference between PLANOUT and PPLs is that we expect PLANOUT to have deterministic execution for an input. Variability in PLANOUT arises from the population of inputs; variability in PPLs comes from the execution of the program itself.

3. Validation of Statistical Conclusions

Statistical conclusions of a randomized experiment typically estimate the effect of a treatment T on an outcome Y for some population of units. The function that estimates the causal effect of T on Y may take many forms. Nearly all such functions can be distilled into estimating the true difference between an outcome under one treatment and its potential outcome(s) under another treatment.

In the case of a randomized experiment, if T is assigned completely at random, for example, according to:

T = uniformChoice(choices=[400, 750], unit=userid);

then the causal effect of T (the average treatment effect (ATE)) can be estimated by simply taking the difference of the average outcome for units assigned to T = 400 and T = 750: Avg(Y | T = 400) – Avg(Y | T = 750). Such an experiment could be useful for learning how some outcome Y (e.g., video watch time) differs for equivalent individuals experiencing videos at the 400 or 750kbps setting.
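As a minimal sketch of this estimator, the Python snippet below computes the difference in mean outcomes between the two bit-rate settings. The function name `average_treatment_effect` and the records are hypothetical and made up purely for illustration; they are not data from any real experiment.

```python
from statistics import mean

def average_treatment_effect(records, treated, control):
    """Difference in mean outcome: Avg(Y | T = treated) - Avg(Y | T = control)."""
    y_treated = [y for t, y in records if t == treated]
    y_control = [y for t, y in records if t == control]
    if not y_treated or not y_control:
        raise ValueError("both treatment values need at least one recorded unit")
    return mean(y_treated) - mean(y_control)

# Hypothetical (T, Y) pairs: bit rate in kbps and minutes of video watched.
records = [(400, 10.0), (400, 12.0), (750, 15.0), (750, 17.0)]
ate = average_treatment_effect(records, treated=400, control=750)
```

The estimate is only meaningful if T was in fact assigned at random and both T and Y were recorded, which is what the checks described later are meant to establish.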

It is not uncommon to use different probabilities of treatment for different kinds of users; we refer to the partition of users as a subgroup S. We can still estimate causal effects, but must instead compute the difference in means separately for different values of the variables in S. This is often referred to as subgroup analysis. This estimand is known as the conditional average treatment effect (CATE). The variables that define the subgroup are referred to as the conditioning set and can be thought of as a constraint on the units that can be compared for any given contrast. Average effect estimators like (C)ATE over finite sets of treatments can be expressed in terms of their valid contrasts: knowing the assignment probabilities of T = 400 versus T = 750 is sufficient to describe how to compute the treatment effect.
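A subgroup analysis can be sketched in the same style. In the Python snippet below, the function name `conditional_ate`, the subgroup labels, and the records are all hypothetical; the sketch only illustrates that a contrast is computed separately within each value of the conditioning set.

```python
from collections import defaultdict
from statistics import mean

def conditional_ate(records, treated, control):
    """Return {subgroup: Avg(Y | T=treated, S) - Avg(Y | T=control, S)}."""
    by_group = defaultdict(lambda: defaultdict(list))
    for subgroup, t, y in records:
        by_group[subgroup][t].append(y)
    effects = {}
    for subgroup, arms in by_group.items():
        # A contrast is only valid inside a subgroup where both arms occur.
        if arms[treated] and arms[control]:
            effects[subgroup] = mean(arms[treated]) - mean(arms[control])
    return effects

# Hypothetical (S, T, Y) triples; the "MX" subgroup never receives T = 750.
records = [
    ("US", 400, 10.0), ("US", 750, 14.0),
    ("CA", 400, 8.0), ("CA", 750, 9.0),
    ("MX", 400, 5.0),
]
cate = conditional_ate(records, treated=400, control=750)
```

Subgroups in which only one treatment value occurs are left out of the result, because no contrast between the two values can be formed there.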

Typically, experts must manually verify that the estimators comport with the experimental design. There are some exceptions: some systems automatically monitor very simple experiments like A/B tests, where the treatment is a single variable that takes on one of two values and the estimand is ATE.

As a DSL, PLANOUT provides a mechanism for more complex experimental designs. Control-flow operators, calls to external services, and in-language mechanisms for data recording prohibit simple automatic variable monitoring. For example, an experiment that sets variables differently on the basis of the current country of the user cannot naïvely aggregate results across all the participants in the experiment. Such an experiment would require additional adjustment during post-experiment analysis, because a user’s current country is a confounder (i.e., a variable that causes both the treatment and outcome). PLANALYZER automatically produces the appropriate analyses, including the contrasts and conditioning sets.

4. Validation of Experimental Designs

Shadish et al.¹⁹ enumerate a taxonomy of nine well-understood design errors for experimentation, referred to as threats to internal validity—that is, the degree to which valid causal conclusions can be drawn within the context of the study. Seven of these errors can be avoided when the researcher employs a randomized experiment that behaves as expected. The two remaining threats to validity that are not obviated by randomization are attrition and testing. Attrition may not have a meaningful definition in the context of online experiments, especially when outcomes are measured shortly after treatment exposure. Testing in experimental design refers to taking an initial measurement and then using the test instrument to conduct an experiment. Analysis may not be able to differentiate between the effect that a test was designed to measure and the effect of subjects learning the test itself. Testing is a form of within-subjects analysis that is not typically employed in online field experiments and whose analyses are outside the scope of this work. Therefore, failed randomized assignment is the primary threat to internal validity that we consider. Randomization failures in programs manifest differently from randomization failures in the physical world: for example, a program cannot disobey an experimental protocol, but data flow can break randomization if a probability is erroneously set to zero.

We characterize the ways in which syntactically valid PLANOUT programs can fail to randomize treatment assignment. Note that because there is currently no underlying formalism for the correctness of online field experiments that maps cleanly to a programming language context, we cannot define a soundness theorem for programmatically defined experiments. Some of the threats described here would be more properly considered code smells, rather than outright errors.⁸

4.1. Randomization failures

There are three ways a PLANOUT program may contain a failure of randomization: when it records data along a path that is not randomized, when the units of randomization have low cardinality, and when it encounters path-induced determinism. PLANALYZER detects all three automatically.

Recording data along nonrandomized paths occurs when there exists at least one recorded path through the program that is randomized and at least one recorded path through the program that is not randomized:

if (inExperiment(userid=userid)) {
  T = bernoulliTrial(p=0.5, unit=userid);
} else {
  T = true;
}
return true;

Such programs can typically be fixed by adding a return false for the appropriate path(s).
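The condition can be sketched as a check over per-path summaries. The Python snippet below is an illustration only, not PLANALYZER's implementation: the dictionary keys `randomized` and `recorded` and the function name are hypothetical, and each path is assumed to have been summarized already.

```python
def records_nonrandomized_path(paths):
    """True if the recorded paths mix randomized and nonrandomized assignment."""
    recorded = [p for p in paths if p["recorded"]]
    has_random = any(p["randomized"] for p in recorded)
    has_fixed = any(not p["randomized"] for p in recorded)
    return has_random and has_fixed

# The script above: both branches fall through to `return true`.
buggy = [
    {"guard": "inExperiment", "randomized": True, "recorded": True},
    {"guard": "not inExperiment", "randomized": False, "recorded": True},
]
# Same script with `return false` added to the else branch.
fixed = [
    {"guard": "inExperiment", "randomized": True, "recorded": True},
    {"guard": "not inExperiment", "randomized": False, "recorded": False},
]
```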

Units of randomization, such as userid or deviceid, must have significantly higher cardinality than experimental treatments to ensure that each treatment is assigned a sufficient number of experimental units to make valid statistical inferences about the population. If the unit is an external variable unfamiliar to PLANALYZER, it will assume that the variable has low cardinality. PLANALYZER allows user-defined annotations to make its analyses more precise. Therefore, PlanOut users can correct their programs by either annotating the unit of randomization as having high cardinality, or reassessing their choice of unit.

Data-flow failures of randomization occur when inappropriate computations flow into units. PLANOUT allows units to be the result of arbitrary computations. For example, one PLANOUT script in our evaluation corpus sets the unit of randomization to be userid * 2. A PLANOUT user might want to do this when rerunning an experiment, to ensure that at least some users are assigned to a new treatment. However, this feature can lead to deterministic assignment when used improperly. The following is a syntactically valid PLANOUT program that triggers an error in PLANALYZER:

T1 = uniformChoice(choices=[400, 900], unit=userid);
T2 = bernoulliTrial(p=0.3, unit=T1);

When writing this code, the researcher may believe that there are four possible assignments for the pair of variables. However, because the assignment of input units to a particular value is the result of a deterministic hashing function, every user who is assigned T1=400 is assigned the same value of T2 because the input to the hash function for bernoulliTrial is always 400. Therefore, they will never record both (400, true) and (400, false) in the data, which likely contradicts the programmer’s intent.
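The effect is easy to reproduce with any deterministic hash. The Python sketch below uses SHA-1 over a string built from the experiment name, parameter name, and unit as a stand-in; this construction and the helper names are assumptions for illustration and should not be read as PLANOUT's actual hashing scheme, which is described in the earlier paper.

```python
import hashlib

def hashed_fraction(experiment, param, unit):
    """Map (experiment, parameter, unit) deterministically to a number in [0, 1)."""
    digest = hashlib.sha1(f"{experiment}.{param}.{unit}".encode()).hexdigest()
    # 13 hex digits = 52 bits, so the division below is exact and below 1.
    return int(digest[:13], 16) / 16**13

def uniform_choice(choices, experiment, param, unit):
    return choices[int(hashed_fraction(experiment, param, unit) * len(choices))]

def bernoulli_trial(p, experiment, param, unit):
    return hashed_fraction(experiment, param, unit) < p

pairs = set()
for userid in range(1000):
    t1 = uniform_choice([400, 900], "exp", "T1", userid)
    t2 = bernoulli_trial(0.3, "exp", "T2", t1)  # unit is T1, not userid
    pairs.add((t1, t2))
```

However many users are simulated, `pairs` can hold at most one entry per value of T1, because the only input that varies in the second hash is T1 itself.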

4.2. Treatment assignment failures

PLANALYZER requires that all assigned treatments along a path have the possibility of being assigned to at least one unit and that at least some treatments may be compared. There are three ways a PLANOUT program may contain a failure of treatment assignment: when some treatment has a zero probability of being assigned, when there are fewer than two treatments that may be compared along a path, and when dead code blocks contain treatment assignment.

Detecting the latter two cases is a standard task in static program analysis. We note that for the first case, syntactically correct PLANOUT code permits authors to set probabilities or weights to zero, either directly or as the result of evaluation. Detecting this kind of value-dependent behavior is not unusual in program analysis either, but the reason why we wish to avoid it may not be obvious: to establish a causal relationship between variables, there must be at least two alternative treatments under comparison.
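For choices and weights that are already known constants, the first two conditions reduce to simple checks. The Python sketch below is hypothetical (the function names and message strings are not PLANALYZER's) and assumes constant propagation has already resolved the arguments.

```python
def assignable_choices(choices, weights=None):
    """Return the choices that have a nonzero chance of being assigned."""
    if weights is None:
        weights = [1] * len(choices)  # uniform choice
    if len(weights) != len(choices):
        raise ValueError("choices and weights must have the same length")
    return [c for c, w in zip(choices, weights) if w > 0]

def treatment_assignment_problems(choices, weights=None):
    """List the treatment assignment failures visible from constants alone."""
    live = assignable_choices(choices, weights)
    problems = []
    if len(live) < len(choices):
        problems.append("zero-probability treatment")
    if len(set(live)) < 2:
        problems.append("fewer than two comparable treatments")
    return problems
```

For example, `treatment_assignment_problems([4, 7], [1, 0])` reports both problems: one treatment can never be assigned, and what remains leaves nothing to compare.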

4.3. Causal sufficiency errors

One of the main assumptions underlying causal reasoning is causal sufficiency, or the assumption that there are no unmeasured confounders in the estimate of treatment effect. Barring runtime failures, we have a complete picture of the assignment mechanism in PLANOUT programs. Unfortunately, a PLANOUT program may allow an unrecorded variable to bias treatment assignment.

Consider a program that assigns treatment on the basis of user country, accessed via a getUserCountry function:

if (getUserCountry(userid=userid) == 'US') {
  T = uniformChoice(choices=[7, 9], unit=userid);
} else {
  T = uniformChoice(choices=[4, 7, 9], unit=userid);
}

Treatment assignment of T depends on user country, so the user country is a potential confounder. Because this variable does not appear in the input program text, it cannot be recorded by the PLANOUT framework’s data recording system. Therefore, the program and resulting analyses will violate the causal sufficiency assumption.

If PLANALYZER encounters a static error or threat, it reports that the script failed to pass validation and gives a reason to the user. Some of the fixes are easy to determine from the error and could be applied automatically. We leave this to future work. Other errors require a more sophisticated understanding of the experiment the script represents and can only be determined by the script’s author.

5. PlanAlyzer Static Analysis Tool

PLANALYZER is a command-line tool written in OCaml that performs two main tasks: it checks whether the input script represents a randomized experiment, and it generates all valid contrasts and their associated conditioning sets for scripts that can be analyzed using (C)ATE. Figure 1 provides an overview of the PLANALYZER system.

Figure 1. The PlanAlyzer system transforms input PlanOut programs, possibly with user-provided variable labels, into a normalized form before translating the program to the intermediate representation (IR). PlanAlyzer then builds a data dependence graph (DDG) and uses it to generate the resulting estimators. At each step in the analyses, PlanAlyzer may produce errors. When there is insufficient information to produce estimators, but the input program has no known threats to validity, PlanAlyzer provides as much partial output as possible.

5.1. PlanOut intermediate representation (IR)

Upon parsing, PLANALYZER performs several routine program transformations, including converting variables to an identification scheme similar to SSA, performing constant propagation, and rewriting functions and relations (such as equality) in A-normal form.^1,4,15,18 Because it may not be possible to reason about the final values of a variable defined in a PLANOUT program due to the presence of external function calls, PLANALYZER reasons about intermediate values instead and reports results over a partially evaluated program.¹⁰

After these routine transformations, PLANALYZER splits the program into straight line code via tail duplication such that every path through the program may be evaluated in isolation of the others. Although this transformation is exponential in the number of conditional branches, in practice the branching factor of PLANOUT programs is quite small.
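The splitting step can be sketched on a toy statement representation. In the Python snippet below, the tuple encoding of statements and the function name `straight_line_paths` are assumptions made for illustration; PLANALYZER itself is written in OCaml and works over its own IR. Guards are turned into assertions at the head of each path, in the spirit of the next step of the analysis.

```python
def straight_line_paths(stmts):
    """Enumerate every path through `stmts` as a flat list of statements."""
    if not stmts:
        return [[]]
    head, tail = stmts[0], stmts[1:]
    if head[0] == "return":
        return [[head]]  # nothing after a return executes
    if head[0] == "if":
        _, guard, then_block, else_block = head
        # Tail duplication: copy the remaining statements onto both branches.
        taken = straight_line_paths(then_block + tail)
        not_taken = straight_line_paths(else_block + tail)
        return ([[("assert", guard)] + p for p in taken] +
                [[("assert", ("not", guard))] + p for p in not_taken])
    return [[head] + p for p in straight_line_paths(tail)]

program = [
    ("if", "inExperiment(userid)",
     [("assign", "T", "bernoulliTrial(p=0.5, unit=userid)")],
     [("assign", "T", "true")]),
    ("return", True),
]
paths = straight_line_paths(program)
```

Each additional sequential conditional doubles the number of paths, which is the exponential growth noted above.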

PLANALYZER then converts guards into assertions and uses the Z3 SMT solver to ensure variables assigned along paths are consistent with these assertions.⁵ For each assertion, PLANALYZER queries Z3 twice—first to obtain a satisfying solution, and then to test whether this solution is unique. Evaluation of the intermediate representation may contain unevaluated code, so if there is more than one solution, PLANALYZER keeps the code chunk abstract.

PLANALYZER uses SSA and A-normal form because they aid in contrast generation: a single execution of a PLANOUT program corresponds to the assignment of a unit to a treatment. However, additional intermediate variables can have somewhat ambiguous semantics when attempting to model a programmatically defined experiment causally; although they aid in, for example, the detection of causal sufficiency errors, they make reasoning about causal inference using methods such as causal graphical models quite difficult.

5.2. Variable labels for causal inference

The PLANOUT language contains only some of the necessary features for reasoning about the validity of experiments. Given only programs written in PLANOUT, PLANALYZER may not be able to reason about some common threats to internal validity. The interaction between random operators and control flow can cause variables to lose either their randomness or their variation. Furthermore, we need some way of guaranteeing that external operators do not introduce confounding.

To express this missing information, we introduce a 4-tuple of variable labels (rand, card, tv, corry) that PLANALYZER attempts to infer and propagate for each PLANOUT program it encounters.^17,6 Unsurprisingly, inference may be overly conservative for programs with many external functions or variables. To increase the scope of experiments PLANALYZER can analyze, users may supply PLANALYZER with global and local configuration files that specify labels.

Randomness (rand). PLANOUT may be used with existing experimentation systems; this means that there may already be sources of randomness available and familiar to users. Furthermore, as PLANOUT was designed to be extensible, users may freely add new random operators.

Cardinality (card). The size of variables’ domains (cardinality) impacts an experiment’s validity. Simple pseudorandom assignment requires high cardinality units of randomization to properly balance the assignment of units into conditions.

Time Variance (tv). For the duration of a particular experiment, a given variable may be constant or time-varying. Clearly, some variables are always constant or always time-varying. For example, date-of-birth is constant, whereas days-since-last-login is time-varying. However, there are many variables that cannot be globally categorized as either constant or time-varying. The tv label allows experimenters to specify whether they expect a variable to be constant or time-varying over the duration of a given experiment.

Because (C)ATE assumes subjects receive only one treatment value for the duration of the experiment, PLANALYZER cannot use (C)ATE to estimate the causal effect of treatments or conditioning set variables that are labeled as time-varying. A PLANOUT program may contain other valid contrasts assigned randomly, and independently from the time-varying contrasts; PLANALYZER will still identify these treatments and their conditioning sets as eligible for being analyzed via (C)ATE.

Covariates and Confounders (corry). Many experiments use features of the unit to assign treatment, which may introduce confounding. PLANALYZER automatically marks external variables and the direct results of nonrandom external calls as correlated with outcome (i.e., Y). This signals that, if such a variable is used for treatment assignment, either its value must be recorded or sufficient downstream data must be recorded to recover its value.

5.3. Data dependence graph (DDG)

PLANALYZER builds a DDG to propagate variable label information.⁷ Because PLANOUT only has a single, global scope, its data dependence analysis is straightforward:

Assignment induces a directed edge from the references on the right-hand side to the variable name.
Sequential assignment of var_i and var_i+1 induces no dependencies between var_i and var_i+1, unless the r-value of var_i+1 includes a reference to var_i.
For an if-statement, PLANALYZER adds an edge from each of the references in the guard to all assignments in the branches.
In the case of an early return, PLANALYZER adds edges from the variables in dependent guards to all variables defined after the return.
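The first three rules above can be sketched as a small recursive walk. The Python snippet below uses a hypothetical tuple encoding in which each assignment lists the variables it references; it skips SSA renaming and omits the early-return rule, so it is an illustration under those assumptions rather than PLANALYZER's algorithm.

```python
def build_ddg(stmts, guard_refs=(), edges=None):
    """Collect (source, target) data dependence edges for a list of statements."""
    if edges is None:
        edges = set()
    for stmt in stmts:
        if stmt[0] == "assign":
            _, target, refs = stmt
            # Edges from right-hand-side references and from enclosing guards.
            for source in list(refs) + list(guard_refs):
                edges.add((source, target))
        elif stmt[0] == "if":
            _, refs, then_block, else_block = stmt
            inner = tuple(guard_refs) + tuple(refs)
            build_ddg(then_block, inner, edges)
            build_ddg(else_block, inner, edges)
    return edges

# country = getUserCountry(userid); T is assigned in both branches on country.
program = [
    ("assign", "country", ["userid"]),
    ("if", ["country"],
     [("assign", "T", ["userid"])],
     [("assign", "T", ["userid"])]),
]
ddg = build_ddg(program)
```

The edge from `country` to `T` is the one that matters for causal sufficiency: it shows that treatment assignment depends on a variable other than the unit of randomization.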

Random, independent assignment implies independence between potential causes, so long as the (possibly empty) conditioning set has been identified and recorded. PLANALYZER computes the DDG for the full script and uses the full DDG to determine when it is possible to marginalize over some variables.

Propagating variable labels. PLANALYZER marks variables directly assigned by built-in random functions or external random functions as random. The randomness label takes a tuple of identifiers as its argument. This tuple denotes the unit(s) of randomization, used for reasoning about causal estimators. Any node with a random ancestor is marked as random (with the exception of variables that do not vary), with units of randomization corresponding to the union of the ancestors’ units.

If a random operator uses a low-cardinality unit of randomization, it will be marked as nonrandom. Note, however, that if the unit of randomization for a random function is a tuple with at least one high cardinality variable, then the resulting variable will remain random.

PLANALYZER propagates time-varying labels in the same manner as random labels. Unlike randomness, there is no interaction between the time-varying label and any other labels.
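Propagation of the randomness label can be sketched as a fixpoint over DDG edges. In the Python snippet below, the function name, argument names, and example variables are hypothetical, and the exception for variables that do not vary is omitted for brevity.

```python
def propagate_random(edges, direct_units, high_cardinality):
    """Map each random variable to the union of its units of randomization.

    edges: (source, target) pairs from the DDG.
    direct_units: variable -> tuple of units passed to its random operator.
    high_cardinality: set of variables labeled as high cardinality.
    """
    units = {}
    for var, unit_tuple in direct_units.items():
        # A random operator stays random only if some unit has high cardinality.
        if any(u in high_cardinality for u in unit_tuple):
            units[var] = set(unit_tuple)
    changed = True
    while changed:
        changed = False
        for source, target in edges:
            if source in units and not units[source] <= units.get(target, set()):
                units.setdefault(target, set()).update(units[source])
                changed = True
    return units

edges = {("userid", "T"), ("country", "C"), ("T", "button_size")}
direct_units = {"T": ("userid",), "C": ("country",), "D": ("country", "userid")}
labels = propagate_random(edges, direct_units, high_cardinality={"userid"})
```

Here `C` receives no randomness label because its only unit has low cardinality, while `D` keeps one because its unit tuple includes `userid`. The same loop could propagate a time-varying label by seeding it with the time-varying variables instead.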

Converting DDGs to causal graphical models. Readers familiar with graphical models may wonder whether the DDG can be transformed into a directed graphical model. Programmatically defined experiments have two features that, depending on context, make such a transformation either totally inappropriate or difficult to extract: (1) deterministic dependence and (2) conditional branching. These two features can induce what is known as “context-sensitive independence,” which limits the effectiveness of existing algorithms that would otherwise make graphical models an appealing target semantics. Although some work has sought to remedy branching, treatment of context-sensitive independence in graphical models more broadly is an open research problem.¹⁴ Furthermore, from a practical perspective, it is unclear how the versioned variables in the DDG ought to be unified, and some variables simply do not belong to a CGM (e.g., userid).

6. PlanOut Corpora

We analyze a corpus of PlanOut scripts from Facebook to evaluate PlanAlyzer. We also make use of a corpus of manually specified contrasts that were used in the analysis of the deployed experimentation scripts. Scripts do not contain any user data, but may contain deidentified IDs (such as those of employees testing the scripts). Each experiment may have a temporary (but syntactically valid) representation captured by a snapshotting system, leading to multiple versions of a single experiment. Although we do not have access to the custom analyses of more complex experiments (e.g., database queries, R code, etc.), we can infer some characteristics of the intended analysis by partitioning the corpus into three subcorpora. Although we analyzed all three, we focus on just one here:

PlanOut-A. This corpus contains scripts that were analyzed using some form of ATE, where the variables T₁, …, T_n were manually specified and automatically recorded for the duration of the experiment. Users may manually specify that a subset of the recorded variables be continuously monitored for pairwise ATE. Neither the recording nor the data analysis tools have any knowledge of PLANOUT. This is the main corpus we will use for evaluating PLANALYZER, because the goal of PLANALYZER is to automate analyses that firms such as Facebook must now do manually.

Note that users of PlanOut at Facebook are typically either experts in the domain of the hypotheses being tested or they are analysts working directly with domain experts.

6.1. Characterizing representative PLANOUT programs

We designed PLANALYZER’S analyses on the basis of the universe of syntactically valid PLANOUT programs and our domain knowledge of experimentation. We built PLANALYZER from the perspective that (1) PLANOUT is the primary means by which experimenters design and deploy experiments, but (2) they can use other systems, if they exist. Facebook uses many experimentation systems and has a variety of human and code-review methods for the functionality that PLANALYZER provides. Therefore, we wanted to know: what are some characteristics of PLANOUT programs that people actually write and deploy?

We found that engineers and data scientists at Facebook used PLANOUT in a variety of surprising ways and had coding habits that were perhaps indicative of heterogeneity in the programming experience of authors. Through conversations with engineers at Facebook, we have come to understand that most PLANOUT authors can be described along the two axes depicted in Table 1.

Table 1. Experience matrix for PlanOut authors.

Table 2 enumerates the errors raised by PLANALYZER over the three corpora. A warning does not necessarily indicate an error during deployment or analysis, because there are preexisting mechanisms and idiosyncratic usages of PLANOUT.

Table 2. The counts of code smells, static script errors, and tool failures found when running PlanAlyzer on the PlanOut-A corpus. A PlanAlyzer error does not necessarily indicate the experiment was run in error.

PLANOUT-A contains our highest quality data: all scripts were vetted by experts before deployment, with some component analyzed using ATE. Figure 2 provides a lightly anonymized example program that PLANALYZER identified as having a potential error. Its style and structure are a good representation of real-world PLANOUT programs.

Figure 2. A representative, lightly edited and anonymized experiment written in PlanOut. This script mixes testing code with experimentation code. Lines 5-12 set values for the author of the script whose userid is AUTHOR_ID and records those values. The actual experiment is in lines 14-26. It is only conducted on the population defined by the external predicate and the user being recorded in (represented here when the userid is 0). PlanAlyzer raises an error for this script.

We found the following coding practices in PLANOUT-A:

Ambiguous Semantics and Type Errors. Because PLANALYZER must initially perform type inference, it found 87 scripts in PLANOUT-A that had type errors, which suggests there might be some utility in providing our type checking facility to users of PLANOUT.

We also found three scripts from one experiment that applied the modulus operator to a fraction; because PLANOUT uses the semantics of its enclosing environment for numeric computation, this script will return different values if it is run using languages with different semantics for modulus, such as PHP versus JavaScript.

Modifying deployment settings within experimentation logic. Some of the scripts marked as not experiments begin with return false and have an unreachable, fully specified experiment below the return statement. PLANALYZER flags dead code in PLANOUT programs, because it can be the result of a randomly assigned variable causing unintended downstream control flow behavior. However, every dead code example we found had the form condition = false; if (condition) … These features occurred exclusively in experiments that had multiple scripts associated with them that did not raise these errors. After discussing our findings with engineers at Facebook, we believe that this might be a case of PLANOUT authors modifying the experiment while it is running to control deployment, rather than leaving dead code in by accident, as it appears from PLANALYZER’S perspective.

Using PlanOut for Application Configuration. One of the most surprising characteristics we found in PLANOUT-A was the prevalence of using PLANOUT for application configuration, à la Akamai’s ACMS system or Facebook’s Gatekeeper.^{20, 21} When these scripts set variables, but properly turned off data recording (i.e., returned false), PLANALYZER marked them as not being experiments. When they did not turn off logging, they were marked as recording paths without randomization. Some instances of application configuration involved setting the support of a randomly assigned variable to a constant or setting a weight to zero. Because experiments require variation for comparison, PLANALYZER raises an error if the user attempts to randomly select from a set of fewer than two choices. Three scripts contained expressions of the form uniformChoice(choices=[v], unit=userid) for some constant value v.

As a result, engineers who aim to use PLANOUT as a configuration system have no need for PLANALYZER, but anyone writing experiments would consider these scripts buggy.
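The variation check described above can be sketched in a few lines of Python. This is an illustration only, not PLANALYZER's implementation: the helper names effective_support and has_variation are hypothetical, and treating duplicate values and zero-weight entries as contributing no variation is an assumption drawn from the configuration patterns just described.

```python
def effective_support(choices, weights=None):
    """Return the distinct choices that can actually be selected.

    Assumption: a choice with weight zero can never be drawn, and
    repeated values count once.
    """
    if weights is None:
        weights = [1] * len(choices)
    if len(weights) != len(choices):
        raise ValueError("choices and weights must have the same length")
    support = []
    for choice, weight in zip(choices, weights):
        if weight > 0 and choice not in support:
            support.append(choice)
    return support


def has_variation(choices, weights=None):
    """An experiment needs at least two reachable values to compare."""
    return len(effective_support(choices, weights)) >= 2


if __name__ == "__main__":
    print(has_variation(["v"]))               # single constant choice
    print(has_variation(["a", "b"], [1, 0]))  # second choice has weight zero
    print(has_variation(["a", "b"]))          # a genuine two-arm assignment
```

Under these assumptions, both a single-element choice list and a weighted choice whose only positive weight falls on one value are reported as lacking variation.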

Mixing external calls to other systems. Almost 20% of the scripts (106) include calls to external experimentation systems. In a small number of cases, PLANOUT is used exclusively for managing these other systems, with no calls to its built-in random operators.

Non-read-only units. One of the other firms we spoke to that uses PLANOUT treats units of randomization as read-only, unlike other variables in PLANOUT programs. Facebook does not do this. Therefore, programs that modify the unit of randomization may be valid; for example, the aforementioned case where the unit was set to userid * 2. We also observed a case where the unit was set to be the result of an external call; without knowing the behavior of this external call, it is assumed to be low cardinality. In this case, the experiment was performing cluster random assignment, which is not covered by ATE and is out of scope for PLANALYZER.

6.2. PlanAlyzer performance

All analyses were run on a MacBook Air with a 1.6 GHz Intel Core i5 processor having four logical cores. The longest runtime for any analysis was approximately three minutes; runtime scales linearly with the number of “paths” through the program, where a path is defined according to the transformed internal representation of the input PLANOUT program and is related to the number of conditioning sets. PLANALYZER uses the Z3 SMT solver⁵ to ensure conditioning sets are satisfied and to generate treatments,^{23, 9} so both the number of variables in the program and the number of paths in the internal representation could cause a blowup in runtime. We found that runtime increases linearly with the number of internal paths, but possibly exponentially with the number of variables, as depicted in Figure 3.

Figure 3. Wall-clock timing data for the PlanOut corpus. Plots in column (a) depict the empirical CDF of all scripts on a log scale. Plots in columns (b) and (c) show the relationship between the running time and features of the PlanOut script we might expect to affect running time, on a log scale on both axes. Plots in column (b) show both the number of variables in the input PlanOut script and the number of variables in the transformed, intermediate representation of the PlanOut program. Plots in column (c) depict the relationship between the number of paths through PlanOut programs and their running time. The times depicted in both (b) and (c) are averages over scripts satisfying the x-axis value, and the size of each point is proportional to the number of scripts used to compute that average. We chose this representation, rather than reporting error bars, because the data are not i.i.d.

PLANALYZER produces meaningful contrasts that are comparable with the human-specified gold standard, automatically generating 82% of our eligible gold-standard contrasts. PLANALYZER runs in a reasonably short amount of time, likely due to PLANOUT’S generally small program sizes.

Summary. We did not expect to see any real causal sufficiency errors due to the expert nature of the authors of PLANOUT-A. Rather, we expected to see some false positives, because PLANALYZER is aggressive about flagging potential causal sufficiency errors. We made this design choice because the cost of unrecorded confounders can be very high.

PLANOUT scripts in deployment at Facebook represent a range of experimental designs. We observed factorial designs, conditional assignment, within-subjects experiments, cluster random assignment, and bandit experiments in the scripts we examined.

7. Evaluation

Real-world PLANOUT scripts unsurprisingly contained few errors, because they were primarily written and overseen by experts in experimental design. Therefore, to test how well PLANALYZER finds errors, we selected a subset of fifty scripts from PLANOUT-A and mutated them. We then validated a subset of the contrasts PLANALYZER produced against a corpus of hand-selected contrasts monitored and compared by an automated tool used at Facebook. Finally, we reported on PLANALYZER’S performance, because its effectiveness requires accurately identifying meaningful contrasts within a reasonable amount of time.

7.1. Mutation methodology

We first identified scripts that were eligible for this analysis. We modified the PLANOUT-A scripts that raised errors when it was appropriate to do so. For example, we updated a number of the scripts that erroneously raised causal sufficiency errors so that they would not raise those errors anymore. We excluded scripts that, for example, contained testing code or configuration code. This allowed us to be reasonably certain that most of the input scripts were correct.

All of our mutations operate over input PLANOUT programs, rather than the intermediate representation. We believed this approach would better stress PLANALYZER. We perform one mutation per script.

We considered two approaches when deciding how to perform the mutations:

(1) Randomly select a mutation type, and then randomly select from the eligible AST points for that mutation.
(2) Generate all of the eligible AST points for all of the mutations, and then randomly select from this set.

Method 1 leads to an even split between the classes of mutations in the test corpus; method 2 leads to frequencies that are proportional to the frequencies of the eligible AST nodes. We chose the latter because we believed it would lead to a more accurate representation of real programming errors.
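The difference between the two sampling schemes can be made concrete with a short Python sketch. The mapping from mutation types to eligible AST points, the placeholder type names, and the function names are all hypothetical; the sketch only illustrates why the first method balances mutation classes while the second tracks how often each kind of node occurs.

```python
import random
from collections import Counter


def sample_method_1(eligible, rng):
    """Pick a mutation type uniformly, then a point eligible for that type."""
    types = [t for t, points in eligible.items() if points]
    mutation_type = rng.choice(types)
    return mutation_type, rng.choice(eligible[mutation_type])


def sample_method_2(eligible, rng):
    """Pool every (type, point) pair and pick uniformly from the pool."""
    pool = [(t, p) for t, points in eligible.items() for p in points]
    return rng.choice(pool)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical script: nine points eligible for type_a, one for type_b.
    eligible = {"type_a": list(range(9)), "type_b": [0]}
    draws = 10_000
    by_type_1 = Counter(sample_method_1(eligible, rng)[0] for _ in range(draws))
    by_type_2 = Counter(sample_method_2(eligible, rng)[0] for _ in range(draws))
    print("method 1:", {t: n / draws for t, n in by_type_1.items()})
    print("method 2:", {t: n / draws for t, n in by_type_2.items()})
```

With nine eligible points for one type and one for the other, method 1 selects each type about half the time, whereas method 2 selects the common type about 90% of the time.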

To select the subset of scripts to evaluate, we sampled 50 experiments and then selected a random script version from each experiment. We then manually inspected the mutated script and compared the output of the mutation with the original output.

Findings: fault identification over mutated scripts. When analyzing our sample of 50 mutated scripts, PLANALYZER produced only one false positive and only one false negative. The precision and recall were both 92%. On the one hand, this is very surprising, given both the false positive rate in the PLANOUT-A corpus for causal sufficiency errors (CSE) at 8%, and the proportion of CSE mutations in this sample (28%). However, we found that most of the CSE mutations caused the program to exit before random assignment, causing PLANALYZER to raise legitimate errors about recorded paths with no randomization. The rest were true causal sufficiency errors (i.e., they would cause bias in treatment). The one false negative we observed occurred in a script that redefined the treatment variable for two userids, in what appears to be testing code. The mutation wrapped the redefined treatment, so this is a case where PLANALYZER should have raised a “no randomization error” in both the input script and the mutated script.
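As a worked check on the reported figures, the following Python sketch computes precision and recall from confusion counts and lists the true-positive counts that are consistent with one false positive, one false negative, and a rounded value of 92%. The true-positive count is not stated in the text, so the search is an illustration of the arithmetic and not a reconstruction of the actual tally; the function names are hypothetical.

```python
def precision(tp, fp):
    """Fraction of raised errors that were real."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of real errors that were raised."""
    return tp / (tp + fn)


def consistent_true_positive_counts(fp, fn, reported_pct, max_tp=50):
    """True-positive counts whose rounded precision and recall match the report."""
    return [
        tp
        for tp in range(1, max_tp + 1)
        if round(100 * precision(tp, fp)) == reported_pct
        and round(100 * recall(tp, fn)) == reported_pct
    ]


if __name__ == "__main__":
    print(consistent_true_positive_counts(fp=1, fn=1, reported_pct=92))  # [11, 12]
```

Because the false-positive and false-negative counts are equal, precision and recall coincide, which matches the single figure reported for both.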

7.2. Validation against human-generated contrasts

We decided whether an experiment should be in the subset according to the following three criteria: (1) all variables in the human-generated contrasts appeared in the original script; (2) PLANALYZER was able to produce at least one contrast for the experiment; and (3) PLANALYZER produced identical contrasts across all versions of the experiment. Criteria (1) and (2) ensure that analysis does not require knowledge unavailable to PLANALYZER. Criterion (3) is necessary because the tool that monitors contrasts logs them per experiment, not per version. If the possible contrasts change between versions, we cannot be sure which version corresponded to the data. Ninety-five of the unique experiments met these criteria.
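The three criteria can be read as a simple filter over experiments. The Python sketch below is one possible encoding under stated assumptions: the Version and Experiment records, their field names, and the choice to treat the first version as the "original script" are all hypothetical and do not describe the data model of PLANALYZER or of the monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    script_variables: frozenset  # variable names appearing in this script version
    contrasts: frozenset         # contrasts generated for this version (hashable)


@dataclass(frozen=True)
class Experiment:
    human_contrast_variables: frozenset  # variables used in hand-specified contrasts
    versions: tuple                      # Version records, oldest first (assumed)


def is_eligible(experiment):
    """Apply criteria (1) to (3) to a single experiment."""
    if not experiment.versions:
        return False
    original = experiment.versions[0]  # assumption: first version is the original
    # (1) Every variable in the human contrasts appears in the original script.
    criterion_1 = experiment.human_contrast_variables <= original.script_variables
    # (2) At least one contrast was generated.
    criterion_2 = len(original.contrasts) > 0
    # (3) All versions yield identical contrasts.
    criterion_3 = all(v.contrasts == original.contrasts for v in experiment.versions)
    return criterion_1 and criterion_2 and criterion_3
```

Filtering a collection then reduces to [e for e in experiments if is_eligible(e)].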

Findings: contrast generation. PLANALYZER found equivalent contrasts for 78 of the 95 experiments. For 14 experiments, it produced either partial contrasts or no contrasts. In each of these cases, the desired contrast required summing over some of the variables in the program (marginalization), or more sophisticated static analysis than the tool currently supports. Because it is computationally expensive to produce every possible subset of marginalized contrasts, we consider the former to be an acceptable shortcoming of the tool. Finally, three experiments had issues with their human-generated contrasts (no contrasts, or ambiguous or unparsable data).
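To see why enumerating marginalized contrasts is expensive, consider a minimal Python sketch that lists every set of variables that could be summed out. It assumes that a candidate is any non-empty proper subset of the program's variables (summing out nothing is the unmarginalized case, and summing out everything leaves nothing to contrast); that framing and the function names are hypothetical and are not taken from PLANALYZER.

```python
from itertools import combinations


def marginalization_subsets(variables):
    """Yield every non-empty proper subset of variables to sum out."""
    names = sorted(variables)
    for size in range(1, len(names)):
        for subset in combinations(names, size):
            yield subset


def count_marginalization_subsets(n):
    """Closed form for the number of subsets yielded for n variables."""
    return max(2 ** n - 2, 0)


if __name__ == "__main__":
    print(list(marginalization_subsets({"a", "b", "c"})))
    for n in (3, 10, 20):
        print(n, count_marginalization_subsets(n))
```

The count doubles with each additional variable, so even modest scripts would produce far more candidate contrasts than an analyst could review.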

8. Conclusion

The state of the art for auditing experiments and for generating their associated statistical analyses is almost entirely a manual process. This is the first work that analyzes field experiments statically. We propose a new class of errors and threats unique to the programmatic specification of experimentation. We have implemented a tool that, for the most common class of experiments, automatically identifies threats and generates statistical analyses. We compare the output of PLANALYZER against human-generated analyses of real PLANOUT scripts and find that PLANALYZER produces comparable results.

Acknowledgments

This research was in part conducted while Emma Tosch was an employee of Facebook. While at the University of Massachusetts Amherst, Emma Tosch was supported in part by the United States Air Force under Contract No. FA8750-17-C-0120. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of Facebook or the United States Air Force.


Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from permissions@acm.org or fax (212) 869-0481.

DOI

10.1145/3474385

September 2021 Issue

Published: September 1, 2021

Vol. 64 No. 9

Pages: 108-116
