Abstracting the Geniuses Away from Failure Testing

Ordinary users need tools that automate the selection of custom-tailored faults to inject.

Posted Jan 1 2018

The heterogeneity, complexity, and scale of cloud applications make verification of their fault tolerance properties challenging. Companies are moving away from formal methods and toward large-scale testing in which components are deliberately compromised to identify weaknesses in the software. For example, techniques such as Jepsen apply fault-injection testing to distributed data stores, and Chaos Engineering performs fault injection experiments on production systems, often on live traffic. Both approaches have captured the attention of industry and academia alike.

Unfortunately, the search space of distinct fault combinations that an infrastructure can test is intractable. Existing failure-testing solutions require skilled and intelligent users who can supply the faults to inject. These superusers, known as Chaos Engineers and Jepsen experts, must study the systems under test, observe system executions, and then formulate hypotheses about which faults are most likely to expose real system-design flaws. This approach is fundamentally unscalable and unprincipled. It relies on the superuser’s ability to interpret how a distributed system employs redundancy to mask or ameliorate faults and, moreover, the ability to recognize the insufficiencies in those redundancies—in other words, human genius.

This article presents a call to arms for the distributed systems research community to improve the state of the art in fault tolerance testing. Ordinary users need tools that automate the selection of custom-tailored faults to inject. We conjecture that the process by which superusers select experiments—observing executions, constructing models of system redundancy, and identifying weaknesses in the models—can be effectively modeled in software. The article describes a prototype validating this conjecture, presents early results from the lab and the field, and identifies new research directions that can make this vision a reality.

The Future Is Disorder

Providing an “always-on” experience for users and customers means that distributed software must be fault tolerant: it must be written to anticipate, detect, and either mask or gracefully handle the effects of fault events such as hardware failures and network partitions. Writing fault-tolerant software remains extremely difficult, whether for distributed data management systems involving the interaction of a handful of physical machines or for Web applications involving the cooperation of tens of thousands. While the state of the art in verification and program analysis continues to evolve in the academic world, the industry is moving in the opposite direction: away from formal methods (with some noteworthy exceptions⁴¹) and toward approaches that combine testing with fault injection.

Here, we describe the underlying causes of this trend, why it has been successful so far, and why it is doomed to fail in its current practice.

The Old Gods. The ancient myth: Leave it to the experts. Once upon a time, distributed systems researchers and practitioners were confident that the responsibility for addressing the problem of fault tolerance could be relegated to a small priesthood of experts. Protocols for failure detection, recovery, reliable communication, consensus, and replication could be implemented once and hidden away in libraries, ready for use by the layfolk.

This has been a reasonable dream. After all, abstraction is the best tool for overcoming complexity in computer science, and composing reliable systems from unreliable components is fundamental to classical system design.³³ Reliability techniques such as process pairs¹⁸ and RAID⁴⁵ demonstrate that partial failure can, in certain cases, be handled at the lowest levels of a system and successfully masked from applications.

Unfortunately, these approaches rely on failure detection. Perfect failure detectors are impossible to implement in a distributed system,⁹,¹⁵ in which it is impossible to distinguish between delay and failure. Attempts to mask the fundamental uncertainty arising from partial failure in a distributed system, such as RPC (remote procedure calls⁸) and NFS (network file system⁴⁹), have famously met with difficulties. Despite the broad consensus that these attempts are failed abstractions,²⁸ in the absence of better abstractions, people continue to rely on them, to the consternation of developers, operators, and users.

In a distributed system (that is, a system of loosely coupled components interacting via messages), the failure of a component is only ever manifested as the absence of a message. The only way to detect the absence of a message is via a timeout, an ambiguous signal that means either that the message will never come or that it merely has not come yet. Timeouts are an end-to-end concern²⁸,⁴⁸ that must ultimately be managed by the application. Hence, partial failures in distributed systems bubble up the stack and frustrate any attempts at abstraction.
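
The ambiguity of a timeout can be seen in a minimal Python sketch. The helper `call_with_timeout` and the two simulated callees are hypothetical names introduced only for illustration; they do not come from any system discussed in this article. From the caller's side, a slow callee and a crashed callee produce the same observation.

```python
import queue
import threading
import time

def call_with_timeout(callee, timeout_s):
    """Run callee in a thread; return its reply, or None if no reply arrives in time."""
    replies = queue.Queue()

    def run():
        result = callee()
        if result is not None:          # a crashed callee never sends a message
            replies.put(result)

    threading.Thread(target=run, daemon=True).start()
    try:
        return replies.get(timeout=timeout_s)
    except queue.Empty:
        return None  # ambiguous: the reply is late, or it will never come

def slow_callee():
    time.sleep(0.3)   # alive, but slower than the caller's patience
    return "ok"

def crashed_callee():
    return None       # simulated crash: no message is ever sent

observed_slow = call_with_timeout(slow_callee, timeout_s=0.05)
observed_crashed = call_with_timeout(crashed_callee, timeout_s=0.05)
```

Both observations are the same "no reply" signal, so choosing the timeout (and what to do when it fires) is left to the application.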

The Old Guard. The modern myth: Formally verified distributed components. If we cannot rely on geniuses to hide the specter of partial failure, the next best hope is to face it head on, armed with tools. Until quite recently, many of us (academics in particular) looked to formal methods such as model checking¹⁶,²⁰,²⁹,³⁹,⁴⁰,⁵³,⁵⁴ to assist “mere mortal” programmers in writing distributed code that upholds its guarantees despite pervasive uncertainty in distributed executions. It is not reasonable to exhaustively search the state space of large-scale systems (one cannot, for example, model check Netflix), but the hope is that modularity and composition (the next best tools for conquering complexity) can be brought to bear. If individual distributed components could be formally verified and combined into systems in a way that preserved their guarantees, then global fault tolerance could be obtained via composition of local fault tolerance.

Unfortunately, this, too, is a pipe dream. Most model checkers require a formal specification; most real-world systems have none (or have not had one since the design phase, many versions ago). Software model checkers and other program-analysis tools require the source code of the system under study. The accessibility of source code is also an increasingly tenuous assumption. Many of the data stores targeted by tools such as Jepsen are closed source; large-scale architectures, while typically built from open source components, are increasingly polyglot (written in a wide variety of languages).

Finally, even if you assume that specifications or source code are available, techniques such as model checking are not a viable strategy for ensuring that applications are fault tolerant because, as mentioned in the context of timeouts, fault tolerance itself is an end-to-end property that does not necessarily hold under composition. Even if you are lucky enough to build a system out of individually verified components, it does not follow that the system is fault tolerant: you may have made a critical error in the glue that binds them.

The Vanguard. The emerging ethos: YOLO. Modern distributed systems are simply too large, too heterogeneous, and too dynamic for these classic approaches to software quality to take root. In reaction, practitioners increasingly rely on resiliency techniques based on testing and fault injection.⁶,¹⁴,¹⁹,²³,²⁷,³⁵ These “black box” approaches (which perturb and observe the complete system, rather than its components) are (arguably) better suited for testing an end-to-end property such as fault tolerance. Instead of deriving guarantees from understanding how a system works on the inside, testers of the system observe its behavior from the outside, building confidence that it functions correctly under stress.

Two giants have recently emerged in this space: Chaos Engineering⁶ and Jepsen testing.²⁴ Chaos Engineering, the practice of actively perturbing production systems to increase overall site resiliency, was pioneered by Netflix,⁶ but since then LinkedIn,⁵² Microsoft,³⁸ Uber,⁴⁷ and PagerDuty⁵ have developed Chaos-based infrastructures. Jepsen performs black box testing and fault injection on unmodified distributed data management systems, in search of correctness violations (for example, counterexamples that show an execution was not linearizable).

Both approaches are pragmatic and empirical. Each builds an understanding of how a system operates under faults by running the system and observing its behavior. Both approaches offer a pay-as-you-go method to resiliency: the initial cost of integration is low, and the more experiments that are performed, the higher the confidence that the system under test is robust. Because these approaches represent a straightforward enrichment of existing best practices in testing with well-understood fault injection techniques, they are easy to adopt. Finally, and perhaps most importantly, both approaches have been shown to be effective at identifying bugs.

Unfortunately, both techniques also have a fatal flaw: they are manual processes that require an extremely sophisticated operator. Chaos Engineers are a highly specialized subclass of site reliability engineers. To devise a custom fault injection strategy, a Chaos Engineer typically meets with different service teams to build an understanding of the idiosyncrasies of various components and their interactions. The Chaos Engineer then targets those services and interactions that seem likely to have latent fault tolerance weaknesses. Not only is this approach difficult to scale, since it must be repeated for every new composition of services, but its critical currency (a mental model of the system under study) is hidden away in a person’s brain. These points are reminiscent of a bigger (and more worrying) trend in industry toward reliability priesthoods,⁷ complete with icons (dashboards) and rituals (playbooks).

Jepsen is in principle a framework that anyone can use, but to the best of our knowledge all of the reported bugs discovered by Jepsen to date were discovered by its inventor, Kyle Kingsbury, who currently operates a “distributed systems safety research” consultancy.²⁴ Applying Jepsen to a storage system requires that the superuser carefully read the system documentation, generate workloads, and observe the externally visible behaviors of the system under test. It is then up to the operator to choose, from the massive combinatorial space of “nemeses” (including machine crashes and network partitions), those fault schedules that are likely to drive the system into returning incorrect responses.

A human in the loop is the kiss of death for systems that need to keep up with software evolution. Human attention should always be targeted at tasks that computers cannot do! Moreover, the specialists that Chaos and Jepsen testing require are expensive and rare. Here, we show how geniuses can be abstracted away from the process of failure testing.

We Don’t Need Another Hero

Rapidly changing assumptions about our visibility into distributed system internals have made obsolete many if not all of the classic approaches to software quality, while emerging “chaos-based” approaches are fragile and unscalable because of their genius-in-the-loop requirement.

We present our vision of automated failure testing by looking at how the same changing environments that hastened the demise of time-tested resiliency techniques can enable new ones. We argue the best way to automate the experts out of the failure-testing loop is to imitate their best practices in software and show how the emergence of sophisticated observability infrastructure makes this possible.

The order is rapidly fadin’. For large-scale distributed systems, the three fundamental assumptions of traditional approaches to software quality are quickly fading in the rearview mirror. The first to go was the belief that you could rely on experts to solve the hardest problems in the domain. Second was the assumption that a formal specification of the system is available. Finally, any program analysis (broadly defined) that requires that source code is available must be taken off the table. The erosion of these assumptions helps explain the move away from classic academic approaches to resiliency in favor of the black box approaches described earlier.

What hope is there of understanding the behavior of complex systems in this new reality? Luckily, the fact that it is more difficult than ever to understand distributed systems from the inside has led to the rapid evolution of tools that allow us to understand them from the outside. Call-graph logging was first described by Google;⁵¹ similar systems are in use at Twitter,⁴ Netflix,¹ and Uber,⁵⁰ and the technique has since been standardized.⁴³ It is reasonable to assume that a modern microservice-based Internet enterprise will already have instrumented its systems to collect call-graph traces. A number of startups that focus on observability have recently emerged.²¹,³⁴ Meanwhile, provenance collection techniques for data processing systems¹¹,²²,⁴² are becoming mature, as are operating system-level provenance tools.⁴⁴ Recent work¹²,⁵⁵ has attempted to infer causal and communication structure of distributed computations from raw logs, bringing high-level explanations of outcomes within reach even for uninstrumented systems.

Regarding testing distributed systems. Chaos Monkey, like they mention, is awesome, and I also highly recommend getting Kyle to run Jepsen tests.

—Commentator on HackerRumor

Away from the experts. While this quote is anecdotal, it is difficult to imagine a better example of the fundamental unscalability of the current state of the art. A single person cannot possibly keep pace with the explosion of distributed system implementations. If we can take the human out of this critical loop, we must; if we cannot, we should probably throw in the towel.

The first step to understanding how to automate any process is to comprehend the human component that we would like to abstract away. How do Chaos Engineers and Jepsen superusers apply their unique genius in practice? Here is the three-step recipe common to both approaches.

Step 1: Observe the system in action. The human element of the Chaos and Jepsen processes begins with principled observation, broadly defined.

A Chaos Engineer will, after studying the external API of services relevant to a given class of interactions, meet with the engineering teams to better understand the details of the implementations of the individual services.²⁵ To understand the high-level interactions among services, the engineer will then peruse call-graph traces in a trace repository.³

A Jepsen superuser typically begins by reviewing the product documentation, both to determine the guarantees that the system should uphold and to learn something about the mechanisms by which it does so. From there, the superuser builds a model of the behavior of the system based on interaction with the system’s external API. Since the systems under study are typically data management and storage, these interactions involve generating histories of reads and writes.³¹

The first step to understanding what can go wrong in a distributed system is watching things go right: observing the system in the common case.

Step 2: Build a mental model of how the system tolerates faults. The common next step in both approaches is the most subtle and subjective. Once there is a mental model of how a distributed system behaves (at least in the common case), how is it used to help choose the appropriate faults to inject? At this point we are forced to dabble in conjecture: bear with us.

Fault tolerance is redundancy. Given some fixed set of faults, we say that a system is “fault tolerant” exactly if it operates correctly in all executions in which those faults occur. What does it mean to “operate correctly”? Correctness is a system-specific notion, but, broadly speaking, is expressed in terms of properties that are either maintained throughout the system’s execution (for example, system invariants or safety properties) or established during execution (for example, liveness properties). Most distributed systems with which we interact, though their executions may be unbounded, nevertheless provide finite, bounded interactions that have outcomes. For example, a broadcast protocol may run “forever” in a reactive system, but each broadcast delivered to all group members constitutes a successful execution.

By viewing distributed systems in this way, we can revise the definition: A system is fault tolerant if it provides sufficient mechanisms to achieve its successful outcomes despite the given class of faults.
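
The revised definition can be written compactly. The notation is an assumption introduced here for illustration and does not appear in the original: let $S_1, \dots, S_k$ be the alternative mechanisms (each a set of components or events) by which the system can reach a successful outcome, and let $F$ be the set of components or events affected by faults.

```latex
\mathrm{tolerates}(F) \iff \exists\, i \in \{1, \dots, k\} : S_i \cap F = \emptyset
```

Under this reading, the system tolerates $F$ exactly when at least one mechanism is left untouched by the faults.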

Step 3: Formulate experiments that target weaknesses in the façade. If we could understand all of the ways in which a system can obtain its good outcomes, we could understand which faults it can tolerate (or which faults it could be sensitive to). We assert that (whether they realize it or not!) the process by which Chaos Engineers and Jepsen superusers determine, on a system-by-system basis, which faults to inject uses precisely this kind of reasoning. A target experiment should exercise a combination of faults that knocks out all of the supports for an expected outcome.
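
This reasoning can be sketched in Python under stated assumptions: each known support for an outcome is modeled as a set of component names, and a target experiment is a minimal set of faults that intersects every support. The function `candidate_experiments` and the component names are hypothetical and chosen only for illustration; the brute-force enumeration is a stand-in for whatever search a real tool would use.

```python
from itertools import combinations

def candidate_experiments(supports, max_faults):
    """Return minimal fault sets that intersect every known support of an outcome."""
    components = sorted(set().union(*supports))
    found = []
    for size in range(1, max_faults + 1):
        for combo in combinations(components, size):
            faults = frozenset(combo)
            hits_all = all(faults & support for support in supports)
            # Skip supersets of a smaller fault set that already works.
            is_minimal = not any(smaller <= faults for smaller in found)
            if hits_all and is_minimal:
                found.append(faults)
    return found

# Hypothetical redundancy model: a request succeeds via the primary path
# (gateway -> service -> db_primary) or via a fallback (gateway -> service -> cache).
supports = [
    frozenset({"gateway", "service", "db_primary"}),
    frozenset({"gateway", "service", "cache"}),
]
experiments = candidate_experiments(supports, max_faults=2)
```

In this toy model the single points of failure (`gateway`, `service`) appear as one-fault experiments, while the redundant storage paths require the two-fault experiment that disables both `db_primary` and `cache`.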

Carrying out the experiments turns out to be the easy part. Fault injection infrastructure, much like observability infrastructure, has evolved rapidly in recent years. In contrast to random, coarse-grained approaches to distributed fault injection such as Chaos Monkey,²³ approaches such as FIT (failure injection testing)¹⁷ and Gremlin³² allow faults to be injected at the granularity of individual requests with high precision.

Step 4: Profit! This process can be effectively automated. The emergence of sophisticated tracing tools described earlier makes it easier than ever to build redundancy models even from the executions of black box systems. The rapid evolution of fault injection infrastructure makes it easier than ever to test fault hypotheses on large-scale systems. Figure 1 illustrates how the automation described here fits neatly between existing observability infrastructure and fault injection infrastructure, consuming the former, maintaining a model of system redundancy, and using it to parameterize the latter. Explanations of system outcomes and fault injection infrastructures are already available. In the current state of the art, the puzzle piece that fits them together (models of redundancy) is a manual process. LDFI (as we will explain) shows that automation of this component is possible.

Figure 1. Our vision of automated failure testing.

A Blast from the Past

In previous work, we introduced a bug-finding tool called LDFI (lineage-driven fault injection).² LDFI uses data provenance collected during simulations of distributed executions to build derivation graphs for system outcomes. These graphs function much like the models of system redundancy described earlier. LDFI then converts the derivation graphs into a Boolean formula whose satisfying assignments correspond to combinations of faults that invalidate all derivations of the outcome. An experiment targeting those faults will then either expose a bug (that is, the expected outcome fails to occur) or reveal additional derivations (for example, after a timeout, the system fails over to a backup) that can be used to enrich the model and constrain future solutions.
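
The loop described above can be illustrated with a toy Python sketch. This is not the LDFI implementation: the system under test (`run_system`) is a hypothetical stand-in, derivations are flattened to sets of event names rather than graphs, and a brute-force search replaces the Boolean formula and solver.

```python
from itertools import combinations

def smallest_hitting_set(derivations, max_faults):
    """Brute-force stand-in for a solver: smallest fault set hitting every derivation."""
    events = sorted(set().union(*derivations))
    for size in range(1, max_faults + 1):
        for combo in combinations(events, size):
            if all(set(combo) & d for d in derivations):
                return frozenset(combo)
    return None

def run_system(faults):
    """Hypothetical system under test: returns the derivation it used, or None on failure."""
    paths = [
        frozenset({"send_to_B"}),   # primary replication path
        frozenset({"send_to_C"}),   # backup path, used only after a timeout
    ]
    for path in paths:
        if not (path & faults):
            return path
    return None

def ldfi_loop(max_faults):
    """Alternate between proposing fault hypotheses and enriching the model."""
    derivations = [run_system(frozenset())]      # start from a fault-free execution
    while True:
        hypothesis = smallest_hitting_set(derivations, max_faults)
        if hypothesis is None:
            return None, derivations             # no experiment within the fault budget
        observed = run_system(hypothesis)
        if observed is None:
            return hypothesis, derivations       # expected outcome failed: candidate bug
        derivations.append(observed)             # new derivation revealed: enrich the model
```

With a budget of one fault, the first experiment (dropping `send_to_B`) reveals the backup derivation and the search then ends without a counterexample; with a budget of two, the enriched model yields an experiment that defeats both derivations.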

At its heart, LDFI reapplies well-understood techniques from data management systems, treating fault tolerance as a materialized view maintenance problem.²,¹³ It models a distributed system as a query, its expected outcomes as query outcomes, and critical facts such as “replica A is up at time t” and “there is connectivity between nodes X and Y during the interval i … j” as base facts. It can then ask a how-to query:³⁷ What changes to base data will cause changes to the derived data in the view? The answers to this query are the faults that could, according to the current model, invalidate the expected outcomes.
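
A small Python sketch may make the view-maintenance framing concrete. Everything in it is an illustrative assumption: the base facts, the two hard-coded rules in `view`, and the brute-force `how_to_remove` search are hypothetical and much simpler than a real query engine.

```python
from itertools import combinations

# Base facts: replicas that are up and links that are connected.
BASE_FACTS = frozenset({
    ("up", "A"), ("up", "B"),
    ("link", "client", "A"), ("link", "client", "B"),
})

def view(base):
    """The 'query': derive stored(X) for reachable live replicas, then the outcome durable."""
    derived = set()
    for replica in ("A", "B"):
        if ("up", replica) in base and ("link", "client", replica) in base:
            derived.add(("stored", replica))
    if any(fact[0] == "stored" for fact in derived):
        derived.add(("durable",))
    return derived

def how_to_remove(outcome, base, max_deletions):
    """Which minimal deletions of base facts make the outcome disappear from the view?"""
    answers = []
    for size in range(1, max_deletions + 1):
        for deleted in combinations(sorted(base), size):
            deleted = frozenset(deleted)
            if any(prev <= deleted for prev in answers):
                continue                      # a smaller answer already covers this one
            if outcome not in view(base - deleted):
                answers.append(deleted)
    return answers

faults = how_to_remove(("durable",), BASE_FACTS, max_deletions=2)
```

Here no single deletion removes `durable` from the view, and each answer of size two pairs one fact about replica A with one fact about replica B, which is exactly a fault combination worth injecting.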

The idea seems far-fetched, but the LDFI approach shows a great deal of promise. The initial prototype demonstrated the efficacy of the approach at the level of protocols, identifying bugs in replication, broadcast, and commit protocols.²,⁴⁶ Notably, LDFI reproduced a bug in the replication protocol used by the Kafka distributed log²⁶ that was first (manually) identified by Kingsbury.³⁰ A later iteration of LDFI is deployed at Netflix,¹ where (much like the illustration in Figure 1) it was implemented as a microservice that consumes traces from a call-graph repository service and provides inputs for a fault injection service. Since its deployment, LDFI has identified 11 critical bugs in user-facing applications at Netflix.¹

Rumors from the Future

The prior research presented earlier is only the tip of the iceberg. Much work still needs to be undertaken to realize the vision of fully automated failure testing for distributed systems. Here, we highlight nascent research that shows promise and identify new directions that will help realize our vision.

Don’t overthink fault injection. In the context of resiliency testing for distributed systems, attempting to enumerate and faithfully simulate every possible kind of fault is a tempting but distracting path. The problem of understanding all the causes of faults is not directly relevant to the target, which is to ensure that code (along with its configuration) intended to detect and mitigate faults performs as expected.

Consider Figure 2: The diagram on the left shows a microservice-based architecture; arrows represent calls generated by a client request. The right-hand side zooms in on a pair of interacting services. The shaded box in the caller service represents the fault tolerance logic that is intended to detect and handle faults of the callee. Failure testing targets bugs in this logic, while the injected faults affect the callee.

Figure 2. Fault injection and fault-tolerant code.

From the perspective of the caller, all faults manifest in one of three ways: explicit error returns, corrupted responses, and (possibly infinite) delay. Of these manifestations, the first two can be adequately tested with unit tests. The last is difficult to test, leading to branches of code that are infrequently executed. If we inject only delay, and only at component boundaries, we conjecture that we can address the majority of bugs related to fault tolerance.
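
As a minimal sketch of delay injection at a component boundary, the Python example below wraps a hypothetical callee and exercises the caller's timeout branch. The service names, the fallback value, and the `injected_delay_s` parameter are assumptions made for illustration; they are not the API of any fault injection framework named in this article.

```python
import concurrent.futures
import time

def callee(delay_s=0.0):
    """Hypothetical downstream service; delay_s is the injected fault."""
    time.sleep(delay_s)
    return "fresh recommendations"

def caller(injected_delay_s, timeout_s=0.1):
    """Caller-side fault tolerance logic: on timeout, fall back to a default."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(callee, injected_delay_s)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return "default recommendations"   # the rarely executed branch under test
    finally:
        pool.shutdown(wait=False)          # do not block the caller on the slow callee

normal = caller(injected_delay_s=0.0)
degraded = caller(injected_delay_s=0.5)
```

Only the delay is injected, and only at the call boundary, yet it is enough to drive execution into the fallback branch that ordinary tests seldom reach.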

Explanations everywhere. If we can provide better explanations of system outcomes, we can build better models of redundancy. Unfortunately, a barrier to entry for systems such as LDFI is the unwillingness of software developers and operators to instrument their systems for tracing or provenance collection. Fortunately, operating system-level provenance-collection techniques are mature and can be applied to uninstrumented systems.

Moreover, the container revolution makes simulating distributed executions of black box software within a single hypervisor easier than ever. We are actively exploring the collection of system call-level provenance from unmodified distributed software in order to select a custom-tailored fault injection schedule. Doing so requires extrapolating application-level causal structure from low-level traces, identifying appropriate cut points in an observed execution, and finally synchronizing the execution with fault injection actions.

We are also interested in the possibility of inferring high-level explanations from even noisier signals, such as raw logs. This would allow us to relax the assumption that the systems under study have been instrumented to collect execution traces. While this is a difficult problem, work such as the Mystery Machine¹² developed at Facebook shows great promise.

Toward better models. The LDFI system represents system redundancy using derivation graphs and treats the task of identifying possible bugs as a materialized-view maintenance problem. LDFI was hence able to exploit well-understood theory and mechanisms from the history of data management systems research. But this is just one of many ways to represent how a system provides alternative computations to achieve its expected outcomes.

A shortcoming of the LDFI approach is its reliance on assumptions of determinism. In particular, it assumes that if it has witnessed a computation that, under a particular contingency (that is, given certain inputs and in the presence of certain faults), produces a successful outcome, then any future computation under that contingency will produce the same outcome. That is to say, it ignores the uncertainty in timing that is fundamental to distributed systems. A more appropriate way to model system redundancy would be to embrace (rather than abstracting away) this uncertainty.

Distributed systems are probabilistic by nature and are arguably better modeled probabilistically. Future directions of work include the probabilistic representation of system redundancy and an exploration of how this representation can be exploited to guide the search of fault experiments. We encourage the research community to join in exploring alternative internal representations of system redundancy.
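
One possible reading of a probabilistic redundancy model is sketched below in Python. It is speculative: the article leaves the representation open, and the sketch assumes that each component fails independently with a known probability, which is a strong simplification introduced here. The function name and the numbers are hypothetical.

```python
from itertools import product

def outcome_probability(supports, failure_prob):
    """Probability that at least one support survives, assuming independent failures."""
    components = sorted(failure_prob)
    total = 0.0
    # Enumerate every combination of failed and healthy components.
    for states in product([True, False], repeat=len(components)):
        failed = {c for c, is_failed in zip(components, states) if is_failed}
        weight = 1.0
        for c, is_failed in zip(components, states):
            weight *= failure_prob[c] if is_failed else 1.0 - failure_prob[c]
        if any(not (support & failed) for support in supports):
            total += weight
    return total

# Hypothetical model: a write succeeds if either replica stores it.
supports = [frozenset({"replica_A"}), frozenset({"replica_B"})]
p = outcome_probability(supports, {"replica_A": 0.1, "replica_B": 0.2})
```

For this two-replica model the result equals one minus the probability that both replicas fail; a search procedure could use such estimates to rank fault experiments instead of treating every hypothesis as equally informative.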

Turning the explanations inside out. Most of the classic work on data provenance in database research has focused on aspects related to human-computer interaction. Explanations of why a query returned a particular result can be used to debug both the query and the initial database—given an unexpected result, what changes could be made to the query or the database to fix it? By contrast, in the class of systems we envision (and for LDFI concretely), explanations are part of the internal language of the reasoner, used to construct models of redundancy in order to drive the search through faults.

Ideally, explanations should play a role in both worlds. After all, when a bug-finding tool such as LDFI identifies a counterexample to a correctness property, the job of the programmers has only just begun—now they must undertake the onerous job of distributed debugging. Tooling around debugging has not kept up with the explosive pace of distributed systems development. We continue to use tools that were designed for a single site, a uniform memory, and a single clock. While we are not certain what an ideal distributed debugger should look like, we are quite certain that it does not look like GDB (GNU Project debugger).³⁶ The derivation graphs used by LDFI show how provenance can also serve a role in debugging by providing a concise, visual explanation of how the system reached a bad state.

This line of research can be pushed further. To understand the root causes of a bug in LDFI, a human operator must review the provenance graphs of the good and bad executions and then examine the ways in which they differ. Intuitively, if you could abstractly subtract the (incomplete by assumption) explanations of the bad outcomes from the explanations of the good outcomes,¹⁰ then the root cause of the discrepancy would be likely to be near the “frontier” of the difference.
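
The subtraction intuition can be sketched in Python under simplifying assumptions: provenance graphs are represented as sets of (cause, effect) edges, and the "frontier" is taken to be the edges present only in the good explanation whose cause the bad execution still reached. The function and event names are hypothetical.

```python
def frontier_of_difference(good_edges, bad_edges):
    """Edges only in the good explanation whose source the bad run still reached."""
    missing = good_edges - bad_edges
    reached_in_bad = {node for edge in bad_edges for node in edge}
    return {(src, dst) for (src, dst) in missing if src in reached_in_bad}

# Hypothetical provenance graphs as sets of (cause, effect) edges.
good = {
    ("client_write", "leader_log"),
    ("leader_log", "replica_ack"),
    ("replica_ack", "commit"),
}
bad = {("client_write", "leader_log")}

frontier = frontier_of_difference(good, bad)
```

In this example the bad execution got as far as `leader_log`, so the first missing step, the edge from `leader_log` to `replica_ack`, is where a human would start looking.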

Conclusion

A sea change is occurring in the techniques used to determine whether distributed systems are fault tolerant. The emergence of fault injection approaches such as Chaos Engineering and Jepsen is a reaction to the erosion of the availability of expert programmers, formal specifications, and uniform source code. For all of their promise, these new approaches are crippled by their reliance on superusers who decide which faults to inject.

To address this critical shortcoming, we propose a way of modeling and ultimately automating the process carried out by these superusers. The enabling technologies for this vision are the rapidly improving observability and fault injection infrastructures that are becoming commonplace in the industry. While LDFI provides constructive proof that this approach is possible and profitable, it is only the beginning. Much work remains to be done in targeting faults at a finer grain, constructing more accurate models of system redundancy, and providing better explanations to end users of exactly what went wrong when bugs are identified. The distributed systems research community is invited to join in exploring this new and promising domain.

Copyright held by owners/authors. Publication rights licensed to ACM.
Request permission to publish from permissions@acm.org

DOI

10.1145/3152483

January 2018 Issue

Published: January 1, 2018

Vol. 61 No. 1

Pages: 54-61
