Fault Injection in Production

When we build Web infrastructures at Etsy, we aim to make them resilient. This means designing them carefully so they can sustain their (increasingly critical) operations in the face of failure. Thankfully, there have been a couple of decades and reams of paper spent on researching how fault tolerance and graceful degradation can be brought to computer systems. That helps the cause.

To make sure the resilience built into Etsy systems is sound and that the systems behave as expected, we have to see the failures being tolerated in production.

Why production? Why not simulate this in a QA or staging environment? First, any differences between those environments and production bring uncertainty to the exercise. Second, failing to recover there carries no consequences, which can let unforeseen assumptions creep into the fault-tolerance design and into recovery. The goal is to reduce uncertainty, not increase it.

Forcing failures to happen, or even designing systems to fail on their own, generally is not easily sold to management. Engineers are not conditioned to embrace their ability to respond to emergency situations; they aim to avoid them altogether. Taking a detailed look at how to respond better to failure is essentially accepting that failure will happen, which you might think is counter to what you want in engineering, or in business.

Take, for example, what you would normally think of as a simple case: the provisioning of a server or cloud instance from zero to production:

Bare metal (or cloud-compute instance) is made available.
Base operating system is installed via PXE (preboot execution environment), or machine image.
Operating-system-level configurations are put into place (via configuration management or machine image).
Application-level configurations are put into place (via configuration management, app deployment, or machine image).
Application code is put into place and underlying services are started correctly (via configuration management, app deployment, or machine image).
Systems integration takes place in the network (load balancers, VLANs, routing, switching, DNS, among others).

This is probably an oversimplification, and each step or layer is likely to represent a multitude of CPU cycles; disk, network, and/or memory operations; and a variety of software mechanisms. All of these come together to bring a node into production.

Operability means that you can have confidence in this node coming into production, possibly joining a cluster, and serving live traffic seamlessly every time it happens. Furthermore, you want and expect to have confidence that if the underlying power, configuration, application, or compute resources (CPU, disk, memory, network, and so on) experience a fault, then you can survive such a fault by some means: allowing the application to degrade gracefully, rebuild itself, take itself out of production, and alert on the specifics of the fault.

Building this confidence typically comes in a number of ways:

Hardware burn-in testing. You can run extreme tests on the various hardware components in a node in order to confirm that none of them would experience faults at the onset of load. This may not be necessary or feasible in a cloud-compute instance.
Unit testing of components. Each service can be easily tested in isolation, and configuration can be check-summed to assure expectations.
Functional testing of integrations. Each execution path (usually based on an application feature) can be explored with some form of automated procedure to assure expected results.
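
As an illustration of check-summing configuration, here is a minimal Python sketch. It assumes the configuration is available as text and that a checksum was recorded when the configuration was reviewed; the function names and the sample settings are hypothetical and not taken from Etsy's tooling.

```python
import hashlib


def config_checksum(config_text: str) -> str:
    """Return the SHA-256 hex digest of a configuration's contents."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


def config_matches(config_text: str, expected_checksum: str) -> bool:
    """True when the deployed configuration matches the recorded checksum."""
    return config_checksum(config_text) == expected_checksum


# Hypothetical settings: record a checksum at review time, verify at deploy time.
reviewed = "max_connections = 100\ntimeout_seconds = 2\n"
recorded = config_checksum(reviewed)

deployed_ok = config_matches(reviewed, recorded)
deployed_drifted = config_matches(reviewed.replace("100", "500"), recorded)
```

A check like this confirms only that the configuration is the one that was expected, not that the expected configuration behaves well under failure.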

Traditionally, these sensible measures to gain confidence are made before systems or applications reach production. Once in production, the traditional approach is to rely on monitoring and logging to confirm that everything is working correctly. If it is behaving as expected, then you do not have a problem. If it is not, and it requires human intervention (troubleshooting, triage, resolution, and so on), then you need to react to the incident and get things working again as fast as possible.

This implies that once a system is in production, “Don’t touch it!”—except, of course, when it’s broken, in which case touch it all you want, under the time pressure inherent in an outage response.

This approach is not as fruitful as it could be, on a number of levels.

In the field, you need to prepare for ill-behaved circumstances. Power can get cut abruptly. Changes to application or configuration can produce unforeseen behaviors, no matter how full the coverage of testing. Application behavior under various resource-contention conditions (think traffic spikes from news events or firehose-like distributed denial-of-service attacks) can have surprising results. This is not a purely academic curiosity; these types of faults can (and will) affect production and, therefore, in Etsy’s case, our sellers and our business. These types of events, however, are difficult to model and simulate with an accuracy that would inspire confidence surrounding the behavior with unknown failure pathologies.

The challenge is that Web systems (like many “complex” systems) are largely intractable, meaning that:

A full description requires many details, not few;
The rate of change is high; the systems change before a full description (and therefore understanding) can be completed;
How components function is partly unknown, as they resonate with each other across varying conditions; and
Processes are heterogeneous and possibly irregular.

In other words, while testing outside of production is a very proper approach, it is incomplete because some behaviors can be seen only in production, no matter how identical a staging environment can be made.

Therefore, another option must be added to the confidence-gaining arsenal: fault injection exercises, sometimes referred to as GameDay. The goal is to make these faults happen in production in order to anticipate similar behaviors in the future, understand the effects of failures on the underlying systems, and ultimately gain insight into the risks they pose to the business.

Causing failures to happen in complex systems is not a new concept. Organizations such as fire departments have been running full-scale disaster drills for decades. Web engineering has an advantage over these types of drills in that the systems engineers can gather a massive amount of detail on any fault at an extremely high resolution, wield a very large amount of control over the intricate mechanisms of failures, and learn how to recover very quickly from them.

Fault Injection

Constructing a GameDay exercise at Etsy follows this pattern:

1. Imagine a possible untoward event in your infrastructure.
2. Figure out what is needed to prevent that event from affecting your business, and implement that.
3. Cause the event to happen in production, ultimately to prove the non-effect of the event and gain confidence surrounding it.

The greatest advantage of a GameDay exercise is figuring out how to prevent a failure from affecting the business. It is difficult to overstate the importance of steps 1 and 2. The idea is to get a group of engineers together to brainstorm the various failure scenarios that a particular application, service, or infrastructure could experience. This will help remove complacency in the safety of the overall system. Complacency is an enemy of resilience. If a system has a period of little or no degradation, then there is a real risk of it drifting toward failure on multiple levels, because engineers can be convinced—falsely—that the system is experiencing no surprising events because it is inherently safe.

Imagining failure scenarios and asking, “What if…?” can help combat this thinking and bring a constant sense of unease to the organization. This is a hallmark characteristic of high-reliability organizations. Think of it as continuously deploying a business continuity plan (BCP).

Business Justification

In theory, the idea of GameDay exercises may seem sound: you make an explicit effort to anticipate failure scenarios, prepare for handling them gracefully, and then confirm this behavior by purposely injecting those failures into production. In practice, this idea may not seem appealing to the business: it brings risk to the forefront, and without context, the concept of making failures happen on purpose may seem crazy. What if something goes wrong?

The traditional view of failure in production is avoidance at all costs. The assumption is that failure is entirely preventable, and if it does happen, then find the persons responsible (usually those most proximate to the code or systems) and fire them, in the belief that getting rid of “bad apples” is how you bring safety to an organization.

This perspective is, of course, ludicrous. Fault injection and GameDay scenarios can turn this view into a more pragmatic and realistic one.

When approaching Etsy’s executive team with the idea of GameDay exercises, I explained that it is not that we want to cause failures out of some perverse need to watch infrastructure crumble; it is because we know that parts of the system will inevitably fail, and we need to gain confidence that the system is resilient enough to handle it gracefully.

The concept, I explained to the executives, is that building resilient systems requires experience with failure, and that we want to anticipate and confirm our expectations surrounding failure more often, not less often. Shying away from the effects of failure in a misguided attempt to reduce risk will result in poor designs, stale recovery skills, and a false sense of safety.

In other words, it is better to prepare for and cause failures to happen in production while we are watching, instead of relying on a strategy of hoping the system will behave correctly when we are not watching. The worst-case scenario with a GameDay exercise is that something will go wrong during the exercise. In that case, an entire team of engineers is ready to respond to the surprises, and the system will become stronger as a result.

The worst-case scenario in the absence of a GameDay exercise is that something in production will fail that was not anticipated or prepared for, and it will happen when the team is not expecting or watching closely for it.

How can you assure that injecting faults into a live production system doesn’t affect actual traffic, revenue, and the end-user experience? This can be done by treating the fault-tolerating and graceful degradation mechanisms as if they were features. This means bringing all of the other confidence-building techniques (unit and functional testing, staging hardware environments, among others) to these resilience measures until you are satisfied. Just as with every other feature of the application, it is not finished until you have deployed it to production and have verified that it is working correctly.

Case: Payments System

Earlier this year Etsy rolled out a new payment system (http://www.etsy.com/blog/news/2012/announcing-direct-checkout/) to provide more flexibility and reliability for buyers and sellers on the site. Obviously, resilience was of paramount importance to the success of the project. As with many Etsy features, the rollout to production was done in a gradual ramp-up. Sellers interested in allowing this new payment method could opt in, and Etsy would turn the functionality on for buckets of sellers at a time.

As you might imagine, the payment system is not particularly simple. It has fraud-detection components, audit trails, security mechanisms, processing-state machines, and other components that need to interact with each other. Thus, Etsy has a mission-critical system with a significant amount of complexity and whose expectations for being resilient are very high.

To confirm its ability to withstand failures gracefully, Etsy put together a list of reasonable scenarios to prepare for, develop against, and test in production, including the following:

One of the app servers dies (power cable yanked out);
All of the app servers leave the load-balancing pool;
One of the app servers gets wiped clean and needs to be fully rebuilt from scratch;
Database dies (power cable yanked out and/or process is killed ungracefully);
Database is fully corrupt and needs full restore from backup;
Offsite database replica is needed to investigate/restore/replay single transactions; and
Connectivity to third-party sites is cut off entirely.

The engineers then put together all of the expectations for how the system would behave if these scenarios occurred in production, and how they could confirm these expectations with logs, graphs, and alerts. Once armed with these scenarios, they worked on how to make these failures either:

Not matter at all (transparently recover and continue on with processing);
Matter only temporarily (gracefully degrade with no data loss and provide constructive feedback to the user); or
Matter only to a minimal subset of users (including an audit log for reconstructing and recovering quickly and possibly automatically).

After these mechanisms were written and tested in development, the time came to test them in production. The Etsy team was cognizant of how much activity the system was seeing; the support and product groups were on hand to help with any necessary communication; and team members went through each of the scenarios, gathering answers to questions such as:

Were they successful in transparently recovering, through redundancy, replication, queuing?
How long did each process take—in the case of rebuilding a node automatically from scratch, recovering a database?
Could they confirm that no data was lost during the entire exercise?
Were there any surprises?
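
The "how long did it take" question above can be answered with a small timing helper. The following Python sketch polls a health check until it passes or a deadline expires; `measure_recovery`, its parameters, and the simulated check are hypothetical names for illustration, and a real exercise would substitute whatever health signal the system already exposes.

```python
import time
from typing import Callable, Optional


def measure_recovery(
    is_healthy: Callable[[], bool],
    deadline_seconds: float,
    poll_seconds: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Return seconds until is_healthy() first passes, or None on deadline."""
    start = clock()
    while True:
        if is_healthy():
            return clock() - start
        if clock() - start >= deadline_seconds:
            return None
        sleep(poll_seconds)


# Simulated check (hypothetical): unhealthy twice, then recovered.
checks = iter([False, False, True])
elapsed = measure_recovery(lambda: next(checks), deadline_seconds=5.0,
                           poll_seconds=0.01)
```

The clock and sleep functions are passed in as parameters so the helper can be exercised without waiting in real time. Recording the measured value for each scenario makes it possible to compare it against the expectation written down before the exercise.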

The team was able to confirm most of the expected behaviors, and the Etsy community (sellers and buyers) was able to continue with its experience on the site, unimpeded by failure.

There were some surprises along the way, however, which the Etsy team took as remediation items coming out of the meeting. First, during the payments process, a third-party fraud-detection service was contacted with information about the transaction. While Etsy uses a number of external APIs (fraud or device reputation), this particular service had no specified timeout on the external call. When testing the inability to contact the service, the Etsy team used firewall rules both to hard close the connection and to attempt to hang it open. Having no specified timeout meant they were relying on the default, which was much too long at 60 seconds. The intended behavior was to fail open, which meant the transaction could continue if the external service was down. This worked, but only after the 60-second timeout, which caused live payments to take longer than necessary during the exercise.

This was both a surprise and a relatively easy piece to fix, but it was nonetheless an oversight that affected production during the test.
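
The fix described here, an explicit timeout combined with fail-open behavior, can be sketched in Python as follows. This is an illustration under assumptions rather than Etsy's implementation: the `check_fraud` function, the `FraudDecision` type, the stub services, and the two-second timeout are all hypothetical (the source says only that the 60-second default was much too long).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FraudDecision:
    allow: bool            # may the transaction continue?
    service_reached: bool  # False means the decision was made by failing open


def check_fraud(
    transaction: dict,
    call_service: Callable[[dict, float], bool],
    timeout_seconds: float = 2.0,  # assumed value; never rely on a default
) -> FraudDecision:
    """Ask the external service whether a transaction is fraudulent."""
    try:
        is_fraud = call_service(transaction, timeout_seconds)
    except (TimeoutError, ConnectionError):
        # Fail open: the service is unreachable, so let the payment continue.
        return FraudDecision(allow=True, service_reached=False)
    return FraudDecision(allow=not is_fraud, service_reached=True)


# Hypothetical stand-ins for the third-party service.
def hanging_service(transaction: dict, timeout: float) -> bool:
    raise TimeoutError(f"no response within {timeout} seconds")


def healthy_service(transaction: dict, timeout: float) -> bool:
    return False  # reports "not fraud"


degraded = check_fraud({"amount": 20}, hanging_service)
normal = check_fraud({"amount": 20}, healthy_service)
```

Keeping `service_reached` in the result lets the caller log or graph every fail-open decision, so the degraded path is visible rather than silent.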

Recovering from database corruption also took longer than expected. The GameDay exercise was performed on one side of a master-master pair of databases, and while the recovery happened on the corrupted server, the remaining server in the pair took all reads and writes for production. While no production data loss occurred, exposure with reduced capacity occurred for longer than expected, so the Etsy team began to profile and then try to reduce this time of recovery.

The cultural effect of the exercise was palpable. It greatly decreased anxiety surrounding the ramp-up of the payments system; it exposed a few darker-than-desired corners of the code and infrastructure to improve; and it brought an overall increase in confidence in the system. Complacency is not an immediate threat to the system as a result.

Limitations

The goal of fault injection and GameDay exercises is to increase confidence in an otherwise complicated or complex system’s ability to stay resilient, but they have limitations.

First, the exercises are not meant to inform how engineering teams handle working under time pressure with escalating and sometimes disorienting scenarios. That needs to come from the postmortems of actual incidents, not from handling faults that have been planned and designed for.

The faults and failure modes are contrived. They reflect the fault designer’s imagination and therefore cannot be considered comprehensive enough to give perfect coverage of the system’s safety. While any increase in confidence in the system’s resilience is positive, it is still just that: an increase, not the attainment of perfect confidence. Any complex system can (and will) fail in surprising ways, no matter how many different types of faults you inject and recover from.

Some have suggested that continually introducing failures automatically is a more efficient way to gain confidence in the adaptability of the system than manually running GameDay exercises as an engineering-team event. Both approaches have the same limitation mentioned here, in that they result in an increase in confidence but cannot be used to achieve sufficient safety coverage.

Automated fault injection can carry with it a paradox. If the faults that are injected (even at random) are handled in a transparent and graceful way, then they can go unnoticed. You would think this was the goal: for failures not to matter whatsoever when they occur. This masking of failures, however, can result in the very complacency they intend (or at least should intend) to decrease. In other words, when you have randomly generated and/or continual fault injection and recovery happening successfully, care must be taken to raise detailed awareness that this is happening: when, how, and where. Otherwise, the failures themselves become another component that increases complexity in the system while still having limitations to their functionality (because they are still contrived and therefore insufficient).
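
One way to keep automated injection from going unnoticed is to announce every fault before it is injected. The Python sketch below shows the idea under stated assumptions: the host names, fault names, and the `announce` callback are hypothetical placeholders, and the faults themselves are no-ops standing in for real mechanisms.

```python
import random
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Dict, List


@dataclass
class InjectionRecord:
    when: datetime  # when the fault was injected (UTC)
    where: str      # which target received it
    how: str        # which fault was used


def inject_random_fault(
    targets: List[str],
    faults: Dict[str, Callable[[str], None]],
    announce: Callable[[InjectionRecord], None],
    rng: random.Random,
) -> InjectionRecord:
    """Pick a target and a fault at random, announce it, then inject it."""
    where = rng.choice(targets)
    how = rng.choice(sorted(faults))
    record = InjectionRecord(when=datetime.now(timezone.utc), where=where, how=how)
    announce(record)    # announce first, so no injection is ever silent
    faults[how](where)  # then perform the (here simulated) fault
    return record


# Hypothetical targets and no-op faults, for illustration only.
announcements: List[InjectionRecord] = []
targets = ["app01", "app02"]
faults = {"kill_process": lambda host: None, "drop_network": lambda host: None}

record = inject_random_fault(targets, faults, announcements.append,
                             random.Random(0))
```

In practice the `announce` callback might post to a chat channel or annotate a graph, so that engineers can see when, how, and where each fault occurred even if recovery was transparent.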

Fear

A lot of what I am proposing should simply be an extension of the confidence-building tools that organizations already have. Automated quality assurance, fault tolerance, redundancy, and A/B testing are all in the same category as GameDay scenarios, although likely with less drama.

Should everything have an associated GameDay exercise? Maybe, or maybe not, depending on the level of confidence you have in the components, interactions, and levels of complexity found in your application and infrastructure. Even if your business does not think that GameDay exercises are warranted, however, they ought to have a place in your engineering toolkit.

Safety Vaccines

Why would you introduce faults into an otherwise well-behaved production system? Why would that be useful?

First, these failure-inducing exercises can serve as “vaccines” to improve the safety of a system: a small amount of failure is injected to help the system learn to recover. They also keep a concern of failure alive in the culture of engineering teams, and they keep complacency at bay.

Second, the exercises gather groups of people who might not normally get together to share in experiencing failures and to build fault tolerance. They can also help bring the concept of operability in production closer to developers who might not be used to it.

At a high level, production fault injection should be considered one of many approaches used to gain confidence in the safety and resilience of a system. Similar to unit testing, functional testing, and code review, this approach is limited as to which surprising events it can prevent, but it also has benefits, many of which are cultural. We certainly cannot imagine working without it.