Failure is part of engineering any large-scale system. One of Facebook's cultural values is embracing failure. This can be seen in the posters hung around the walls of our Menlo Park headquarters: "What Would You Do If You Weren't Afraid?" and "Fortune Favors the Bold."
To keep Facebook reliable in the face of rapid change we study common patterns in failures and build abstractions to address them. These abstractions ensure best practices are applied across our entire infrastructure. To guide our work in building reliability abstractions, we must understand our failures. We do this by building tools to diagnose issues and by creating a culture of reviewing incidents in a way that pushes us to make improvements that prevent future failures.
While every failure has a unique story, many failures boil down to a small number of fundamental root causes.
Individual machine failures. Often an individual machine will run into an isolated failure that does not affect the rest of the infrastructure. For example, maybe a machine's hard drive has failed, or a service on a particular machine has experienced a bug in code, such as memory corruption or a deadlock.
The key to avoiding individual machine failure is automation. Automation works best by combining known failure patterns (such as a hard drive with S.M.A.R.T. errors) with a search for symptoms of an unknown problem (for example, by swapping out servers with unusually slow response times). When automation finds symptoms of an unknown problem, manual investigation can help develop better tools to detect and fix future problems.
Legitimate workload changes. Sometimes Facebook users change their behavior in a way that poses challenges for our infrastructure. During major world events, for example, unique types of workloads may stress our infrastructure in unusual ways. When Barack Obama won the 2008 U.S. Presidential election, his Facebook page experienced record levels of activity. Climactic plays in major sporting events such as the Super Bowl or World Cup result in an extremely high number of posts. Load testing, including "dark launches" where a feature is activated but not visible to the user, helps ensure new features are able to handle load.
Statistics gathered during such events often provide a unique perspective on a system's design. Oftentimes, major events cause changes in user behavior (for example, by creating focused activity around a particular object). Data about these changes often points to design decisions that will allow smoother operation in subsequent events.
Human error. Given that Facebook encourages engineers to "Move Fast and Break Things"—another one of the posters that adorns the offices— one might expect many errors are caused by humans. Our data suggests that human error is a factor in our failures. Figure 1 includes data from an analysis of the timing of events severe enough to be considered a service-level agreement (SLA) violation. Each violation indicates an instance where our internal reliability goals were not met and caused an alert to be generated. Because our goals are strict most of these incidents are minor and not noticeable to users of the site. Figure 1a shows how incidents happened substantially less on Saturday and Sunday even though traffic to the site remains consistent throughout the week. Figure 1b shows a six-month period during which there were only two weeks with no incidents: the week of Christmas and the week when employees are expected to write peer reviews for each other.
These two data points seem to suggest that when Facebook employees are not actively making changes to infrastructure because they are busy with other things (weekends, holidays, or even performance reviews), the site experiences higher levels of reliability. We believe this is not a result of carelessness on the part of people making changes but rather evidence our infrastructure is largely self-healing in the face of non-human causes of errors such as machine failure.
While failures have different root causes, we have found three common pathologies that amplify failures and cause them to become widespread. For each pathology, we have developed preventative measures that mitigate widespread failure.
Rapidly deployed configuration changes. Configuration systems tend to be designed to replicate changes quickly on a global scale. Rapid configuration change is a powerful tool that can let engineers quickly manage the launch of new products or adjust settings. However, rapid configuration change means rapid failure when bad configurations are deployed. We use a number of practices to prevent configuration changes from causing failure.
For reliability purposes, however, A/B tests do not satisfy all of our needs. A change deployed to a small number of users, but causing implicated servers to crash or run out of memory, will obviously create impact that goes beyond the limited users in the test. A/B tests are also time consuming. Engineers often wish to push out minor changes without the use of an A/B test. For this reason, Facebook infrastructure automatically tests out new configurations on a small set of servers. For example, if we wish to deploy a new A/B test to 1% of users, we will first deploy the test to 1% of the users that hit a small number of servers. We monitor these servers for a short amount of time to ensure they do not crash or have other highly visible problems. This mechanism provides a basic "sanity check" on all changes to ensure they do not cause widespread failure.
Hard dependences on core services. Developers tend to assume that core services—such as configuration management, service discovery, or storage systems—never fail. Even brief failures in these core services, however, can turn into large-scale incidents.
Increased latency and resource exhaustion. Some failures result in services having increased latency to clients. This increase in latency could be small (for example, think of a human configuration error that results in increased CPU usage that is still within the service's capacity), or it could be nearly infinite (a service where the threads serving responses have deadlocked). While small amounts of additional latency can be easily handled by Facebook's infrastructure, large amounts of latency lead to cascading failures. Almost all services have a limit to the number of outstanding requests. This limit could be due to a limited number of threads in a thread-per-request service, or it could be due to limited memory in an event-based service. If a service experiences large amounts of extra latency, then the services that call it will exhaust their resources. This failure can be propagated through many layers of services, causing widespread failure.
To guide our work in building reliability abstractions, we must understand our failures. We do this by building tools to diagnose issues and by creating a culture of reviewing incidents in a way that pushes us to make improvements that prevent future failures.
Resource exhaustion is a particularly damaging mode of failure because it allows the failure of a service used by a subset of requests to cause the failure of all requests. For example, imagine a service calls a new experimental service that is only launched to 1% of users. Normally requests to this experimental service take 1 millisecond, but due to a failure in the new service the requests take 1 second. Requests for the 1% of users using this new service might consume so many threads that requests for the other 99% of users are unable to run.
We have found a number of techniques that can avoid this type of buildup with a low false positive rate.
In this algorithm, if the queue has not been empty for the last
N milliseconds, then the amount of time spent in the queue is limited to
M milliseconds. If the service has been able to empty the queue within the last
N milliseconds, then the time spent in the queue is limited to
N milliseconds. This algorithm prevents a standing queue (because the
lastEmptyTime will be in the distant past, causing an
M-ms queuing timeout) while allowing short bursts of queuing for reliability purposes. While it might seem counterintuitive to have requests with such short timeouts, this process allows requests to be quickly discarded rather than build up when the system is not able to keep up with the rate of incoming requests. A short timeout ensures the server always accepts just a little bit more work than it can actually handle so it never goes idle.
An attractive property of this algorithm is the values of
N tend not to need tuning. Other methods of solving the problem of standing queues, such as setting a limit on the number of items in the queue or setting a timeout for the queue, have required tuning on a per-service basis. We have found a value of five milliseconds for
M and 100ms for
N tends to work well across a wide set of use cases. Facebook's open source Wangle library5 provides an implementation of this algorithm, which is used by our Thrift4 framework.
Despite the best preventative measures, some failures will always occur. During outages the right tools can quickly lead to the root cause, minimizing the duration of the failure.
High-density dashboards with Cubism. When handling an incident, it is important to have quick access to information. Good dashboards allow engineers to quickly assess the types of metrics that might be abnormal and then use this information to hypothesize a root cause. We found, however, that our dashboards grew so large it was difficult to navigate them quickly, and that charts shown on those dashboards had too many lines to read at a glance.
To address this, we built our top-level dashboards using Cubism,2 a framework for creating horizon charts—line charts that use color to encode information more densely, allowing for easy comparison of multiple similar data series. For example, we use Cubism to compare metrics between different data centers. Our tooling around Cubism allows for easy keyboard navigation so engineers can view multiple metrics quickly. Figure 3 shows the same dataset at various heights using area charts and horizon charts. In the area chart version, the 30-pixel version is difficult to read. On the other hand, the horizon chart makes it extremely easy to find the peak value, even at a height of 30 pixels.
What just changed? Since one of the top causes of failure is human error, one of the most effective ways of debugging failures is to look for what humans have changed recently. We collect information about recent changes ranging from configuration changes to deployments of new software in a tool called OpsStream. However, we have found over time this data source has become extremely noisy. With thousands of engineers making changes, there are often too many to evaluate during an incident.
To solve this problem, our tools attempt to correlate failures with relevant changes. For example, when an exception is thrown, in addition to outputting the stack trace, we output any configuration settings read by that request that have had their values changed recently. Often, the cause of an issue that generates many stack traces is one of these configuration values. We can then quickly respond to the issue—for example, by reverting the configuration and involving the engineer who made the change.
After failures happen, our incident-review process helps us learn from these incidents.
The goal of the incident-review process is not to assign blame. Nobody has been fired because an incident he or she caused came under review. The goal of the review is to understand what happened, remediate situations that allowed the incident to happen, and put safety mechanisms in place to reduce the impact of future incidents.
A methodology for reviewing incidents. Facebook has developed a methodology called DERP (for detection, escalation, remediation, and prevention) to aid in productive incident reviews.
DERP helps analyze every step of the incident at hand. With the aid of this analysis, even if you cannot prevent this type of incident from happening again, you will at least be able to recover faster the next time.
A "move-fast" mentality does not have to be at odds with reliability. To make these philosophies compatible, Facebook's infrastructure provides safety valves: our configuration system protects against rapid deployment of bad configurations; our core services provide clients with hardened APIs to protect against failure; and our core libraries prevent resource exhaustion in the face of latency. To deal with the inevitable issues that slip through the cracks, we build easy-to-use dashboards and tools to help find recent changes that might cause the issues under investigation. Most importantly, after an incident we use lessons learned to make our infrastructure more reliable.
Automating Software Failure Reporting
Improving Performance on the Internet
1. CoDel (controlled delay) algorithm; http://queue.acm.org/detail.cfm?id=2209336.
2. Cubism; https://square.github.io/cubism/.
3. HipHop Virtual Machine (HHVM); http://bit.ly/1Qw68bz
4. Thrift framework; https://github.com/facebook/fbthrift.
5. Wangle library; https://github.com/facebook/wangle/blob/master/wangle/concurrent/Codel.cpp.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2015 ACM, Inc.
No entries found