From the EDVAC to WEBVACs

By now everyone has heard of cloud computing and realized it is changing how traditional enterprise IT and emerging startups are building solutions for the future. Is this trend toward the cloud just a shift in the complicated economics of the hardware and software industry, or is it a fundamentally different way of thinking about computing? Having worked in the industry, I can confidently say it is both.

Most articles on cloud computing focus too much on the economic aspects of the shift and miss the fundamental changes in thinking. This article attempts to fill the gap and help a wider audience better appreciate some of the more fundamental issues related to cloud computing. Much of what is written here should not be earth shattering to those working with these systems day to day, but the article may encourage even expert practitioners to look at their day-to-day issues in a more nuanced way.

Here are the key points to be covered:

A cloud computer is composed of so many components that systematically dealing with the failure of those components is an essential factor to consider when thinking about the software that runs on a cloud computer or interacts with one.
A common architectural pattern used to deal with transient failures is to divide the system into a pure computational layer and a separate layer than maintains critical system state. This provides reliability, scalability, and simplicity.
A well-established existing best practice is to have systems expose idempotent interfaces so simple retry logic can be used to mask most transient failures.
Simple analytic techniques can allow quantitative statements about various retry policies and compare how they impact reliability, worst-case latency, and average-case latency under an idealized failure model.

The first point about dealing with failure may seem new to many who are now hosting even small applications on large multitenant cloud computers in order to benefit from economies of scale. This is actually a very old issue, however, so the discussion should begin not by talking about the latest trends but by going back to the early years of the electronic computer.

Lessons from EDVAC and Von Neumann

In 1945 John von Neumann described the computational model of the first fully electronic stored program computer. This was a side effect of him acting as a consultant with ENIAC inventors John Mauchly and J. Presper Eckert. Although he was not the originator of many of the key ideas, his name is associated with the design approach, and the von Neumann architecture was soon a standard design used for building electronic computers. The original EDVAC (electronic discrete variable automatic computer) draft report⁶ contains these interesting passages from von Neumann:

1.4 The remarks of 1.2 on the desired automatic functioning of the device must, of course, assume it functions faultlessly. Malfunctioning of any device has, however, always a finite probability—and for a complicated device and a long sequence of operations it may not be possible to keep this probability negligible. Any error may vitiate the entire output of the device. For the recognition and correction of such malfunctions intelligent human intervention will in general be necessary.

However, it may be possible to avoid even these phenomena to some extent. The device may recognize the most frequent malfunctions automatically, indicate their presence and location by externally visible signs, and then stop. Under certain conditions it might even carry out the necessary correction automatically and continue …

3.3 In the course of this discussion the viewpoints of 1.4, concerned with the detection, location, and under certain conditions even correction, of malfunctions must also receive some consideration. That is, attention must be given to facilities for checking errors. We will not be able to do anything like full justice to this important subject, but we will try to consider it at least cursorily whenever this seems essential.⁵

The practical problems that concerned von Neumann and the designers of the EDVAC in 1945 were the reliability of vacuum tubes and the main memory stored with mercury delay lines. (A modern hard drive is an amazing electromechanical device as well, which finally is starting to be replaced with solid-state memory.) The invention of the transistor, integrated circuit, and error-correcting codes makes von Neumann’s concerns seem quaint today. Single-bit errors and even multi-bit errors in computer systems, while still possible, are sufficiently rare that these problems are considered inessential. The ability to consider failure of system components, however, can no longer be ignored with the advent of cloud computers that fill acres of space with commodity servers.

Murphy’s Law Triumphs over Moore’s Law

A cloud computer is composed of so many components, each with a finite probability of failure, that the probability of all the components running without error at any point in time is close to zero. Failure and automated recovery are hence essential areas of concern not only at the hardware layer but also with the software components. In short, we are at the point where Murphy’s Law has conquered Moore’s Law. Assuming all the components make a system work flawlessly is a luxury that is no longer possible. Fortunately, techniques for handling bit-level corruptions can be adjusted and scaled to cloud computers, so most bit-level errors can be detected, if not fixed. The types of failures we are worried about, however, are those of a server or whole groups of servers. We are also at a point where the rates of certain failures are so high that von Neumann’s suggestion that the system simply detect the error and wait for a human operator to intervene is no longer economically sensible.

You might ask if we can work harder to build more reliable systems, but when the reliability of your main power supply is inversely proportional to the density of wire-eating squirrels in your region or the probability that a worker will drop an uninsulated wrench into a power-distribution cabinet, it is difficult to imagine a cost-effective and systematic approach to address the reliability of data centers that commonly house as many as 100,000 servers.

From EDVAC to WEBVAC

One very popular approach to dealing with frequent server-level failures in a data center is to decompose the system into one or more tiers of servers that process requests on a best-effort basis and store any critical application state in a dedicated storage tier. Typically, there is a request load balancer in front of each tier so that the individual servers in the tier can fail and have requests rerouted automatically. The key aspect of this design is a complete separation between long-term system state and computation. This is the same separation that exists between the processor and memory in the EDVAC design. For lack of a better term let’s call these systems WEBVACs, for Worldwide Elastic Big Very Automated Clusters.

These WEBVACs are not conceptually different from the farms of Web servers seen today in a traditional data centers. WEBVACs use a proven architecture that provides resiliency, scalability, and a very familiar programming model based on stateless HTTP requests. The chief innovation is the degree and ease of configurability, as well as elasticity and scale. One important feature of EDVAC that distinguished it from earlier computers such as ENIAC was that EDVAC executed a program that was stored in its memory as data, while ENIAC was programmed by physically rewiring it for different problems. Figure 1 shows the conceptual similarity to both WEBVACs and the EDVAC.

Like EDVAC, modern cloud computers allow for the automatic configuration of a complete server farm with a few simple artifacts. This eliminates the need for tedious and error-prone manual configuration of servers, as is often done in more traditional systems.

Building a reliable storage tier that meets the needs of the compute tier is a challenging task. Requests in the storage tier need to be replicated across several servers using complex distributed consensus protocols. There is a wide range of approaches to building storage tiers, as well as a great diversity in their APIs and consistency models. (It is difficult to do justice to this topic in the limited space here, so see the additional reading suggestions at the end of this article.) In the end, however, the storage tier is just an abstraction that is callable from the compute tier. The compute tier can rely on the guarantees provided by the storage tier and therefore uses a much simpler programming model.

This simpler programming model, in which all the important state of the system is stored in a generic storage tier, also simplifies disaster-recovery scenarios since simple backup and restore of the storage tier is often sufficient to restore an entire system into a working state. Well-designed systems have asynchronous continuous backup of the storage tier to a replica in a physically different location. This location needs to be close enough that data can be efficiently and cost-effectively replicated, but distant enough that the probability of it encountering the same “act of God” is low. (Putting both your primary and backup data centers near the same earthquake fault line is a bad idea.)

Since the backup is asynchronous, failover to the replica may incur some data loss. That data loss, however, can be bounded to acceptable and well-defined limits that come into play only when an act of God may cause the complete destruction of the primary system. Carefully determining the physical location of your data centers is the first case where there is a need to treat failure in an end-to-end way. This same end-to-end focus on failure is also important in the design and implementation of software running on and interacting with the system.

How to Design Interfaces for WEBVACs

WEBVACs ultimately provide APIs that allow desktop computers, mobile devices, or other WEBVACs to submit requests and receive responses to those requests. In any case, you end up with two agents that must communicate with each other via some interface over an unreliable channel. Reusing traditional designs from client-server systems or standard RPC (remote procedure call) methods is not the best approach. Andrew Tanebaum and Robbert van Resse⁴ describe some common pitfalls when doing naïve refactoring of code not designed for distributed scenarios, which are generally applicable to the APIs here as well. One particular problem they call out is the 2AP (two-army problem), which demonstrates it is impossible to design a fully reliable method for two agents to reach consensus over an unreliable channel that may silently drop messages.

This is a restricted version of the more general problem of dealing with Byzantine failure, where the failure does not include data corruption. As a consequence, there is simply no way of building a system that can process any request with 100% reliability if the channel itself is unreliable. The 2AP result, however, does not rule out protocols that asymptotically approach 100% reliability. A simple solution is continually transmitting a request up to some finite bound until some acknowledgment is received. If the error rate of the channel is fixed and failures are independent, then the likelihood of success increases exponentially with the number of transmissions.

In a large data center, not only is the communication between servers unreliable, but also the servers themselves are prone to failure. If a server in the compute tier fails, then a request that targeted it can be quickly rerouted to an equivalent compute server. The process of rerouting the request is often not fully transparent and the request is lost during rerouting, because the routing logic cannot immediately detect the server failure or because the server was in the middle of processing a request when it failed. These lost requests appear to users as transient faults.

In the context of cloud computing, therefore, the observed request failure rate is really the combined error rate of the communication channel and the failure rate of the servers involved in the computation. Rather than reasoning about the individual failure rates of several components, you can make the simplifying assumption that a system of two unreliable agents communicating over an unreliable channel is equivalent to two idealized reliable agents communicating over an unreliable channel whose failure rate is increased appropriately to account for the failure of either of the original unreliable agents. An extended example illustrates this in more detail in the following section.

Enumerating a set of files over an unreliable channel. Figure 2 shows a simple interface definition in ANSI C that can be used to enumerate a set of file names. The interface has been exposed without careful consideration for failure, beyond the introduction of a new status code Fault, which indicates a failure likely caused by unreliable delivery. Assume that calling any one of these functions sends a request and waits synchronously for a response. The assumption is that the Fault status is returned if no response to a request is received after some fixed timeout.

Figure 3 illustrates a simple client-side function that attempts to enumerate all the files but returns immediately on the first Fault received by any call to the primitive functions in Figure 2.

You want to estimate the probability this function will return Ok under the assumption that calling any of the three functions mentioned earlier has a success rate of 0.99999 (S), which is to say that on average one out of a million invocations of the functions returns Fault. First you need to compute how many requests (M) are required to enumerate N files. Inspection of the code reveals that M is equal to

which can be simplified to

Since the function fails immediately on any fault, the probability of no faults is simply the probability that all the requests sent succeed, which is S^M assuming failures are uniformly distributed. For purposes of this analysis, let’s assume failure is independent and uniformly distributed. This simplifying assumption allows a comparison of the trade-offs of various approaches under equivalent ideal failure models. In practice, however, the distribution of failures is typically neither uniform nor completely independent. The results are summarized in Figure 4.

Depending on the workload characteristics, this success rate for the first attempt at listing files may be acceptable. In this example, the success rate of 0.99999 (a five-nines success rate results in fewer than 5.3 minutes of downtime a year for a continuously running system) is extremely high and typically can be achieved only with significant investment in expensive hardware infrastructure. A more realistic error rate would be 0.999 (a three-nines success rate results in fewer than 8.8 hours of downtime a year for a continuously running system), which is more typically seen with commodity components. A three-nines success rate produces the graph and table of values in Figure 5.

Clearly, a 3% failure rate for enumerating 10 files is not a usable system. You can improve the probability of success by simply retrying the whole function, but this is not only inefficient, but also, for large N, the success rate is so low that it would require an unreasonable number of retries. If the probability of the function ListFilesStopAtAnyFault enumerating N files successfully is

then the probability of failure is

The probability that after, at most, K retries the function succeeds is the same as the probability

that is the complement of the probability that all invocations fail. For this discussion, if the probability of success is at least 0.999, when N is 100, you must retry, on average, five times; when N is 1,000, the number of retries is at least 50 to get a three-nines success rate; for N = 10,000, the number is close to 3 billion. The smarter approach is to keep ListFilesStopAtAnyFault from immediately failing on any single fault.

This can be accomplished by creating simple wrapper functions that add some basic retry logic over the original primitives so there is a new set of more robust primitives, as shown in Figure 6.

Production code would likely include an exponential back-off that delays each retry with an exponentially increasing time delay. This avoids the so-called “thundering herd” problem when many clients are simultaneously trying to recover from a network partition to a given server. For simplicity, this discussion will ignore it. Assuming a success rate of 0.999 for the underlying primitives, performing three simple retries makes the probability of each of these returning without a fault

or 0.999999999 (nine nines). Figure 7 shows how you can now write a new routine that uses these more reliable wrappers.

Now you can evaluate the reliability of the function ListFilesWithRetry, but instead of computing this with respect to primitive requests, you compute it with respect to the number of times each request wrapper is called (see Table 1).

Now that each wrapper has a nine-nines success rate, the overall success rate for this function, even when N = 10,000, is more than 0.9999 (a four-nines success rate results in fewer than 53 minutes of downtime a year for a continuously running system). There is still a nonzero chance this function will return Fault, so this has not solved the 2AP but has significantly increased the likelihood that the system will make progress. The insertion of retries, of course, increases the overall latency when there are errors, and, with a reasonable model of latency, the expected time for enumerating N files assuming a specific request failure rate can be computed. The latency impact of these changes is discussed later.

Astute readers should notice a fatal flaw in the code. The function will continue to enumerate under the presence of request failures, but the naïve addition of retries will cause files to be skipped when there are failures. Specifically, this wrapper function may cause files to be skipped:

The fundamental issue here is that the underlying primitive request MoveToNextFileName is not idempotent—one invocation of it is not observationally equivalent to multiple invocations of the function. Because of the 2AP, there is no way to have the server or client agree if the cursor has moved forward or not on a fault. The only way to resolve this issue is to make MoveToNextFileName idempotent.

There are a variety of techniques to do this. One way is to include sequence numbers to detect retries and have the server track these numbers. These sequence numbers now become important state that must be placed in the storage tier for every client in the system, and this can result in scalability issues. A more scalable approach is to use an opaque state token similar to how cookies are used in HTTP to offload state from the server to the client. The client can maintain the needed state rather than have the server track. This leaves the API shown in Figure 8, which includes only idempotent functions.

In this example the state token is simply an integer in a realistic system; it would more likely be a variable-length byte array that the system verifies is valid so that malicious clients cannot harm the system.

The retry wrappers GetStartTokenWithRetry, GetStartTokenWithRetry, and MoveToNextFileNameWithTokenAndRetry can also be defined as earlier.

We can adjust our function to uses wrapped primitives over the idempotent API (see Figure 9).

The analysis performed earlier on the slightly buggy version that lacked the idempotent primitives is still valid, but now the function works correctly and reliably. When N = 10,000, the success rate is 0.999979998 (four nines).

Analyzing and Improving Latency

Because the only practical way to detect message loss between senders and receivers is via timeouts, it is important to document a time bound on how long either party should wait for a request to be processed. If a request is processed after that time bound has been exceeded, then consider the request failed. Services typically define an upper bound for each API function they support (for example, 99.9% of all requests will be successfully processed within one second). If no upper time bound is specified, a guaranteed 99.9% success rate is somewhat meaningless; you may have to wait an infinite amount of time for any single request to complete successfully.

Determining a reasonable upper bound for a system can be quite complex. It can be estimated observationally by looking at the distribution of latencies across a large sample of requests and choosing the maximum, or by simply having the server include a watchdog timer with a clear upper bound that fails any request that exceeds the threshold. In either case, the worst-case bound is needed so that clients of the service can set timeouts appropriately, but these worst-case bounds typically are very conservative estimates of the actual average-case performance. An extended version of this article published in ACM Queue analyzes worst-case, average-case, and slow-case latency for the file-listing function.

Let’s assume that every primitive request to the server has a worst-case successful latency of R_max and average time of R_avg, with these parameters we can analytically estimate how our retry policy impacts latency. There worst-case is based on a pathological worst-case scenario where the maximum number of failures and retries occur. This analysis is overly pessimistic, and we can instead use our average-case analysis to derive a slow-case estimate where the distribution of retries is the same as our average-case analysis, but we assume every request is as slow as possible.

Some readers may feel this final implementation is too “chatty” and a more efficient protocol could reduce server round-trips. Indeed, three functions in the API can be replaced with one (see Figure 10).

This protocol has a fixed globally known start token and a single function that returns both the current file name and next token in one request. There should be an expected improvement in latency, which can be seen by performing a latency analysis of the modified protocol depicted in Table 2.

The details of this latency analysis are in the full article, and use very elementary techniques of probability theory to analytically determine reasonable time out values for callers of the function to list files.

Conclusion

This article is a far from exhaustive survey about the many interesting issues surrounding cloud computing. The goal is to demonstrate the breadth of deep problems still to be solved. Some of the trailblazers who developed the electronic computer would be dumbfounded by the computation we now carry in our pockets. They would be equally surprised at how robustly some of their earliest ideas have stood the test of time. Taken in historical context, the modern WEBVAC should not be seen as the culmination of 70 years of human progress, but just the start of a promising future that we cannot imagine.