Home/Magazine Archive/August 2014 (Vol. 57, No. 8)/Technical Perspective: Getting Consensus For Data.../Full Text

Research highlights
## Technical Perspective: Getting Consensus For Data Replication

Data in a cloud computing system should be highly available. That is, whenever you connect to the system, the data you stored there should be ready to use. The standard mechanism to accomplish this*data replication*involves maintaining multiple copies of each user's data.

By itself, replicating data does not solve the problem, because the copy of data that is available might be stale. To see why, consider a system that maintains three copies of file *x*: *x*_{1}, *x*_{2}, and *x*_{3}. Suppose *x*_{1} and *x*_{2} are currently available and *x*_{3} is down. If a program updates *x*, the system writes the update to *x*_{1} and *x*_{2}, but not to *x*_{3} because it is down. No problem, since the system still has two correct copies. Next, suppose *x*_{1} and *x*_{2} fail, and then *x*_{3} recovers. Now there is a problem. If a user reads *x*, the best the system can do is return the value of *x*_{3}. But that copy is staleit does not have the latest update.

An early solution to this problem is called *majority consensus*.^{2} To process a write operation on *x*, the system assigns a monotonically increasing timestamp to the write, and writes *x*'s value and timestamp to a majority of the copies of *x*. To process a read operation on *x*, it reads a majority of the copies of *x* and returns one that has the largest timestamp. Since the intersection of any two majorities includes at least one copy, each read returns the result of the previous writes on the same data.

In our example, the read operation only reads one copy, which is not a majority of three. Therefore, it might not return the latest state of *x*. Using majority consensus, the system would have to wait until a second copy became available, so it could read two copies. Then, it could return the more up-to-date copy of the two that it reads, which must be the freshest.

The notion of majority was extended in Gifford^{1} to be a weighted majority, which is called a *quorum*. Each copy is given a weight. A read must access a *read quorum* of copiesa set of copies whose weight is at least R. And a write must update a *write quorum* of copiesa set whose weight is at least W. By requiring that R+W exceeds the total weight N of all copies, we are assured that each read reads a copy that was written by the last write.

Given R, W, and N, the authors of the following paper offer a formula to calculate the probability of reading data that was not written by one of the K most recent writes.

For example, suppose the system stores four copies of *x*, namely, *x*_{1}-*x*_{4}. We might assign a weight of 1 to *x*_{1} and *x*_{2}, but assign a weight of 2 to *x*_{3} and *x*_{4} if they are stored on more reliable servers. So N=6. If reads are more frequent than writes, we might set R=3 and W=4, so that reads have less work to do than writes. Since R+W>N, each read operation reads the result of the previous write on the same data.

It stands to reason that each write has to update all available copies of the data. But it is annoying that readers have to read multiple copies too, since it adds overhead and delay. It is especially annoying if writes usually execute quickly, since in this case each read is very likely to read the latest copy even if it does not read a quorum. If the probability of reading stale data is sufficiently small, then it might be satisfactory to allow readers to read less than a quorum of copies. In fact, some cloud applications do exactly that.

Until recently, this compromise was done by guesswork, without a bound on the staleness of the data that is read or the expected latency improvement from risking some staleness. The following paper is a breakthrough that removes the guesswork and replaces it by a principled analysis, called probabilistically bounded staleness. Given R, W, and N, the authors offer a formula to calculate the probability of reading data that was not written by one of the K most recent writes. To calculate the probability of reading data that was not written in the last delta time units, they use a Monte Carlo simulation based on time parameters that can be measured in a running system. Moreover, they made the analysis practical by implementing it in open source record managers, so that users can make their own judgment on trading off staleness for latency. What was formerly a guess is now a goal you can engineer for.

1. Gifford, D.K. Weighted voting for replicated data. *SOSP* (1979), 150162.

2. Thomas, R.H. A majority consensus approach to concurrency control for multiple-copy databases. *ACM Trans. Database Syst. 4*, 2 (1979), 180209.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2014 ACM, Inc.

No entries found