Research and Advances
Computing Applications Research highlights

Technical Perspective: Getting Consensus For Data Replication

Posted
  1. Article
  2. References
  3. Author
  4. Footnotes
Read the related Research Paper

Data in a cloud computing system should be highly available. That is, whenever you connect to the system, the data you stored there should be ready to use. The standard mechanism to accomplish this—data replication—involves maintaining multiple copies of each user’s data.

By itself, replicating data does not solve the problem, because the copy of data that is available might be stale. To see why, consider a system that maintains three copies of file x: x1, x2, and x3. Suppose x1 and x2 are currently available and x3 is down. If a program updates x, the system writes the update to x1 and x2, but not to x3 because it is down. No problem, since the system still has two correct copies. Next, suppose x1 and x2 fail, and then x3 recovers. Now there is a problem. If a user reads x, the best the system can do is return the value of x3. But that copy is stale—it does not have the latest update.

An early solution to this problem is called majority consensus.2 To process a write operation on x, the system assigns a monotonically increasing timestamp to the write, and writes x‘s value and timestamp to a majority of the copies of x. To process a read operation on x, it reads a majority of the copies of x and returns one that has the largest timestamp. Since the intersection of any two majorities includes at least one copy, each read returns the result of the previous writes on the same data.

In our example, the read operation only reads one copy, which is not a majority of three. Therefore, it might not return the latest state of x. Using majority consensus, the system would have to wait until a second copy became available, so it could read two copies. Then, it could return the more up-to-date copy of the two that it reads, which must be the freshest.

The notion of majority was extended in Gifford1 to be a weighted majority, which is called a quorum. Each copy is given a weight. A read must access a read quorum of copies—a set of copies whose weight is at least R. And a write must update a write quorum of copies—a set whose weight is at least W. By requiring that R+W exceeds the total weight N of all copies, we are assured that each read reads a copy that was written by the last write.


Given R, W, and N, the authors of the following paper offer a formula to calculate the probability of reading data that was not written by one of the K most recent writes.


For example, suppose the system stores four copies of x, namely, x1x4. We might assign a weight of 1 to x1 and x2, but assign a weight of 2 to x3 and x4 if they are stored on more reliable servers. So N=6. If reads are more frequent than writes, we might set R=3 and W=4, so that reads have less work to do than writes. Since R+W>N, each read operation reads the result of the previous write on the same data.

It stands to reason that each write has to update all available copies of the data. But it is annoying that readers have to read multiple copies too, since it adds overhead and delay. It is especially annoying if writes usually execute quickly, since in this case each read is very likely to read the latest copy even if it does not read a quorum. If the probability of reading stale data is sufficiently small, then it might be satisfactory to allow readers to read less than a quorum of copies. In fact, some cloud applications do exactly that.

Until recently, this compromise was done by guesswork, without a bound on the staleness of the data that is read or the expected latency improvement from risking some staleness. The following paper is a breakthrough that removes the guesswork and replaces it by a principled analysis, called probabilistically bounded staleness. Given R, W, and N, the authors offer a formula to calculate the probability of reading data that was not written by one of the K most recent writes. To calculate the probability of reading data that was not written in the last delta time units, they use a Monte Carlo simulation based on time parameters that can be measured in a running system. Moreover, they made the analysis practical by implementing it in open source record managers, so that users can make their own judgment on trading off staleness for latency. What was formerly a guess is now a goal you can engineer for.

Back to Top

Back to Top

Back to Top

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

The Latest from CACM

Shape the Future of Computing

ACM encourages its members to take a direct hand in shaping the future of the association. There are more ways than ever to get involved.

Get Involved

Communications of the ACM (CACM) is now a fully Open Access publication.

By opening CACM to the world, we hope to increase engagement among the broader computer science community and encourage non-members to discover the rich resources ACM has to offer.

Learn More