Eventually Consistent: Not What You Were Expecting?

Methods of quantifying consistency (or lack thereof) in eventually consistent storage systems.

By Wojciech Golab, Muntasir R. Rahman, Alvin AuYoung, Kimberly Keeton, and Xiaozhou (Steve) Li

Posted Mar 1 2014

Storage systems continue to lay the foundation for modern Internet services such as Web search, e-commerce, and social networking. Pressures caused by rapidly growing user bases and datasets have driven system designs away from conventional centralized databases and toward more scalable distributed solutions, including simple NoSQL key-value storage systems, as well as more elaborate NewSQL databases that support transactions at scale.

Distributed key-value storage systems are among the simplest and most scalable specimens in the modern storage ecosystem. Such systems forgo many of the luxuries of conventional databases, including ACID (atomicity, consistency, isolation, durability) transactions, joins, and referential integrity constraints, but retain fundamental abstractions such as tables and indexes. As a result, application developers are sheltered from technicalities such as normalizing the relational schema, selecting the optimal transaction isolation level, and dealing with deadlocks.

Despite their simple interface and data model, distributed key-value storage systems are complex internally as they must replicate data on two or more servers to achieve higher read performance and greater availability in the face of node or network failures. Keeping these replicas synchronized requires a distributed replication protocol that can add substantial latency to storage operations, especially in geo-replicated systems. Specifically, write operations must update a subset of the replicas before the system acknowledges the completion of the write to the client, and reads may fetch data from a subset of the replicas, adopting the latest value observed (for example, one having the highest timestamp) as the response.
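The read and write paths described above can be sketched in a few lines. This is a minimal in-memory illustration, not the protocol of any particular system: the `Replica` type, the `write` and `read` helpers, and the integer timestamps are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Each hypothetical replica maps key -> (timestamp, value).
Replica = Dict[str, Tuple[int, str]]

def write(replicas: List[Replica], targets: List[int], key: str, ts: int, value: str) -> None:
    """Update only the chosen subset of replicas, keeping the newest timestamp."""
    for i in targets:
        current = replicas[i].get(key)
        if current is None or current[0] < ts:
            replicas[i][key] = (ts, value)

def read(replicas: List[Replica], targets: List[int], key: str) -> str:
    """Fetch from a subset and adopt the value with the highest timestamp."""
    responses = [replicas[i][key] for i in targets if key in replicas[i]]
    return max(responses)[1]  # tuples compare by timestamp first

replicas: List[Replica] = [{}, {}, {}]
write(replicas, [0, 1, 2], "X", ts=1, value="old")
write(replicas, [0, 1], "X", ts=2, value="new")  # replica 2 has not yet been updated
print(read(replicas, [1, 2], "X"))  # read set overlaps the write set: "new"
print(read(replicas, [2], "X"))     # read set misses the write set: "old"
```

The second read returns a stale value only because the subset it contacted had not yet received the update, which is the situation the rest of this article tries to quantify.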

As it turns out, choosing which subset of replicas to contact for a storage operation profoundly impacts the behavior of distributed storage systems. For strong consistency a client may read and write a majority of replicas. Since majorities overlap, this ensures each read “sees” the latest write. In contrast, eventually consistent systems may read and write non-overlapping subsets.²⁶,²⁸ Once a write is acknowledged, the new value is propagated to the remaining replicas; thus, all replicas are eventually updated unless a failure occurs. In the meantime, readers may observe stale values if they fetch data from replicas that have not yet received the update.
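The claim that majorities always overlap, while smaller subsets may not, can be checked by brute force for small replica counts. The helper below is a hypothetical illustration; its name and parameters (`n` replicas, read sets of size `r`, write sets of size `w`) are assumptions for the sketch.

```python
from itertools import combinations

def quorums_always_overlap(n: int, r: int, w: int) -> bool:
    """Return True if every read set of size r intersects every write set of size w."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )

# Majority reads and writes on three replicas always share a replica.
print(quorums_always_overlap(n=3, r=2, w=2))  # True
# Single-replica reads and writes can miss each other entirely.
print(quorums_always_overlap(n=3, r=1, w=1))  # False
```

Enumeration is exponential in `n` and is used here only to make the overlap property concrete.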

Although many applications benefit from strong consistency, latency-sensitive applications such as shopping carts in e-commerce websites choose eventual consistency to gain lower latency.¹ This can lead to consistency anomalies such as items disappearing from a shopping cart or being oversold. However, since a user can detect and correct the problem during checkout, such anomalies are tolerable provided they are short-lived and occur infrequently. The critical task for application developers and system administrators, therefore, is to understand how consistency and latency are impacted by various storage and application configuration parameters, as well as the workload that the application places on the storage system. Only with proper insight into this subtle issue can administrators and developers make sensible decisions regarding the configuration of the storage system or the choice of consistency model for a given application.

This article looks at methods of quantifying consistency (or lack thereof) in eventually consistent storage systems. These methods are necessary for meaningful comparisons among different system configurations and workloads. First, the article defines eventual consistency more precisely and relates it to other notions of weak consistency. It then drills down into metrics, focusing on the dimension of staleness, and surveys different techniques for predicting and measuring staleness. Finally, the relative merits of these techniques are evaluated, and any remaining open questions are identified.

Defining Eventual Consistency

Eventual consistency can be defined as either a property of the underlying storage system, or a behavior observed by a client application. For example, Doug Terry et al. give the following definition of eventual consistency in the context of the Bayou storage system: “All replicas eventually receive all writes (assuming sufficient network connectivity and reasonable reconciliation schedules), and any two replicas that have received the same set of writes have identical databases.”²⁶ This informal definition addresses both the propagation of writes to replicas (that is, the eventual) and convergence onto a well-defined state—for example, through timestamp-based reconciliation—(that is, the consistency).

This definition, while simple and elegant, does not say precisely what clients observe. Instead, it captures a range of behaviors encompassing both strongly consistent relational databases, and weakly consistent systems that may return stale or out-of-order data without bound. In contrast, Werner Vogels describes the consistency observed by clients in a concrete way: “The storage system guarantees that if no new updates are made to the object, eventually all accesses will return the last updated value.”²⁸ Vogels’s definition captures in an intuitive way the convergence of replicas to the last updated value, but only in the special case where updates are suspended while clients read an object—a scenario very different from the online environment in which many eventually consistent systems are deployed.

Since weak consistency is difficult to reason about, application developers often seek stronger properties such as a consistent prefix, monotonic reads, “read my writes,” or causal consistency.²⁵ Indeed, some systems support such stronger-than-eventual properties (for example, COPS [Clusters of Order-Preserving Servers²²] and Pileus²⁷), but the commercial success of systems such as Amazon’s Dynamo¹² proves that eventual consistency can be good enough in practice. To understand why, researchers have sought to describe the range of behaviors of eventually consistent systems in more precise terms than abstract definitions in the style of Terry et al. and Vogels. Existing approaches in this endeavor fall into three categories: relaxed consistency metrics, system modeling and prediction, and empirical measurement tools. The next three sections survey representative works in each category.

Relaxed Consistency Properties

Abstract definitions of eventual consistency leave open a number of questions regarding system behavior in the presence of concurrent accesses, as well as in a failure-prone environment. For example: How quickly do updates propagate to replicas in practice, and how often do replicas agree on the latest value of an object? When replicas do not agree, how does that affect the clients’ view of the data? What exactly do clients observe during an extended failure that partitions the network, or in the moments shortly after the network is healed?

An emerging approach for describing these behaviors is to relate them to a strict form of consistency called linearizability.¹⁸ Informally speaking, this property states that the storage system behaves as though it executes operations one at a time, in some serial order, despite actually executing some operations in parallel. As will be explained later, this serial order is further constrained in a manner that forbids stale reads. As a result, linearizability is a suitable gold standard against which eventually consistent systems can be judged. In other words, the observed behavior can be described in terms of deviations from this standard, which in this context can be regarded as consistency violations. For example, one can think of Amazon’s Dynamo as violating linearizability when reads return stale values, even though the system promises only eventual consistency.

Whereas linearizability is ubiquitous in centralized systems—for example, in multithreaded data structures—Brewer’s CAP principle⁹ states that such a strong consistency property (C) is unattainable in always-available (A) and partition-tolerant (P) distributed systems. (Database experts should read consistency in this context as transaction isolation.) As a result, one cannot expect any storage system to accept updates and yet remain consistent during a network partition. Although this does not preclude strong consistency during failure-free operation, even in that case the system may be configured to sacrifice consistency for better latency.¹ For example, in the Cassandra key-value store, clients can effect this trade-off by requesting various “consistency levels,” which control the number of replicas that respond to a read or write.²⁰

The side effect of weakening consistency is increased staleness, which so far has been discussed informally. More precisely, a value is considered fresh from the moment it is written into a storage system until it is overwritten, and stale thereafter. Thus, staleness describes the age of a value returned by a read relative to the last updated value, and hence quantifies how badly a system’s behavior deviates from the gold standard. Two interpretations of this concept have been discussed in the literature,²,¹⁵ arising from different ways to define “age”:

Version-based staleness defines age by counting versions of an object (for example, a read returns the kth-latest version).
Time-based staleness defines age in terms of wall-clock time (for example, a read returns a value t time units older than the last updated value).

The concept of staleness, although intuitive, is fraught with technical subtleties. In the simple case where operations are executed one at a time, there is a natural order in which clients expect these operations to take effect, and so stale reads can be identified easily. When operations are executed in parallel, however, their order can be very difficult to determine from the client’s perspective.

To make sense of eventual consistency, we turn to relaxed consistency properties—k-atomicity and Δ-atomicity—which give precise meaning to the notions of version-based and time-based staleness, respectively. Both properties are relaxed forms of linearizability,¹⁸ but for historical reasons they refer in name to Lamport’s atomicity property,²¹ which is similar in spirit.

Linearizability (The “Gold Standard”). Consider the trace of operations in Figure 1a, showing three operations applied to an object denoted by X. First, a write operation W(X,1) assigns 1 to object X, and then this value is updated to 2 by a second write operation, W(X,2). The third operation R(X) is a read of X and begins after both writes have ended. In this case, linearizability dictates that W(X,1) should appear to take effect before W(X,2), because W(X,1) ends before W(X,2) begins. Thus, 2 is the last updated value of X when R(X) is applied, but R(X) returns 1 instead, so the trace is not linearizable. The more complex case when operations overlap in time is addressed by the formal definition of linearizability,¹⁸ a discussion that is beyond the scope of this article.

K-Atomicity (Version-Based Staleness). The k-atomicity property was introduced by Amitanand Aiyer et al.² Like linearizability, it requires that operations appear to take effect in an order that is constrained by their start and finish times; within this order, however, a read may return any of the k last updated values. For example, the trace shown in Figure 1a is k-atomic for k = 2 but not k = 1.

Δ-Atomicity (Time-based Staleness). The Δ-atomicity property was proposed by Wojciech Golab et al.¹⁵ Similar to k-atomicity, it relaxes linearizability by allowing reads to return stale values. Staleness, however, is defined in terms of time: a read may return a value that is up to Δ time units stale. For example, the trace in Figure 1a is Δ-atomic if Δ is defined as the width of the gap between W(X,2) and R(X). Linearizability permits R(X) to take effect before W(X,2) if the two operations overlap in time, as shown in Figure 1b, whereas it requires R(X) to take effect after W(X,2) in Figure 1a. If R(X) is hypothetically “stretched” to the left by Δ time units (that is, if R(X) had started Δ time units earlier), as in Figure 1b, then the trace becomes linearizable. Thus, the response of R(X) is considered only Δ time units stale in Figure 1a.
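For the simple situation in Figure 1a, both metrics can be read directly off the trace. The sketch below handles only that restricted case: it assumes writes assign unique values and that no operation overlaps another in time, so it is not the general algorithm from the cited work. The `Op` record and the timestamps are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class Op:
    kind: str     # "W" for a write, "R" for a read
    value: int    # value written, or value returned by the read
    start: float
    finish: float

def staleness_of_read(writes: List[Op], read: Op) -> Tuple[int, float]:
    """Return (k, delta) for one read, assuming non-overlapping operations and unique values."""
    done = sorted((w for w in writes if w.finish < read.start), key=lambda w: w.finish)
    pos = [w.value for w in done].index(read.value)
    k = len(done) - pos              # k = 1 means the read returned the latest version
    if k == 1:
        return 1, 0.0
    overwriter = done[pos + 1]       # first write that replaced the returned value
    delta = read.start - overwriter.finish  # gap between that write and the read
    return k, delta

# A trace in the spirit of Figure 1a (the times are illustrative assumptions).
w1 = Op("W", 1, start=0.0, finish=1.0)
w2 = Op("W", 2, start=2.0, finish=3.0)
r = Op("R", 1, start=5.0, finish=6.0)
print(staleness_of_read([w1, w2], r))  # (2, 2.0)
```

Here the read returns the second-latest version, so the trace is 2-atomic, and starting the read 2.0 time units earlier would close the gap to W(X,2), matching the informal description of Δ above.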

Prediction

Another method used to characterize eventual consistency is based on a combination of system modeling and prediction. Peter Bailis et al.⁶ present the PBS (Probabilistically Bounded Staleness) framework, in which a white-box system model is proposed to predict data staleness observed by client applications. The PBS model estimates the probability of <k,t> staleness—the condition that the read, which begins t time units after the end of the write, returns the value assigned by one of the last k writes. This condition is similar to k-atomicity but considers only a single read at a fixed distance t from the write. When k = 1, it captures the probability that the read returns the latest value (that is, is not stale).

The PBS model makes two simplifying assumptions. First, like Vogels’s definition of eventual consistency,²⁸ PBS does not consider workloads where writes overlap in time with reads, in which the width t of the gap between writes and reads is not well defined. Second, PBS does not model failures explicitly. Although storage node failures can be simulated using longer latencies, PBS does not account for network partitions. Follow-up work¹¹ extends the PBS model by considering node failures.
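To give a feel for this kind of prediction, the following Monte Carlo sketch estimates the probability that a read issued t time units after a write is acknowledged returns the latest value (the k = 1 case). It is a toy model and not the PBS model itself: it assumes independent, exponentially distributed propagation delays, instantaneous reads, no failures, and uniformly random choice of read replicas. The function name and all parameters are hypothetical.

```python
import random

def prob_fresh_read(n: int, r: int, w: int, t: float,
                    mean_delay: float = 1.0, trials: int = 20000, seed: int = 0) -> float:
    """Estimate P(read returns the latest write) under a toy propagation model."""
    rng = random.Random(seed)
    fresh = 0
    for _ in range(trials):
        # Time at which each replica applies the write (assumed exponential).
        arrival = [rng.expovariate(1.0 / mean_delay) for _ in range(n)]
        ack_time = sorted(arrival)[w - 1]      # write acknowledged after w replicas apply it
        read_time = ack_time + t               # read begins t time units after the ack
        contacted = rng.sample(range(n), r)    # read contacts r replicas at random
        if any(arrival[i] <= read_time for i in contacted):
            fresh += 1
    return fresh / trials

print(prob_fresh_read(n=3, r=2, w=2, t=0.0))  # overlapping subsets: always fresh, 1.0
print(prob_fresh_read(n=3, r=1, w=1, t=0.0))  # non-overlapping subsets, read right away
print(prob_fresh_read(n=3, r=1, w=1, t=5.0))  # same subsets, read after a longer gap
```

In this toy model the estimate rises toward 1 as t grows, which mirrors the intuition that staleness becomes less likely the longer a read waits after the write.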

Empirical Measurement

Thus far, this article has addressed techniques for defining and predicting staleness in eventually consistent systems. This section discusses approaches that measure consistency empirically from the perspective of both the system and the client.

Measuring eventual consistency “in the wild” is as difficult as defining it precisely. Concurrent operations make it difficult to identify the order in which operations take effect, and hence to classify reads as stale or not. To make matters worse, in the event of a network partition, clients on opposite sides of the divide may observe the last updated value differently, even if no new updates are made to an object after some point in time. Such an anomaly is possible because in an always-available system each partition will continue to accept writes.

Despite these challenges, a number of techniques have been devised for measuring eventual consistency, particularly data staleness. This article looks at two fundamentally different methodologies: active measurement, in which the storage system is exercised in an artificial way to determine the time lag from when a new value is written to the storage system until this value becomes visible to clients; and passive analysis, in which a trace of operations is recorded for an arbitrary workload and analyzed mathematically to obtain a measurement of staleness.

Active measurement underlies early studies of consistency in cloud storage systems. In this category of techniques, one client writes a new value to a key, and a different client then reads the same key repeatedly until the new value is returned. As in Vogels’s definition of eventual consistency,²⁸ writes do not overlap in time with reads in this scenario. The time from the write to the last read that returns the old value—or alternatively, the first read of the new value—answers the question, “How eventual?” and can be regarded as an estimate of the convergence time of the replication protocol—the time needed to propagate a new value to all the replicas of an object.

At a technical level, the main challenge in active measurement is to determine the difference in time between operations executed at different client nodes—namely, a write and a read. Figure 2 illustrates how this difference can be computed using a collection of clients. To capture precisely the moment when the last replica receives the updated value, Hiroshi Wada et al. apply reads 50 times per second using one or more clients.²⁹ Bermbach et al. go one step further and use a collection of geographically distributed readers to ensure the reads hit all possible replicas.⁸ Staleness measurement in YCSB++ (Yahoo! Cloud-serving Benchmark)²³ follows a similar approach but uses a ZooKeeper producer-consumer queue¹⁹ to synchronize writers and readers. This approach circumvents issues related to clock skew but introduces additional latencies caused by queue operations, which may limit precision.
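The write-then-poll pattern can be sketched against a simulated store. Everything here is a hypothetical stand-in: `SimulatedStore` makes a write visible after a fixed lag, and a single virtual clock replaces the synchronized clocks that real writer and reader nodes would need.

```python
class SimulatedStore:
    """Hypothetical store in which a write becomes visible only after `lag` seconds."""

    def __init__(self, lag: float):
        self.lag = lag
        self.old = None
        self.new = None
        self.visible_at = 0.0

    def write(self, value, now: float) -> None:
        self.old, self.new = self.new, value
        self.visible_at = now + self.lag

    def read(self, now: float):
        return self.new if now >= self.visible_at else self.old

def measure_convergence(store: SimulatedStore, new_value, poll_hz: float = 50.0,
                        timeout: float = 10.0):
    """Write once at time 0, then poll until the new value is first observed."""
    store.write(new_value, now=0.0)
    step = 1.0 / poll_hz
    polls = 0
    while polls * step <= timeout:
        now = polls * step               # virtual clock, no real sleeping
        if store.read(now) == new_value:
            return now                   # time of the first read of the new value
        polls += 1
    return None                          # did not converge within the timeout

store = SimulatedStore(lag=0.25)
print(measure_convergence(store, "v2"))
```

The result can overshoot the true lag by up to one polling interval, which is why the polling rate bounds the precision of this kind of measurement.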

Active measurement can be used to discover the range of update propagation times in an eventually consistent system. For example, in experiments involving Amazon’s SimpleDB,⁴ Wada et al. report that convergence occurred in at most one second in more than 90% of the runs, but took more than four seconds in a few (less than 1%) cases. On the other hand, active measurement does not indicate what proportion of reads in a real workload will return stale values, as this quantity depends on how the data objects are accessed. For example, the proportion of stale reads would be expected to vary with the rate at which operations are applied on a given object—close to zero when operations are applied infrequently and the gaps between them exceed the convergence time, and larger when reads follow writes more closely. Active measurement does not separate these two cases, as it is based upon a controlled workload designed to measure convergence time.

Passive analysis. In the earlier section on relaxed consistency properties, the discussion focused on ways of defining staleness in a precise and meaningful way and provided a hint of how a consistency metric can be derived from such definitions. It begins with a “black-or-white” consistency property, such as linearizability, which is either satisfied or not satisfied by a system or an execution trace; next, this property is relaxed by introducing a parameter that bounds the staleness of reads (for example, Δ-atomicity); the last step is to determine the parameter value that most accurately describes the system’s behavior (for example, Δ-atomic for Δ ≥ 10 ms). Passive analysis refers to this final step and entails examining the operations recorded in an execution trace to determine the order in which they appear to take effect.

The most immediate technical challenge in passive analysis is the collection of the trace, which records for each operation its type, start, and finish times, as well as its arguments and response (for example, a read of object X, starting at time t₁ and ending at t₂, returning the value 5). This trace can be obtained at clients, as shown in Figure 2.

Because the trace is obtained by merging data from multiple clients or storage nodes over a finite period of time, it is prone to two types of anomalies: dangling reads, which return values that lack a corresponding write (that is, a write that assigns the value returned by the read); and reversed operations, whereby a read appears before its corresponding write in the trace. These anomalies must be removed if they occur; otherwise, the k-atomicity and Δ-atomicity properties are undefined and cannot be used for computing staleness.

Dangling reads indicate missing information, such as when the first access to an object in a trace is a read and the corresponding write occurs before the start of the trace. In this case the dangling read is identified easily and can be dropped from the trace with no ill effects. Reversed operations are equally easy to detect, but more difficult to remedy. Clock synchronization techniques such as atomic clocks and GPS¹⁰ can eliminate reversed operations altogether. Even ordinary Network Time Protocol (NTP) is sufficient if the synchronization of clocks is tight enough. Alternatively, one can also estimate clock skew directly from reversed operations and adjust the trace accordingly.
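A trace-cleaning pass along these lines might look as follows. This is a sketch under stated assumptions: writes assign unique values, the `Op` record is hypothetical, and the skew figure is only a crude lower bound derived from the reversed pairs, not a full clock-correction procedure.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class Op:
    kind: str     # "W" or "R"
    value: str    # value written or returned (assumed unique per write)
    start: float
    finish: float

def clean_trace(trace: List[Op]) -> Tuple[List[Op], List[Tuple[Op, Op]]]:
    """Drop dangling reads and set aside reversed (read, write) pairs."""
    write_by_value = {op.value: op for op in trace if op.kind == "W"}
    kept, reversed_pairs = [], []
    for op in trace:
        if op.kind == "R":
            w = write_by_value.get(op.value)
            if w is None:
                continue                        # dangling read: no corresponding write
            if op.finish < w.start:
                reversed_pairs.append((op, w))  # read appears before its write
                continue
        kept.append(op)
    return kept, reversed_pairs

def min_skew(reversed_pairs: List[Tuple[Op, Op]]) -> float:
    """Smallest clock shift that would stop every reversed read from preceding its write."""
    return max((w.start - r.finish for r, w in reversed_pairs), default=0.0)

trace = [
    Op("W", "a", 0.0, 1.0),
    Op("R", "z", 1.0, 2.0),   # dangling: "z" is never written in this trace
    Op("R", "a", 2.0, 3.0),
    Op("R", "b", 4.0, 5.0),   # reversed: finishes before W("b") starts
    Op("W", "b", 6.0, 7.0),
]
kept, reversed_pairs = clean_trace(trace)
print(len(kept), len(reversed_pairs), min_skew(reversed_pairs))  # 3 1 1.0
```

Reversed pairs are returned rather than silently discarded so that they can inform a skew adjustment before the trace is analyzed.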

Given a trace free from dangling reads and reversed operations, the next challenge is to compute staleness metrics. Efficient algorithms are critical in this context because the trace may be very long, which means that both the space and computation required are high. In fact, the problem of testing whether a trace is linearizable was shown by Gibbons and Korach to be intractable.¹³ Testing whether a trace is k-atomic or Δ-atomic for fixed k and Δ is at least as hard since these properties are equivalent to linearizability when k = 1 and Δ = 0. Computing k and Δ from a trace is harder still, and hence also intractable.

Fortunately, the intractability result breaks in the special case when every write on a given object assigns a unique value. This condition is straightforward to enforce (for example, by embedding a unique token in each value written, such as a node ID and timestamp), which opens the door to efficient algorithms for computing staleness metrics from traces in practice. For Δ-atomicity, such an algorithm is known,¹⁵ and has been used to analyze time-based staleness in traces obtained using Cassandra.²⁴ For k-atomicity, an efficient algorithm is known only for deciding whether a trace is 2-atomic.¹⁴
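One way to enforce unique values is to wrap each payload with a token before writing it. The sketch below is an assumption-laden illustration: the class name, the `:` separator, and the extra per-writer counter (added here to guard against two writes sharing a timestamp) are not taken from the cited work.

```python
import itertools
import time

class UniqueValueWriter:
    """Hypothetical helper that tags each written value with node ID, timestamp, and counter."""

    def __init__(self, node_id: str):
        self.node_id = node_id           # assumed not to contain ":"
        self._counter = itertools.count()

    def tag(self, payload: str) -> str:
        return f"{self.node_id}:{time.time_ns()}:{next(self._counter)}:{payload}"

def untag(tagged: str) -> str:
    """Recover the application payload from a tagged value."""
    _node_id, _timestamp, _seq, payload = tagged.split(":", 3)
    return payload

writer = UniqueValueWriter("node7")
first, second = writer.tag("cart=3"), writer.tag("cart=3")
print(first != second, untag(first))  # True cart=3
```

Because identical payloads now map to distinct stored values, each read in a trace can be matched to exactly one write.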

Passive analysis in general is not limited to staleness metrics, as illustrated by the technique of cycle analysis,⁵,³⁰ which detects linearizability violations by analyzing a conflict graph representing the execution trace. The absence of cycles in such a graph indicates the trace is linearizable, and so the number and length of such cycles can be defined as metrics for inconsistency.⁵,³⁰ The relationship of these metrics to staleness is not known precisely.
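The idea of detecting violations through cycles can be illustrated with a simplified graph. This sketch is not the exact construction from the cited works: it assumes unique written values and a trace already free of dangling reads and reversed operations, groups each write with the reads that return its value, and adds an edge from group a to group b whenever some operation in a finishes before some operation in b starts. The tuple layout and function name are assumptions.

```python
from typing import Dict, List, Tuple

# Each operation is (kind, value, start, finish); kind is "W" or "R".
Operation = Tuple[str, int, float, float]

def has_cycle(trace: List[Operation]) -> bool:
    """Return True if the simplified precedence graph over value groups has a cycle."""
    min_finish: Dict[int, float] = {}
    max_start: Dict[int, float] = {}
    for _kind, value, start, finish in trace:
        min_finish[value] = min(finish, min_finish.get(value, finish))
        max_start[value] = max(start, max_start.get(value, start))
    values = list(min_finish)
    # Edge a -> b: some operation on a ends before some operation on b begins.
    edges = {a: [b for b in values if a != b and min_finish[a] < max_start[b]]
             for a in values}

    WHITE, GREY, BLACK = 0, 1, 2
    color = dict.fromkeys(values, WHITE)

    def visit(v: int) -> bool:
        color[v] = GREY
        for nxt in edges[v]:
            if color[nxt] == GREY or (color[nxt] == WHITE and visit(nxt)):
                return True
        color[v] = BLACK
        return False

    return any(color[v] == WHITE and visit(v) for v in values)

# In the spirit of Figure 1a: the read returns 1 after W(X,2) has finished.
stale = [("W", 1, 0.0, 1.0), ("W", 2, 2.0, 3.0), ("R", 1, 5.0, 6.0)]
# In the spirit of Figure 1b: the read overlaps W(X,2) in time.
overlapping = [("W", 1, 0.0, 1.0), ("W", 2, 2.0, 4.0), ("R", 1, 3.0, 6.0)]
print(has_cycle(stale), has_cycle(overlapping))  # True False
```

In the stale trace the group for value 1 must come both before and after the group for value 2, which produces the cycle; in the overlapping trace only one ordering is forced.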

Comparison and Discussion

Prediction, active measurement, and passive analysis are all useful ways to study eventual consistency. Although each method has specific strengths and weaknesses, it is difficult to compare them directly, as they meet different goals. Passive analysis is client-centric in that it reflects the manner in which the client application interacts with the system. As a result, passive analysis can be used to compare staleness observed in two workloads applied to the same storage system. Active measurement, on the other hand, is system-centric in that it measures the convergence time of a system’s replication protocol for a controlled workload. Thus, active measurement is best suited for comparing different systems in terms of staleness, or the same system under different software or hardware configurations. The PBS framework⁶ is also designed around a controlled workload, and for that reason it is classified as system-centric.

The techniques discussed in this article also vary in terms of conceptual models of eventual consistency. Active measurement and PBS are based upon a simplified model, similar to Vogels’s definition,²⁸ in which writes occur before reads. In contrast, passive analysis considers a general model in which writes may overlap in time with reads and with other writes. As explained in another article by Golab et al.,¹⁶ storage systems may behave differently in these two models when clients follow the “R + W > N” rule,²⁸ which ensures read and write operations access overlapping subsets of replicas. The latter condition has long been considered sufficient for “strong consistency,”⁶,²⁸ and in fact guarantees linearizability in the simplified model but permits linearizability violations in the general model.

Despite the somewhat high cost of analyzing a detailed trace of operations, passive analysis plays an important role in understanding eventual consistency. Whereas any perspective on consistency is ultimately affected by the state of data in replicas in a storage system, which can be thought of as the “ground truth,” passive analysis is most closely tied to the actual consistency observed by client applications. Specifically, it reflects data-level anomalies only when they manifest themselves to clients—for example, as stale reads.

Future Work

For modern online service providers, small decreases in latency are known to have a measurable effect on user experience and, as a consequence, revenue.¹⁷ As weak consistency continues to gain traction as the means for reducing latency, the ability to reason clearly about this trade-off will be an important competitive advantage for service providers.

Consistency-aware pricing schemes and guarantees are already being adopted by industry. Amazon’s DynamoDB supports differentiated pricing for consistent reads and eventually consistent reads, with the former guarantee costing the application developer twice as much.³ More recently, Terry et al. describe a system that allows client applications to select different grades of weak consistency, such as “at most 200-ms latency and at most 5-minute staleness,” through novel SLAs (service-level agreements).²⁷ Other read guarantees, such as monotonic reads or causal consistency, can be requested, and a “wish list” of different combinations can be specified with different utilities. This opens the door to even more fine-grained pricing schemes, making it more important than ever for users to understand the utility of different points in consistency-related trade-offs.

Whereas the understanding of eventual consistency is expected to improve with additional empirical studies involving measurement, fundamental technical problems also remain to be solved. In particular, the problem of consistency verification remains an open and important challenge. Storage system designers need verification techniques to test whether their implementation correctly fulfills application-specified SLAs, and, likewise, application developers could use such tools to verify the service quality actually received.

One of the open problems related to verification algorithms is to determine the computational complexity of deciding k-atomicity. Existing algorithms handle only the special case k < 3,¹⁴ which limits their use in practice. Similar technical ideas can be applied when k ≥ 3 but only for a restricted class of traces, and it is not known whether an efficient algorithm exists for deciding k-atomicity in the general case.

Conclusion

Eventual consistency is increasingly viewed as a spectrum of behaviors that can be quantified along various dimensions, rather than a binary property that a storage system either satisfies or fails to satisfy. Advances in characterizing and verifying these behaviors will enable service providers to offer an increasingly rich set of service levels of differentiated performance, ultimately improving the end user’s experience.

Acknowledgments

We are grateful to Jay J. Wylie, Indranil Gupta, and Doug Terry for their helpful feedback.

Figures

Figure 1. Example trace of operations illustrating linearizability, k-atomicity, and Δ-atomicity.

Figure 2. Infrastructure for consistency measurement: active measurement (top), and passive analysis (bottom).

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from permissions@acm.org or fax (212) 869-0481.

DOI: 10.1145/2576794

March 2014 Issue, Vol. 57 No. 3, Pages 38-44

Published: March 1, 2014
