Storage systems continue to lay the foundation for modern Internet services such as Web search, e-commerce, and social networking. Pressures caused by rapidly growing user bases and datasets have driven system designs away from conventional centralized databases and toward more scalable distributed solutions, including simple NoSQL key-value storage systems, as well as more elaborate NewSQL databases that support transactions at scale.
Distributed key-value storage systems are among the simplest and most scalable specimens in the modern storage ecosystem. Such systems forgo many of the luxuries of conventional databases, including ACID (atomicity, consistency, isolation, durability) transactions, joins, and referential integrity constraints, but retain fundamental abstractions such as tables and indexes. As a result, application developers are sheltered from technicalities such as normalizing the relational schema, selecting the optimal transaction isolation level, and dealing with deadlocks.
Despite their simple interface and data model, distributed key-value storage systems are complex internally as they must replicate data on two or more servers to achieve higher read performance and greater availability in the face of node or network failures. Keeping these replicas synchronized requires a distributed replication protocol that can add substantial latency to storage operations, especially in geo-replicated systems. Specifically, write operations must update a subset of the replicas before the system acknowledges the completion of the write to the client, and reads may fetch data from a subset of the replicas, adopting the latest value observed (for example, one having the highest timestamp) as the response.
As it turns out, choosing which subset of replicas to contact for a storage operation profoundly impacts the behavior of distributed storage systems. For strong consistency a client may read and write a majority of replicas. Since majorities overlap, this ensures each read "sees" the latest write. In contrast, eventually consistent systems may read and write non-overlapping subsets.26,28 Once a write is acknowledged, the new value is propagated to the remaining replicas; thus, all replicas are eventually updated unless a failure occurs. In the meantime, readers may observe stale values if they fetch data from replicas that have not yet received the update.
Although many applications benefit from strong consistency, latency-sensitive applications such as shopping carts in e-commerce websites choose eventual consistency to gain lower latency.1 This can lead to consistency anomalies such as items lost from a shopping cart or oversold. However, since a user can detect and correct the problem during checkout, such anomalies are tolerable provided they are short-lived and occur infrequently. The critical task for application developers and system administrators, therefore, is to understand how consistency and latency are impacted by various storage and application configuration parameters, as well as the workload that the application places on the storage system. Only with proper insight into this subtle issue can administrators and developers make sensible decisions regarding the configuration of the storage system or the choice of consistency model for a given application.
This article looks at methods of quantifying consistency (or lack thereof) in eventually consistent storage systems. These methods are necessary for meaningful comparisons among different system configurations and workloads. First, the article defines eventual consistency more precisely and relates it to other notions of weak consistency. It then drills down into metrics, focusing on the dimension of staleness, and surveys different techniques for predicting and measuring staleness. Finally, the relative merits of these techniques are evaluated, and any remaining open questions are identified.
Eventual consistency can be defined as either a property of the underlying storage system, or a behavior observed by a client application. For example, Doug Terry et al. give the following definition of eventual consistency in the context of the Bayou storage system: "All replicas eventually receive all writes (assuming sufficient network connectivity and reasonable reconciliation schedules), and any two replicas that have received the same set of writes have identical databases."26 This informal definition addresses both the propagation of writes to replicas (that is, the eventual) and convergence onto a well-defined statefor example, through timestamp-based reconciliation(that is, the consistency).
This definition, while simple and elegant, does not say precisely what clients observe. Instead, it captures a range of behaviors encompassing both strongly consistent relational databases, and weakly consistent systems that may return stale or out-of-order data without bound. In contrast, Werner Vogels describes the consistency observed by clients in a concrete way: "The storage system guarantees that if no new updates are made to the object, eventually all accesses will return the last updated value."28 Vogels's definition captures in an intuitive way the convergence of replicas to the last updated value, but only in the special case where updates are suspended while clients read an objecta scenario very different from the online environment in which many eventually consistent systems are deployed.
Eventual consistency can be defined as either a property of the underlying storage system, or a behavior observed by a client application.
Since weak consistency is difficult to reason about, application developers often seek stronger properties such as a consistent prefix, monotonic reads, "read my writes," or causal consistency.25 Indeed, some systems support such stronger-than-eventual properties (for example, COPS [Clusters of Order-Preserving Servers22] and Pileus27), but the commercial success of systems such as Amazon's Dynamo12 proves that eventual consistency can be good enough in practice. To understand why, researchers have sought to describe the range of behaviors of eventually consistent systems in more precise terms than abstract definitions in the style of Terry et al. and Vogels. Existing approaches in this endeavor fall into three categories: relaxed consistency metrics, system modeling and prediction, and empirical measurement tools. The next three sections survey representative works in each category.
Abstract definitions of eventual consistency leave open a number of questions regarding system behavior in the presence of concurrent accesses, as well as in a failure-prone environment. For example: How quickly do updates propagate to replicas in practice, and how often do replicas agree on the latest value of an object? When replicas do not agree, how does that affect the clients' view of the data? What exactly do clients observe during an extended failure that partitions the network, or in the moments shortly after the network is healed?
An emerging approach for describing these behaviors is to relate them to a strict form of consistency called linearizability.18 Informally speaking, this property states that the storage system behaves as though it executes operations one at a time, in some serial order, despite actually executing some operations in parallel. As will be explained later, this serial order is further constrained in a manner that forbids stale reads. As a result, linearizability is a suitable gold standard against which eventually consistent systems can be judged. In other words, the observed behavior can be described in terms of deviations from this standard, which in this context can be regarded as consistency violations. For example, one can think of Amazon's Dynamo as violating linearizability when reads return stale values, even though the system promises only eventual consistency.
Whereas linearizability is ubiquitous in centralized systemsfor example, in multithreaded data structuresBrewer's CAP principle9 states that such a strong consistency property (C) is unattainable in always-available (A) and partition-tolerant (P) distributed systems. (Database experts should read consistency in this context as transaction isolation.) As a result, one cannot expect any storage system to accept updates and yet remain consistent during a network partition. Although this does not preclude strong consistency during failure-free operation, even in that case the system may be configured to sacrifice consistency for better latency.1 For example, in the Cassandra key-value store, clients can effect this trade-off by requesting various "consistency levels," which control the number of replicas that respond to a read or write.20
The side effect of weakening consistency is increased staleness, which so far has been discussed informally. More precisely, a value is considered fresh from the moment it is written into a storage system until it is overwritten, and stale thereafter. Thus, staleness describes the age of a value returned by a read relative to the last updated value, and hence quantifies how badly a system's behavior deviates from the gold standard. Two interpretations of this concept have been discussed in the literature,2,15 arising from different ways to define "age":
The concept of staleness, although intuitive, is fraught with technical subtleties. In the simple case where operations are executed one at a time, there is a natural order in which clients expect these operations to take effect, and so stale reads can be identified easily. When operations are executed in parallel, however, their order can be very difficult to determine from the client's perspective.
To make sense of eventual consistency, we turn to relaxed consistency propertiesk-atomicity and -atomicitywhich give precise meaning to the notions of version-based and time-based staleness, respectively. Both properties are relaxed forms of linearizability,18 but for historical reasons they refer in name to Lamport's atomicity property,21 which is similar in spirit.
Linearizability (The "Gold Standard"). Consider the trace of operations in Figure 1a, showing three operations applied to an object denoted by X. First, a write operation W(X,1) assigns 1 to object X, and then this value is updated to 2 by a second write operation, W(X,2). The third operation R(X) is a read of X and begins after both writes have ended. In this case, linearizability dictates that W(X,1) should appear to take effect before W(X,2), because W(X,1) ends before W(X,2) begins. Thus, 2 is the last updated value of X when R(X) is applied, but R(X) returns 1 instead, so the trace is not linearizable. The more complex case when operations overlap in time is addressed by the formal definition of linearizability,18 a discussion that is beyond the scope of this article.
K-Atomicity (Version-Based Staleness). The k-atomicity property was introduced by Amitanand Aiyer et al.2 Like linearizability, it requires that operations appear to take effect in an order that is constrained by their start and finish times; within this order, however, a read may return any of the k last updated values. For example, the trace shown in Figure 1a is k-atomic for k = 2 but not k = 1.
-Atomicity (Time-based Staleness). The -atomicity property was proposed by Wojciech Golab et al.15 Similar to k-atomicity, it relaxes linearizability by allowing reads to return stale values. Staleness, however, is defined in terms of time: a read may return a value that is up to time units stale. For example, the trace in Figure 1a is -atomic if is defined as the width of the gap between W(X,2) and R(X). Linearizability permits R(X) to take effect before W(X,2) if the two operations overlap in time, as shown in Figure 1b, whereas it requires R(X) to take effect after W(X,2) in Figure 1a. If R(X) is hypothetically "stretched" to the left by time units (that is, if R(X) had started time units earlier), as in Figure 1b, then the trace becomes linearizable. Thus, the response of R(X) is considered only time units stale in Figure 1a.
Another method used to characterize eventual consistency is based on a combination of system modeling and prediction. Peter Bailis et al.6 present the PBS (Probabilistically Bounded Staleness) framework, in which a white-box system model is proposed to predict data staleness observed by client applications. The PBS model estimates the probability of <k,t> stalenessthe condition that the read, which begins t time units after the end of the write, returns the value assigned by one of the last k writes. This condition is similar to k-atomicity but considers only a single read at a fixed distance t from the write. When k = 1, it captures the probability that the read returns the latest value (that is, is not stale).
The PBS model makes two simplifying assumptions. First, like Vogels's definition of eventual consistency,28 PBS does not consider workloads where writes overlap in time with reads, in which the width t of the gap between writes and reads is not well defined. Second, PBS does not model failures explicitly. Although storage node failures can be simulated using longer latencies, PBS does not account for network partitions. Follow-up work11 extends the PBS model by considering node failures.
Thus far, this article has addressed techniques for quantifying staleness in eventually consistent systems. This section discusses approaches that measure consistency empirically from the perspective of both the system and the client.
Measuring eventual consistency "in the wild" is as difficult as defining it precisely. Concurrent operations make it difficult to identify the order in which operations take effect, and hence to classify reads as stale or not. To make matters worse, in the event of a network partition, clients on opposite sides of the divide may observe the last updated value differently, even if no new updates are made to an object after some point in time. Such an anomaly is possible because in an always-available system each partition will continue to accept writes.
Despite these challenges, a number of techniques have been devised for measuring eventual consistency, particularly data staleness. This article looks at two fundamentally different methodologies: active measurement, in which the storage system is exercised in an artificial way to determine the time lag from when a new value is written to the storage system until this value becomes visible to clients; and passive analysis, in which a trace of operations is recorded for an arbitrary workload and analyzed mathematically to obtain a measurement of staleness.
Active measurement underlies early studies of consistency in cloud storage systems. In this category of techniques, one client writes a new value to a key, and a different client then reads the same key repeatedly until the new value is returned. As in Vogels's definition of eventual consistency,28 writes do not overlap in time with reads in this scenario. The time from the write to the last read that returns the old valueor alternatively, the first read of the new valueanswers the question, "How eventual?" and can be regarded as an estimate of the convergence time of the replication protocolthe time needed to propagate a new value to all the replicas of an object.
At a technical level, the main challenge in active measurement is to determine the difference in time between operations executed at different client nodesnamely, a write and a read. Figure 2 illustrates how this difference can be computed using a collection of clients. To capture precisely the moment when the last replica receives the updated value, Hiroshi Wada et al. apply reads 50 times per second using one or more clients.29 Bermbach et al. go one step further and use a collection of geographically distributed readers to ensure the reads hit all possible replicas.8 Staleness measurement in YCSB++ (Yahoo! Cloud-serving Benchmark)23 follows a similar approach but uses a ZooKeeper producer-consumer queue19 to synchronize writers and readers. This approach circumvents issues related to clock skew but introduces additional latencies caused by queue operations, which may limit precision.
Active measurement can be used to discover the range of update propagation times in an eventually consistent system. For example, in experiments involving Amazon's SimpleDB,4 Wada et al. report that convergence occurred in at most one second in more than 90% of the runs, but took more than four seconds in a few (less than 1%) cases. On the other hand, active measurement does not indicate what proportion of reads in a real workload will return stale values, as this quantity depends on how the data objects are accessed. For example, the proportion of stale reads would be expected to vary with the rate at which operations are applied on a given objectclose to zero when operations are applied infrequently and the gaps between them exceed the convergence time, and larger when reads follow writes more closely. Active measurement does not separate these two cases, as it is based upon a controlled workload designed to measure convergence time.
Passive analysis. In the earlier section on relaxed consistency properties, the discussion focused on ways of defining staleness in a precise and meaningful way and provided a hint of how a consistency metric can be derived from such definitions. It begins with a "black-or-white" consistency property, such as linearizability, which is either satisfied or not satisfied by a system or an execution trace; next, this property is relaxed by introducing a parameter that bounds the staleness of reads (for example, -atomicity); the last step is to determine the parameter value that most accurately describes the system's behavior (for example, -atomic for 10 ms). Passive analysis refers to this final step and entails examining the operations recorded in an execution trace to determine the order in which they appear to take effect.
The most immediate technical challenge in passive analysis is the collection of the trace, which records for each operation its type, start, and finish times, as well as its arguments and response (for example, a read of object X, starting at time t1 and ending at t2, returning the value 5). This trace can be obtained at clients, as shown in Figure 2.
Because the trace is obtained by merging data from multiple clients or storage nodes over a finite period of time, it is prone to two types of anomalies: dangling reads, which return values that lack a corresponding write (that is, a write that assigns the value returned by the read); and reversed operations, whereby a read appears before its corresponding write in the trace. These anomalies must be removed if they occur; otherwise, the k-atomicity and -atomicity properties are undefined and cannot be used for computing staleness.
Dangling reads indicate missing information, such as when the first access to an object in a trace is a read and the corresponding write occurs before the start of the trace. In this case the dangling read is identified easily and can be dropped from the trace with no ill effects. Reversed operations are equally easy to detect, but more difficult to remedy. Clock synchronization techniques such as atomic clocks and GPS10 can eliminate reversed operations altogether. Even ordinary Network Time Protocol (NTP) is sufficient if the synchronization of clocks is tight enough. Alternatively, one can also estimate clock skew directly from reversed operations and adjust the trace accordingly.
Given a trace free from dangling reads and reversed operations, the next challenge is to compute staleness metrics. Efficient algorithms are critical in this context because the trace may be very long, which means that both the space and computation required is high. In fact, the problem of testing whether a trace is linearizable was shown by Gibbons and Korach to be intractable.13 Testing whether a trace is k-atomic or -atomic for fixed k and is at least as hard since these properties are equivalent to linearizability when k = 1 and = 0. Computing k and from a trace is harder still, and hence also intractable.
Fortunately, the intractability result breaks in the special case when every write on a given object assigns a unique value. This condition is straightforward to enforce (for example, by embedding a unique token in each value written, such as a node ID and timestamp), which opens the door to efficient algorithms for computing staleness metrics from traces in practice. For -atomicity, such an algorithm is known,15 and has been used to analyze time-based staleness in traces obtained using Cassandra.24 For k-atomicity, an efficient algorithm is known only for deciding whether a trace is 2-atomic.14
Passive analysis in general is not limited to staleness metrics, as illustrated by the technique of cycle analysis,5,30 which detects linearizability violations by analyzing a conflict graph representing the execution trace. The absence of cycles in such a graph indicates the trace is linearizable, and so the number and length of such cycles can be defined as metrics for inconsistency.5,30 The relationship of these metrics to staleness is not known precisely.
Prediction, active measurement, and passive analysis are all useful ways to study eventual consistency. Although each method has specific strengths and weaknesses, it is difficult to compare them directly, as they meet different goals. Passive analysis is client-centric in that it reflects the manner in which the client application interacts with the system. As a result, passive analysis can be used to compare staleness observed in two workloads applied to the same storage system. Active measurement, on the other hand, is system-centric in that it measures the convergence time of a system's replication protocol for a controlled workload. Thus, active measurement is best suited for comparing different systems in terms of staleness, or the same system under different software or hardware configurations. The PBS framework6 is also designed around a controlled workload, and for that reason it is classified as system-centric.
The techniques discussed in this article also vary in terms of conceptual models of eventual consistency. Active measurement and PBS are based upon a simplified model, similar to Vogels's definition,28 in which writes occur before reads. In contrast, passive analysis considers a general model in which writes may overlap in time with reads and with other writes. As explained in another article by Golab et al.,16 storage systems may behave differently in these two models when clients follow the "R + W > N" rule,28 which ensures read and write operations access overlapping subsets of replicas. The latter condition has long been considered sufficient for "strong consistency,"6,28 and in fact guarantees linearizability in the simplified model but permits linearizability violations in the general model.
Despite the somewhat high cost of analyzing a detailed trace of operations, passive analysis plays an important role in understanding eventual consistency. Whereas any perspective on consistency is ultimately affected by the state of data in replicas in a storage system, which can be thought of as the "ground truth," passive analysis is most closely tied to the actual consistency observed by client applications. Specifically, it reflects data-level anomalies only when they manifest themselves to clientsfor example, as stale reads.
For modern online service providers, small decreases in latency are known to have a measurable effect on user experience and, as a consequence, revenue.17 As weak consistency continues to gain traction as the means for reducing latency, the ability to reason clearly about this trade-off will be an important competitive advantage for service providers.
Consistency-aware pricing schemes and guarantees are already being adopted by industry. Amazon's DynamoDB supports differentiated pricing for consistent reads and eventually consistent reads, with the former guarantee costing the application developer twice as much.3 More recently, Terry et al. describe a system that allows client applications to select different grades of weak consistency, such as "at most 200-ms latency and at most 5-minute staleness," through novel SLAs (service-level agreements).27 Other read guarantees, such as monotonic reads or causal consistency, can be requested, and a "wish list" of different combinations can be specified with different utilities. This opens the door to even more fine-grained pricing schemes, making it more important than ever for users to understand the utility of different points in consistency-related trade-offs.
Whereas the understanding of eventual consistency is expected to improve with additional empirical studies involving measurement, fundamental technical problems also remain to be solved. In particular, the problem of consistency verification remains an open and important challenge. Storage system designers need verification techniques to test whether their implementation correctly fulfills application-specified SLAs, and, likewise, application developers could use such tools to verify the service quality actually received.
One of the open problems related to verification algorithms is to determine the computational complexity of deciding k-atomicity. Existing algorithms handle only the special case k < 3,14 which limits their use in practice. Similar technical ideas can be applied when k 3 but only for a restricted class of traces, and it is not known whether an efficient algorithm exists for deciding k-atomicity in the general case.
Eventual consistency is increasingly viewed as a spectrum of behaviors that can be quantified along various dimensions, rather than a binary property that a storage system either satisfies or fails to satisfy. Advances in characterizing and verifying these behaviors will enable service providers to offer an increasingly rich set of service levels of differentiated performance, ultimately improving the end user's experience.
We are grateful to Jay J. Wylie, Indranil Gupta, and Doug Terry for their helpful feedback.
Eventual Consistency Today: Limitations, Extensions, and Beyond
Peter Bailis and Ali Ghodsi
BASE: An Acid Alternative
3. Amazon Web Services. DynamoDB; http://aws.amazon.com/dynamodb/.
4. Amazon Web Services. SimpleDB; http://aws.amazon.com/simpledb/.
5. Anderson, E., Li, X., Shah, M.A., Tucek, J. and Wylie, J.J. What consistency does your key-value store actually provide? In Proceedings of the 6th Usenix Workshop on Hot Topics in System Dependability, 2010.
6. Bailis, P., Venkataraman, S., Franklin, M.J., Hellerstein, J.M. and Stoica, I. Probabilistically bounded staleness for practical partial quorums. In Proceedings of the VLDB Endowment 5, 8 (2012), 776787.
7. Bermbach, D., Sakr, S. and Zhao, L. Towards comprehensive measurement of consistency guarantees for cloud-hosted data storage services. In Proceedings of the 5th TPC (Transaction Processing Performance Council) Technology Conference on Performance Evaluation and Benchmarking, 2013.
8. Bermbach, D. and Tai, S. Eventual consistency: how soon is eventual? An evaluation of Amazon S3's consistency behavior. In Proceedings of the 6th Workshop on Middleware for Service-oriented Computing, 2011.
11. Davidson, A., Rubinstein, A., Todi, A., Bailis, P. and Venkataraman, S. Adaptive hybrid quorums in practical settings, 2012; http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F12/projects/reports/project12_report_ver2.pdf.
15. Golab, W., Li, X. and Shah and M.A. Analyzing consistency properties for fun and profit. In Proceedings of the 30th Annual ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing, (2011), 197206.
17. Hamilton, J. The cost of latency; http://perspectives.mvdirona.com/2009/10/31/TheCostOfLatency.aspx.
22. Lloyd, W., Freedman, M.J., Kaminsky, M. and Andersen, D.G. Don't settle for eventual: Scalable causal consistency for wide-area storage with COPS. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles, (2011), 401416.
23. Patil, S., Polte, M., Ren, K.. et al. YCSB++: benchmarking and performance debugging advanced features in scalable table stores. In Proceedings of the 2nd ACM Symposium on Cloud Computing, (2011), 9:19:14.
24. Rahman, M.R., Golab, W., AuYoung, A., Keeton, K. and Wylie, J.J. Toward a principled framework for benchmarking consistency. In Proceedings of the 8th Usenix Workshop on Hot Topics in System Dependability, 2012
27. Terry, D.B., Prabhakaran, V., Kotla, R., Balakrishnan, M., Aguilera, M.K. and Abu-Libdeh, H. Consistency-based service-level agreements for cloud storage. In Proceedings of the 24th ACM Symposium on Operating Systems Principles, 2013.
29. Wada, H., Fekete, A., Zhao, L., Lee, K. and Liu, A. Data consistency properties and the trade-offs in commercial cloud storage: the consumers' perspective. In Proceedings of the 5th Biennial Conference on Innovative Data Systems Research, 2011, 134143.
©2014 ACM 0001-0782/14/03
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from email@example.com or fax (212) 869-0481.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2014 ACM, Inc.
No entries found