Reading, 'riting, and 'rithmetic, the three Rs underlying much of human intellectual activity, not surprisingly, also stand as a venerable foundation of modern computing technology. Indeed, both the Turing machine and von Neumann machine models operate by reading, writing, and computing, and all practical uniprocessor implementations are based on performing activities structured in terms of the three Rs. With the advance of networking technology, communication became an additional major systemic activity. However, at a high level of abstraction, it is apparently still more natural to think in terms of reading, writing, and computing.
While it is difficult to imagine distributed systemssuch as those implementing the World Wide Webwithout communication, we often imagine browser-based applications that operate by retrieving (that is, reading) data, performing computation, and storing (that is, writing) the results. In this article, we deal with the storage of shared readable and writable data in distributed systems with the focus on implementations that provide resiliency and consistency in dynamic settings, namely, in systems that are subject to perturbations in the underlying distributed platforms composed of computers and networks that interconnect them. The perturbations include failures of individual computers, dynamically changing collections of computers participating in the system, and failures and delays in the communication medium.
Shared storage services are at the core of most information-age systems. Shared memory systems surveyed here provide objects supporting two access operations: read that obtains the current value of the object, and write that replaces the old value of the object with a new value. Such objects are often called registers. Although we do not include more complicated object semantics, such as transactions or integrated read-modify-write operations, there are common implementation challenges that any distributed storage system needs to resolve. Imagine a storage system that is implemented as a central server. The server accepts client requests to perform operations on its data objects and returns responses. While this is conceptually simple, this approach already presents two major problems. The first is the central server is a performance bottleneck. The second is the server is a single point of failure. The quality of service in such an implementation degrades rapidly as the number of clients grows, and the service becomes unavailable if the server crashes (imagine how inadequate a Web news service would be were it implemented as a central server). Thus the system must, first of all, be available. This means it must provide its services despite failures within the scope of its specification, for example, the system must be able to mask certain server and communication failures. The system must also support multiple concurrent accesses without imposing unreasonable degradation in performance. The only way to guarantee availability is through redundancy, that is, to use multiple servers and to replicate the contents of the objects among these servers. Moreover, the replication must be done at geographically distributed and distinct network locations, where the disconnection or failures of certain subsets of data servers can be masked by the system.
It is also critically important to ensure data longevity. A storage system may be able to tolerate failures of some servers, but over a long period it is conceivable that all servers may need to be replaced, because no servers are infallible, and due to planned upgrades. Additionally, in mobile settings, for example, search-and-rescue or military operations, it may be necessary to provide migration of data from one collection of servers to another, so the data can move as the needs dictate. Whether our concern is data longevity or mobility, the storage system must provide seamless runtime migration of data: one cannot stop the world and reconfigure the system in response to failures and changing environment.
A major problem that comes with replication is consistency. How does the system find the latest value of a replicated object? This problem was not present with a central server implementation: the server always contains the latest value. In a replicated implementation, one may attempt to consult all replicas in search of the latest value, but this approach is expensive and not fault-tolerant as it assumes that all replicas are accessible. In any case, none of the implementation issues should be a concern for the clients of the distributed memory service. What the clients should expect to see is the illusion of a single-copy object that serializes all accesses so that each read operation returns the value of the preceding write operation, and that this value is at least as recent as that returned by any preceding read. More generally, the behavior of an object, as observed externally, must be consistent with the abstract sequential data type of the object, and in developing applications that use such objects the clients must be able to rely on the abstract data type of the object. This notion of consistency is formalized as atomicity28 or, equivalently, as linearizability.25
While there is no argument that atomicity is the most convenient notion of consistency, we note that weaker notions have also been proposed and implemented, motivated primarily by efficiency considerations. For example, several less-than-intuitive consistency definitions emerged from the domain of multiprocessor memory systems, precipitating an anecdotal opinion that "no one can figure out these consistency models, but the memory access is fast." Atomicity provides strong guarantees, making it more expensive to provide than weaker consistency guarantees.4 Weak consistency also needs to be considered in the partitionable network setting. Eric Brewer conjectured7 that distributed storage systems cannot simultaneously provide consistency, availability, and partition-tolerance; known as the "CAP conjecture," this was later formalized as a theorem.23 Thus weaker consistency models need to be considered in some cases.8 We take the view that it is nevertheless important to provide simple and intuitive atomic consistency.
The situation reminiscent of the early days of programming languages when it was argued that assembly language is better because one can generate more efficient code, or the early days of graphical user interfaces when it was argued that command-line interfaces are better because they consume fewer resources. It is conceivable that such arguments will also pass in the case of atomicity. In the setting where network partitions are a concern, two approaches are considered. When partitions are intermittent and short-lived, one can deal with them as long(er) network delays, delegating the repair issues to lower level communication services. Otherwise, when partitions can be permanent, strong consistency can still be guaranteed using the approaches that constrain the service to primary partitions, with subsequent integration when partitions are merged. Relevantly, in a recent keynote address,30 ACM's 2008 A.M. Turing Award recipient Barbara Liskov remarked that atomicity is not cheap but if we do not guarantee it, this creates headaches for developers.
Interestingly, some existing storage systems provide data access primitives implementing atomic read-modify-write operations. Such access primitives are much stronger than separate read and write primitives we consider here. Implementing this type of consistency is expensive, and at its core requires atomic updates that in practice are implemented by either reducing parts of the system to a single-writer model (for example, Microsoft's Azure9), depending on clock synchronization hardware (Google's Spanner13), or relying on complex mechanisms for resolving event ordering such as vector clocks (Amazon's Dynamo15). Our exposition of atomic read/write storage illustrates challenges that are common to all distributed storage systems.
Here, we describe a general distributed setting for implementing consistent shared memory services.
Modeling distributed platforms. We model the system as a collection of interconnected computers, or nodes, that communicate by sending point-to-point messages. Each node has a unique identifier (such as IP address), local storage, and it can perform local computation. A node may fail by crashing at any point of the computation. Any node that crashes stops operating: it does not perform any local computation, it does not send any messages, and any messages sent to it are not delivered.
The system is asynchronous, and the nodes have no access to a global clock or synchronization mechanisms. This means that relative processing speeds at the nodes can be arbitrary, and that the nodes do not know the upper bound on time that it takes to perform a local computation. The message delays can also be arbitrary, and the nodes do not know bounds on message latency (although such bounds may exist). Thus the algorithms may not make assumptions about global time or delays, since the rate of processing at each node and message latencies are unknown.
We assume that messages can be reordered in transit, however, the messages cannot be corrupted, duplicated, or generated spontaneously. If a message is received then it must have been previously sent. The messages are not lost, but message loss can be modeled as long delays (we do not address techniques for constructing more dependable communication services, for example, by using retransmission or gossip).
We categorize a distributed networked system as either static or dynamic as follows. In the static networked system the set of participating nodes is fixed, and each node may know the identity of every other participant; crashes (or voluntary departures) may remove nodes from the system. In the dynamic networked system the set of nodes may be unbounded, and the set of participating nodes may completely change over time as the result of crashes, departures, and new nodes joining; such changes may occur at any time. At this point we do not state assumptions on the magnitude of failures or "churn" in a dynamic setting, and we postpone this discussion until specific solutions are presented. Regardless of the dynamics, we require that consistency of the data is preserved. Perturbations in the computing medium may negatively affect the performance of the memory services we consider, however memory access operations are guaranteed to terminate under certain assumptions. For example, a static memory service guarantees that operations terminate when a majority of replicas is active and when the network delays are bounded. Dynamic memory services relax the static majority assumption and instead assume that dynamically changing subsets of replicas (for example, dynamic quorums) are active during certain periods. If these assumptions are not satisfied, such services still guarantee atomicity, however some operations may be slow or may not terminate.
Overall, we consider the participating nodes to be "good citizens" in that they cooperate willingly when they are able to, and they do not act maliciously by intentionally or accidentally performing incorrect computation. Thus we do not present techniques for dealing with malicious behaviors. Instead we direct the interested reader to related work that considers participants that may act nefariously, for example, Rodrigues et al.,35 and Martin and Alvisi.34
Distributed shared memory and consistency. A distributed shared memory service emulates a shared memory space comprised of readable and writable objects (often called registers) over a networked platform consisting of distributed network nodes that communicate by message passing. Service implementations use replication to ensure survivability and availability of the objects, but the service makes this invisible to the clients. The content of each object is replicated across several servers or replica hosts. Clients invoke read and write operations on the objects, where the clients that perform read operations are called readers, and those that perform write operations are called writers (a client may be both a reader and a writer).
In response to client requests the service invokes a protocol that involves communication with the replica hosts. This protocol determines the consistency guarantees of the memory system. Atomic consistency definition involves "shrinking" the duration of each operation in any execution to a chosen serialization point between the operation's invocation and response, and requiring that the ordering of the operations according to the serialization points preserves their real-time ordering, and the resulting behavior of the object is consistent with its sequential specification. In particular, if a read is invoked after a write completes, then the read is guaranteed to return either the value of that write, or a value written by subsequent write that precedes the read. Additionally, if a read is invoked after another read completes, it returns the same or a "newer" value than the preceding read. (We present this more formally in an appendix available in the ACM Digital Library.) It is due to these natural properties that atomicity is the most convenient and intuitive consistency guarantee.
A major problem that comes with replication is consistency. How does the system find the latest value of a replicated object?
Atomic registers were introduced by Lamport28 (who also defined weaker notions, such as safety and regularity). Herlihy and Wing26 proposed an equivalent definition called linearizability, that also extends atomicity to arbitrary data types. Given that atomicity is the strongest notion of consistency providing the most useful access semantics, we focus on the atomic shared memory systems.
Algorithms designed for static settings must accommodate some dynamic behaviors, such as asynchrony, transient failures, and permanent crashes within certain limits. Here, we present a shared memory service for such settings for two reasons: to set the stage for more dynamic services in the next section, and to focus on the approach to atomicity that is also used in several dynamic services. The first comprehensive survey of implementations of shared memory in the static setting is presented by Chockler et al.12 Some replica access protocols given in that work can be used as building blocks for dynamic implementations, however, algorithms designed for the static setting cannot be used directly due to their inability to handle dynamically changing sets of replicas. One such algorithm is the seminal implementation of atomic shared memory given by Attiya, Bar-Noy, and Dolev,3 and commonly referred to as the ABD algorithm; this work won the Dijkstra Prize in 2011.
The algorithm implements atomic memory, where replication helps achieve fault-tolerance and availability. The system tolerates f replica node crashes, provided the number of replicas n is such that n > 2f. Our presentation includes extensions to the original algorithm following Lynch and Shvartsman.32
Each replica host i maintains the local value of the register, valuei and a tag associated with the replica, tagi = seq, pid, where seq is a sequence number and pid is the identifier of the
node that performed the write of that value. Tags are compared lexicographically. Each new write assigns a unique tag, where the originating node id is used to break ties. These tags in effect determine an ordering of the write operations, and therefore determine the values that subsequent read operations return.
The clients of a shared memory service should expect to see the illusion of single-copy objects that serialize all accesses.
Read and write operations are similar, each is implemented in terms of two phases, a
Get phase that queries replicas for information, and a
Put phase that propagates information to replicas. Each phase protocol ensures some majority participates in the communication exchange: first the messages are sent to all replica hosts, then the replies are collected until some majority of replicas responds. Recall that since n > 2f, a majority of non-crashed replicas always exists. Thus, each phase terminates after a single communication round, and any operation terminates after two communication rounds.
The correctness of this implementation, that is, atomicity, follows from the fact that for any pair of operations when one follows
another, at least one correct replica is involved in the
Put phase of the first operation and in the
Get phase of the second; this ensures the second operation will always "see" the value that is at least as recent as that of the most recent preceding operation. We point out the only property used in conjunction with the majorities is that any two majorities intersect. If waiting for a majority is too tasking, one can use quorum systems instead.32,38 (Details and the complete pseudo-code are in the appendix available in the ACM Digital Library.)
Here, we present several approaches for providing consistent shared memory in more dynamic systems, that is, where nodes may not only crash or depart voluntarily, but where new nodes may join the service. In general, the set of object replicas can substantially evolve over time, ultimately migrating to a completely different set of replica hosts. The ABD algorithm cannot be used directly in such settings because it relies on the majority of original replica hosts to always be available. In order to use an ABD-like approach in dynamic settings, one must provide some means for managing the collections of replica hosts, and to ensure that readers and writers contact suitable such collections.
It is noteworthy that dealing with dynamic settings and managing collections of nodes does not directly address the provision of consistency in memory services. Instead, these issues are representative of the broader challenges present in the realm of dynamic distributed computing. It is illustrative that implementations of consistent shared memory services can sometimes be constructed using distributed building blocks, such as those designed for managing collections of participating nodes, for providing suitable communication primitives, and for reaching agreement (consensus) in dynamic distributed settings. A tutorial covering several of these topics is given by Aguilera et al.1
We start by presenting the consensus problem because it provides a natural basis for implementing an atomic memory service by establishing an agreed-upon order of operations, and because consensus is used in other ways in atomic memory implementations. Next, we present group communication services (GCS) solutions that use strong communication primitives, such as totally ordered broadcast, to order operations. Finally, we focus on approaches that extend the ideas of the ABD algorithm to dynamic settings with explicit management of the evolving collections of replica hosts.
Consensus. Reaching agreement in distributed settings is a fundamental problem of computer science. The agreement problem in distributed settings is called consensus. Here, a set of processes needs to agree on a value, with nodes proposing several values for consideration. Any solution must have the following properties: Agreement: no two processes decide on different values; Validity: the value decided was proposed by some process; Termination: all correct processes reach a decision. Consensus is a powerful tool in designing distributed services,33 however, consensus is a notoriously difficult problem in asynchronous systems where termination cannot be guaranteed in the presence of even a single process crash;19 thus consensus must be used with care. (One approach to solving consensus is based on introducing "failure detectors" that provide, possibly limited, information about the nodes in the system.10)
Consensus algorithms can be used directly to implement an atomic data service by letting the participants use consensus to agree on a global total ordering of all operations.29 The correctness (atomicity) in this approach is guaranteed regardless of the choice of a specific consensus implementation, but the understanding of the underlying platform characteristics can guide the choice of the implementation for the benefit of system performance (for a tour de force of implementations, see Lynch33). Nevertheless, using consensus for each operation is a heavy-handed approach, especially given that perturbations may delay or even prevent termination. Thus, when using consensus, one must avoid invoking it in conjunction with individual memory operations, and make operations independent of the termination of consensus.
We note that achieving consensus is a more difficult problem than implementing atomic read/write objects, in particular, consensus cannot be solved for two or more processes by using atomic read/write registers.31
Group communication services. Among the most important building blocks for distributed systems are group communication services (GCSs)5 that enable processes located at different nodes of a network to operate collectively as a group. The processes do this by using a GCS multicast service to send messages to all members of the group. GCSs offer various guarantees about the order and reliability of message delivery.
The basis of a GCS is a group membership service. Each process, at each time, has a unique view of the group that includes a list of the processes that are members of the group. Views can change over time, and may become different at different processes. Another important concept introduced by the GCS approach is virtual synchrony, where an essential requirement is that processes that proceed together through two consecutive views deliver the same set of messages between these views. This allows the recipients to take coordinated action based on the message, the membership set, and the rules prescribed by the application.5
GCSs provide one approach for implementing shared memory in dynamic networks. This can be done, for example, by implementing a global totally ordered multicast service on top of a view-synchronous GCS18 (where there is a total order on the messages associated with each view, and each participant receives a prefix of this order). The ordered multicast is used to impose an order on the memory access operations, yielding atomic memory. The main disadvantage in such solutions is that in most GCS implementations, forming a new view takes a substantial amount of time, and client memory operations are delayed (or aborted) during the view-formation period. New view formation normally involves several rounds of communication, and in typical GCS implementations performance is degraded even if only one node crashes. In memory services it is desirable to ensure that read and write operations make progress during reconfiguration, and it is important to tolerate a modest number of failures without imposing a performance penalty.
Another approach is to integrate a GCS with the ABD algorithm. For example, the dynamic primary configuration GCS as describe by De Prisco et al.14 implements atomic memory by using techniques of Attiya3 within each configuration. Here, a configuration combines a group view with a quorum system. Configurations can change either due to the dynamic nature of the system or because a different quorum system is desired, and like other solutions based on GCSs, reads and writes are delayed during reconfiguration. Lastly, any new configuration must satisfy certain intersection properties with respect to the previous configurations. This impacts the flexibility of reconfiguration and requires intermediate configuration changes in reaching the desired configuration.
A general methodology for dynamic service replication is presented in Birman.6 This reconfiguration model unifies the virtual synchrony approach with state machine replication, as used in consensus solutions, in particular, Paxos.29
DynaStore algorithm. DynaStore2 is an implementation of a dynamic atomic memory service for multi-writer/multi-reader objects. It integrates the ABD algorithm and allows reconfiguration of the collection of replica hosts without the use of consensus.
The participants start with a default local configuration, that is, some common set of replica hosts. The algorithm supports three kinds of operations: read, write, and reconfig. The read and write operations involve two phases, and in the absence of reconfigurations, the protocol is similar to ABD: it uses majorities of replicas, where each replica maintains the value of the object and the associated tag.
If a participant wishes to change its current configuration, it uses the reconfig operation and supplies with it a set of incremental changes with elements of the form (+, i), indicating the addition of replica host i, and (, j), indicating the removal of host j. The implementation of reconfig consists of two phases (described below) and uses a distributed weak snapshot service to announce the locally originating changes by means of the update primitive, and obtain the changes submitted by other members of the configuration by means of the scan primitive. The snapshot itself is not atomic as it does not globally order updates, and because scan is not guaranteed to reflect all previously completed updates. Yet, the snapshot service is sufficient for establishing a certain directed acyclic graph (DAG) that is stored locally as part of the state of each participant. Vertices of the graph correspond to configurations that can be produced by means of changes that in turn correspond to the edges.
The implementation of reconfig involves traversals of such DAG's, representing possible sequences of changed configurations. In each traversal the DAG may be revised to re-multiple changes, submitted at different hosts, to the same configuration. The assumption that a majority of the involved hosts are not removed and do not crash ensures there is a path through the DAG that is guaranteed to be common among all hosts. Interestingly, the hosts themselves do not learn of this common path, however, traversing all paths ensures the common path is also traversed. The traversal terminates when a sink node is reached. An additional assumption that there is a finite number of reconfigurations ensures termination.
Now we return to the two-phase structure of reconfig. The goal of the first phase is similar to the
Get phase of ABD: discover the latest value-tag pair for the object. The goal of the second phase is similar to the
Put phase of ABD: convey the latest value-tag pair to a suitable majority of replica hosts. The main difference is that these two phases are performed in the context of applying the incremental changes to the configuration, while at the same time discovering the changes submitted by other participants. This essentially "bootstraps" possible new configurations. Given that all of this is done by traversing all possible pathsand thus configurationsin the DAG's ensures the common path is also traversed.
Finally, we provide additional details for the read and write operations. The read follows the implementation of reconfig, with the differences being: (a) the set of configuration changes is empty, and (b) the discovered value is returned to the client. The write also follows the implementation of reconfig, with the differences being: (a) the set of changes is empty, (b) a new, higher tag is produced upon the completion of the first phase, and (c) the new value-tag pair is propagated in the second phase. Note that while the set of configuration changes is empty for read and write, both operations may discover changes submitted elsewhere, and help with the bootstrapping of revised configurations. As with reconfig, stable majorities ensure nothing blocks the protocols from progressing, and the finite number of reconfigurations guarantees termination.
It is worth reiterating that Dyna Store implementation does not incorporate an agreement protocol for reconfiguration. On the other hand, reconfigurations are accomplished by additions and removals of individual nodes and this may lead to larger overheads as compared to approaches that evolve the system by replacing a complete configuration with another. Thus the latency of read and write operations are more dependent on the rate of reconfigurations (we touch on this later when discussing DynaDisk). Finally, in order to guarantee termination, DynaStore assumes reconfigurations eventually subside.
The Rambo framework. Rambo is a dynamic memory service supporting multi-reader/multi-writer objects;24 Rambo stands for Reconfigurable Atomic Memory for Basic Objects. This algorithm uses configurations, each consisting of a set of replica hosts plus a quorum system defined over these hosts, and supports reconfiguration, by which configurations can be replaced. Notably, any quorum configuration may be installed at any time, and quorums from distinct configurations are not required to have non-empty intersections. The algorithm ensures atomicity in all executions.
During quiescent periods when there are no reconfigurations, the algorithm operates similarly to the ABD algorithm3 (generalized as in Lynch and Shvartsman32): each of the two phases involves interaction with one complete quorum in the current configuration. New participants join the service by means of message handshakes with at least one existing participant (this involves no reconfiguration).
Any participant may crash at any time. To enable long-term operation of the service, quorum configurations can be reconfigured. Reconfigurations are performed concurrently with any ongoing read and write operations, and do not directly affect such operations. Additionally, multiple reconfigurations may be in progress concurrently. Reconfiguration involves two decoupled protocols: introduction of a new configuration by the component called Recon, and upgrade to the new configuration and garbage collection of obsolete configuration(s).
We now discuss this in more detail. The main data structure maintained by each participant is the configuration map, or cmap, that stores a sequence of configurations, where for node i, cmapi (k) is the configuration number k.
This sequence evolves as new configurations are introduced by Recon and as old configurations are garbage collected. It is possible for participants to have differing views on what is in each cmap, however Recon always emits a unique new configuration k to be stored in every cmapi (k). This is done as follows. Any node i that is a member of its latest known configuration c = cmapi (k 1) can propose a new configuration at any time. Different proposals are reconciled by executing consensus among the members of c (here, consensus can be implemented, for example, using a version of Paxos29). Although a consensus execution may be slowin fact, in some situations, it may not even terminatethis dors not delay read and write operations (provided at least one quorum set is intact for the involved configurations during the operations). Note the members of the new configuration may or may not know the latest value of the object. It is the duty of a decoupled upgrade protocol to garbage collect old configurations and propagate the information about the object to the latest locally known configuration. Here, a two-phase algorithm first tells a quorum of each older configuration about the new one and obtains the latest object information, then propagates this information to a quorum of the new configuration and removes the obsolete configurations.
GeoQuorums is an approach to implementing atomic shared memory on top of a physical platform that is based on mobile nodes moving in arbitrary patterns.
It is possible for the participants to have multiple active (non-garbage-collected) configurations in their cmap if reconfigurations occur quickly, or if configuration upgrade lags behind. In this case the two-phase protocol that implements read and write operations behaves as follows. The first phase gathers information from the quorums of active configurations; the second phase propagates information to the quorums of active configurations. Note that during each phase new configurations may be discovered. To handle this, each phase is terminated by a fixed-point condition that involves a quorum from each active configuration.
Revisiting reconfiguration in Rambo, memory access operations are decoupled from reconfigurations enabling operations to terminate even when Recon may be slow due to its use of consensus. The reconfiguration itself involves two separate activities: a consistent sequence of configurations is issued with the help of consensus, and then the upgrade process finalizes the reconfiguration. In some settings it may be advantageous to integrate these two activities for specialized performance reasons, for example, as in RDS service.11
Lastly, Rambo can be viewed as a framework for refinements and optimizations, some of which were implemented in a prototype system, for example, Georgiou et al.20 Next, we describe a remolding of the Rambo framework for ad hoc mobile networks.
GeoQuorums16 is an approach to implementing atomic shared memory on top of a physical platform that is based on mobile nodes moving in arbitrary patterns. An ad hoc network uses no preexisting infrastructure; instead, the mobile nodes that cooperate to route communication from sources to destinations form the network. GeoQuorums can be viewed as a system of two layers, where the top layer implements a dynamic replicated storage system, and the bottom layer provides object replicas in terms of stationary focal points that are implemented by the mobile nodes.
The focal points are geographic areas of interest that are normally "populated" by mobile nodes. This may be a road junction, a scenic observation point, or a water resource in the desert. Mobile nodes in the vicinity of a focal point participate in implementing a stationary virtual object, called the focal point object. The implementation at each focal point supports a local broadcast service, LBcast, that provides reliable, totally ordered broadcast. LBcast is used to implement a type of replicated state machine, one that tolerates joins and leaves of mobile nodes. If every mobile node leaves the focal point, the focal point object fails.
Next, this approach defines a collection of quorum systems over the focal points. Each quorum system involves two sets, called get-quorums and put-quorums, with the property that every get-quorum intersects every put-quorum. The use of quorums enables the service to tolerate the failure of a limited number of focal point objects. For reasons of performance, or in response to periodic migration of mobile nodes, different quorum systems can be installed.
To facilitate communication with the focal point objects, GeoQuorums assumes the availability of a GeoCast service (as in Imielinski26) that enables message delivery to the geographic area of a focal point. With this setting, one can use the Rambo framework as the top layer to implement an atomic memory system with focal points serving as replica hosts. Motivated by simplicity and efficiency considerations, the GeoQuorums approach makes additional modifications. The first deals with reconfiguration, and the second affects how read and write operations are processed.
Dynamic atomic memory services provide data consistency and availability despite the perturbations of the underlying distributed platforms.
GeoQuorums introduced the first general reconfiguration capability that does not rely on consensus. The algorithm reconfigures among a finite number of predetermined configurations, and instead of consensus it uses a two-phase protocol that is similar to the upgrade protocol in Rambo. Here in the first phase, the invoker contacts any complete get-quorum and put-quorum of all preceding configurations (note that at most one pair of messages per focal point is needed even if the finite number of possible configurations is large), then in the second phase information is conveyed to any complete put-quorum of the next configuration.
GeoQuorums implements a modified approach to read and write operations that allows some operations to complete in just one phase. This is accomplished for the case of writes with the help of a global positioning system (GPS) clock to generate tags for the written values, thus ordering writes. This obviates the need for the phase that in other implementations determines the highest tag, and the write protocol here performs just a single put phase that interacts with any put-quorum in the current configuration. If the write detects a concurrent reconfiguration, it also awaits response from put-quorums in every configuration (this involves at most one contact with each focal point). Once the write completes, the tag becomes confirmed. For the case of reads, this protocol involves one or two phases. The first, get, phase proceeds as a typical query phase to obtain the value with the maximum tag from some complete get-quorum; if a concurrent reconfiguration is detected, then the phase also awaits responses from one get-quorum from each configuration (again, at most one message exchange per focal point). Once the get phase completes, and it is determined the maximum obtained tag is confirmed, the read terminates. Otherwise, the read performs the put phase that propagates the maximum tag-value pair to some put-quorum. The information about the confirmed tags is propagated through the system to enable single-phase read operations.
The use of separate get-quorums and put-quorums allows one to tune the system performance depending on the balance between reads and writes. When there are more writes than reads, the system can reconfigure to use larger get-quorums and smaller put-quorums because writes access only the put-quorums. When reads are more prevalent, the system can use smaller get-quorums and larger put-quorums because some reads access just the get-quorums.
GeoQuorums uses real-time timestamps (provided through GPS) to expedite read and write operations, allowing the writes and some reads to complete in a single round. Lastly, the assumption that there is a fixed set of focal points limits the system's evolvability, but allows the algorithm to reconfigure without the use of consensus.
The precise consistency guarantees, the tolerance to failures, and the ability to operate in dynamic environments prompted researchers to construct exploratory implementations of some services. We have already mentioned the exploratory implementation of Rambo variants. Here, we present two additional implementations. The first is a distributed disk array implementation36 that derives algorithmic ideas from Rambo. The second implementation37 is based on the DynaStore algorithm.
Federated Array of Bricks (FAB)36 is a storage system developed and evaluated at HP Labs. FAB deals with disk storage, where the basic objects are logical blocks. The goal of the implementation is to outperform traditional master-slave replication by distributing the workload and handling failures and recoveries without disturbing the client requests.
FAB uses computers, called bricks, equipped with commodity disks and a network interface. To achieve distribution, FAB splits the storage into logical blocks, and using an erasure-coding algorithm replicates each logical block to a subset of bricks. The system is based on majority quorum systems, where to ensure longevity of the service a reconfiguration algorithm moves data when bricks are added or removed. A client performs read and write operations by sending a request to a brick that specifies the logical block it wants to obtain or modify. The brick then determines the set of bricks that store this block and runs the read or write protocol involving this set of bricks as specified by the Rambo algorithm, also using tag-value pairs to determine the order of the written values. Writes require two phases to complete whereas reads may take one or two phases. One-phase reads are possible when the reader receives the same tag from all bricks. To improve efficiency the reader requests the actual block contents only from an idle, active replica, and requests only tags from other members of a quorum.
Evaluations of the implementation showed that FAB performance is similar to the centralized solutions, while offering at the same time continuous service and high availability.
DynaDisk37 is an evaluation implementation of the DynaStore algorithm.2 The implementation was tested in a local area network and adopts the data-centric approach, where the replica hosts are passive network storage devices.
The implementation was designed to reconfigure the service either asynchronously without the use of consensus, or partially synchronously with the help of consensus.
The evaluation showed that in the absence of reconfigurations the two versions of the algorithm have similar read and write operation latencies. When multiple reconfigurations occur concurrently, it was observed that the asynchronous, consensus-free approach has a significant negative effect on the latency of read and write operations that are concurrent with reconfigurations. The explanation here is that the consensus-free algorithm must examine multiple possible configurations, whereas an algorithm that incorporates reconfiguration with consensus normally deals with a single configuration on which all clients agree. Reconfiguration latency on the other hand is somewhat better in the consensus-free version of DynaDisk when many reconfigurations occur simultaneously. In such situations, the consensus-based DynaDisk may take longer to reach a decision.
We presented several approaches to implementing atomically consistent memory services in dynamic distributed systems. In such systems the collections of participating nodes may evolve over time due to failures, voluntary or planned departures of nodes, and new nodes join the computation. We focused on atomic consistency because it is an intuitive notion that, despite replication, provides equivalence with serial object specifications. Such services make it easier to construct distributed applications, especially given the perception that using the shared memory paradigm is easier than using the message passing paradigm in designing distributed algorithms.
The approaches presented in this article are representative of the different design choices available for implementing distributed memory services. Each of these has its pros and cons. For instance, solutions based on consensus, while conceptually simple, typically use coordinators during certain stages, and the performance may critically depend on the availability of the coordinator. Substantial delays may be incurred during coordinator failover, even if a large majority of hosts do not fail. Group communication services is an effective building block for low-latency networks with high link availability, however they suffer from high overheads during view changes that occur even if the number of failures is relatively small, but spread over longer stretches of time. Theoretically elegant approaches that use incremental host replacement such as DynaStore, quorum replacement such as Rambo, and focal points such as GeoQuorums, may be more complex to implement, but have greater flexibility and allow system-specific optimizations. Achieving good performance also depends on stability of the deployment platform, assumptions on failure rates, and other factors, such as those considered in a tailored implementation of the FAB system.
Regardless of how the reconfiguration of the set of replica hosts is done, any practical implementation must address the challenge of deciding when to reconfigure. Here, one may delegate the decision to individual hosts, for example, a node joining the service causes reconfiguration, or a node implicitly causes reconfiguration when its failure is detected. Although such approaches may be simple and effective when perturbations are infrequent and the set of replicas is small, they may cause undue overheads when nodes continue joining and leaving the service even though some core set of hosts is stable and sufficient for providing good service. Another approach is to leave the decision for when to reconfigure to another distributed service that monitors the performance of the memory system and decides when to reconfigure based on these observations and on speculative forecasts. This is a more complicated solution, but it has the potential of providing superior quality of service. Additional consideration is selecting a suitable set of hosts. Just because a node is interested in serving as a replica host does not mean its wish must be granted. Here, the external service that decides when to reconfigure can also decide on the target set of nodes. Note that no agreement on the target set is requiredall memory services are able to deal with the situations when several target sets are proposed.
The dynamic atomic shared memory services guarantee consistency in all executions, regardless of the magnitude or frequency of replica host failures. Termination of read and write operations, however, is conditioned on restricting failures. For static systems, generally this restriction can be easily formulated: here, any minority subset of hosts is allowed to fail. For dynamic systems the constraints on failure patterns are much more involved and depend on the specific algorithmic approaches; the reader is referred to the cited articles for additional details.
With the advent of Cloud services, distributed storage services are bound to continue attract attention. The technical challenges and performance overheads may be the reasons why the existing distributed storage solutions shy away from atomic consistency guarantees. Commercial solutions, such as Google's File System (GFS),22 Amazon's Dynamo,15 and Facebook's Cassandra,28 provide less-than-intuitive, unproved guarantees. The concepts discussed in this survey are echoed in the design decisions of production systems. For instance, consensus is used in GFS22 to ensure agreement on system configuration as it is done in Rambo; global time is used in Spanner13 as it is done in GeoQuorums; replica access protocols in Dynamo15 use quorums as in some approaches surveyed here. These examples provide motivation for pursuing rigorous algorithmic approaches in the study of consistent data services for dynamic networked systems.
Consistent storage systems continues to be an area of active research and advanced development, and there are good reasons to believe that as high-performance dynamic memory systems with superior fault-tolerance become available, they will play a significant role in the construction of sophisticated distributed applications. The demand for implementations providing atomic read/write memory will ultimately be driven by the needs of distributed applications that require provable consistency and performance guarantees.
This work was supported in part by the NSF award 1017232. We thank Jennifer Welch and the anonymous reviewers for numerous insightful comments.
14. De Prisco, R., Fekete, A., Lynch, N.A. and Shvartsman, A.A. A dynamic primary configuration group communication service. In Proceedings of the 13th Int-l Symposium on Distributed Computing. Springer-Verlag, 1999, 6478.
16. Dolev, S., Gilbert, S., Lynch, N., Shvartsman, A. and Welch, J. GeoQuorums: Implementing atomic memory in ad hoc networks. In Proceedings of the 17th International Symposium on Distributed Computing (2003), 306320.
31. Loui, M.C. and Abu-Amara, H.H. Memory requirements for agreement among unreliable asynchronous processes. In Parallel and Distributed Computing, Vol 4 of Advances in Computing Research. F.P. Preparata, Ed. JAI Press, Greenwich, Conn., 1987, 163183.
35. Rodrigues, R., Liskov, B., Chen, K., Liskov, M. and Schultz, D. Automatic reconfiguration for large-scale reliable storage systems. IEEE Trans. on Dependable and Secure Computing 9, 2 (2012), 145158.
37. Shraer, A. Martin, J.-P., Malkhi, D. and Keidar, I. Data-centric reconfiguration with network-attached disks. In Proceedings of the 4th Int'l Workshop on Large Scale Distributed Systems and Middleware (2010), ACM, 2226.
©2014 ACM 0001-0782/14/06
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from firstname.lastname@example.org or fax (212) 869-0481.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2014 ACM, Inc.
No entries found