What is nonblocking progress? Consider the simple example of incrementing a counter C shared among multiple threads. One way to do so is to protect the steps of incrementing C with a mutual exclusion lock L (for example, acquire(L); C++; release(L);). If a thread P is holding L, then a different thread Q must wait for P to release L before Q can proceed to operate on C. That is, Q is blocked by P.
Now consider an implementation of the increment operation using the compare-and-swap (CAS) atomic primitive. CAS atomically reads a shared location and compares the read value with an expected value. If they are equal, it writes a new value to the location, and it returns an indicator of equality with the expected value. In the following implementation, &C is the address of C.
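A minimal sketch of such a CAS-based increment, using C11 atomics (the function name Increment is illustrative, and atomic_compare_exchange_weak stands in for the CAS primitive described above):

```c
#include <stdatomic.h>

/* Repeatedly read C and try to install old + 1. The CAS fails only if
   some other thread has changed *C in the meantime - that is, only if
   some other operation made progress. */
void Increment(_Atomic int *C) {
    int old;
    do {
        old = atomic_load(C);
    } while (!atomic_compare_exchange_weak(C, &old, old + 1));
}
```

No thread ever waits on another here: a stopped thread holds nothing that others need.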
This implementation is nonblocking because no thread can be blocked by the inaction of other threads (caused by, for example, preemption, a page fault, or even termination), regardless of where the other threads stop.
There are three established levels of nonblocking progress: obstruction-free, lock-free, and wait-free. The weakest is obstruction freedom. An operation is obstruction-free if it is guaranteed to complete in a finite number of steps when running alone [9]. An obstruction-free operation does not need to wait for actions by other threads, regardless of where they have stopped. An obstruction-free operation, however, may starve or end up in a livelock with one or more concurrent operations, where the actions of the threads prevent them all from making progress.
Lock-free progress combines obstruction-free progress with livelock freedom. That is, in a system with only lock-free operations, whenever an operation takes a finite number of steps, some operation (maybe a different one) must have completed during that period.
Finally, wait-free progress [7] combines lock-free progress with starvation freedom. That is, a wait-free operation is guaranteed to complete in a finite number of its own steps, regardless of the actions or inaction of other operations.
It is worth noting these levels of progress guarantees require that primitive memory-access steps have at least the same level of progress guarantee at the hardware-arbitration and cache-coherence levels. For example, a cache-coherence protocol that may cause some primitive memory accesses to be retried indefinitely cannot support a wait-free algorithm that uses such primitives.
Nonblocking operations are often used for systems or interthread interactions where having threads wait for actions by other threads makes the system vulnerable to deadlock, livelock, and/or prolonged delays. Examples of such uses, discussed below, include async signal safety, kill-safety, and the avoidance of priority inversion in real-time applications.
Unlike the example of the shared counter, almost any nonblocking operation that involves more than one location presents multiple considerations that are often at odds. This article examines trade-offs and compromises that have to be considered in selecting features of nonblocking operations. These trade-offs are seldom straightforward and sometimes require compromises in order to achieve the bare minimum of system requirements. Balancing these trade-offs can be daunting if you start unaware of all the potential pitfalls.
The goals of this article are to walk the reader through the various issues and features of nonblocking operations and the various choices to be made; to understand the interactions among these options; and to develop a sense of the best path to take to disentangle the interdependencies among these choices and quickly prune options that are incompatible with the main objectives of using nonblocking operations in the first place.
For some uses of nonblocking operations, such as kill-safety, the need for nonblocking progress likely involves operations on multiple data structures and services. In these cases, one has to consider the totality of the interdependence of the features of the various nonblocking components to reach an acceptable solution.
Figure 1 presents a variant of the classic lock-free IBM LIFO (last-in first-out) free-list algorithm [10] as a running example. The variable Header consists of a packed pointer-integer pair that can be read and operated on atomically by a double-width CAS. The integer size is assumed to be large enough that it never overflows. The notation *b is used to indicate the location pointed to by a pointer b. Depending on the context, reads and CAS operations on Header are either single- or double-width.
Ignoring for now why an integer is packed with the pointer in Header (explained later), the reader can notice that the inaction of any number of threads stopped anywhere in the Push and Pop operations cannot block other threads from operating on the list. In fact, since Push and Pop are not only nonblocking but also lock-free, as long as there are active threads attempting these operations, some operation will complete: whenever any thread takes five steps in one of these operations, some operation must have completed (by the same or a different thread) during that period.
The following are the key issues to consider in selecting the features of nonblocking operations.
Levels of progress guarantee. The appropriate level of nonblocking progress (for example, obstruction-free vs. lock-free) and the extent of use of nonblocking operations depend on the progress guarantees required by the system.
To achieve async signal safety using nonblocking operations, only the signal handler's operations need to be nonblocking, while operations by the main threads can be blocking. For example, where signal handlers search hash tables that the interrupted threads may be in the middle of updating, a fully nonblocking algorithm is not necessary; an algorithm with nonblocking search operations and blocking add and remove operations (as described by Heller et al. [6]) can be sufficient.
If the only concern is the interaction between the signal handler and the interrupted thread, then using obstruction-free operations is sufficient. Kill-safety tends to require broader use of nonblocking operations to access objects shared by processes that are subject to arbitrary termination. Similar to async signal safety, the use of obstruction-free operations suffices to provide availability. Using lock-free or wait-free operations guarantees livelock freedom or starvation freedom, respectively.
For avoiding priority inversion in soft real-time applications, it may suffice to make the high-priority operation obstruction-free; however, stronger progress guarantees (lock-free or wait-free) are clearly desirable in this context.
Wait-free algorithms tend to entail space at least linear in the number of threads that may operate concurrently [1]. Aside from the space overheads, some recent wait-free algorithms show reasonably competitive performance [5, 19]. Lock-free algorithms can achieve performance competitive with blocking algorithms and do not have inherently high space requirements. There are hardly any custom obstruction-free algorithms for specific data structures that are not also lock-free; however, obstruction-free algorithms for arbitrary atomic sections (or transactional memory) are evidently simpler than corresponding lock-free algorithms.
Accordingly, in choosing the appropriate level of nonblocking progress to use, you must take into account what is absolutely needed to achieve the primary requirement (for example, kill-safety) vs. what is desirable (for example, wait-free progress) and what this entails for the other issues.
If the number of threads that can concurrently execute certain operations is limited, then strong progress guarantees can be achieved by using simpler algorithms than when higher levels of concurrency are allowed (for example, single-producer or single-consumer queues vs. multiproducer multiconsumer queues).
An issue that is sometimes missed is the effect of read operations on update operations. Consider two buffer implementations. Both support check (if empty) and add operations. In both implementations, operations are lock-free. The first implementation guarantees that check operations never prevent an add operation from completing. In the other, check operations may interfere with and prevent add operations from completing (for example, as a result of using reference counting, as discussed later).
You can see how problematic the latter implementation is when the act of checking if the buffer is not empty can prevent items from ever being added to the buffer. Therefore, in selecting the appropriate level of progress for a nonblocking operation, it is not enough just to require the implementation, for example, to be lock-free. It is also important to decide which operations must be immune from interference by others. In the buffer example, while add operations are lock-free with respect to each other in both implementations, it is desirable that an individual add operation is wait-free with respect to any number of concurrent check operations.
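In the running LIFO example, for instance, a check operation can be a single atomic read of the list head, making it wait-free with respect to any number of concurrent operations. A sketch, assuming the single-width form of the header (the name IsEmpty is illustrative):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct Node { struct Node *next; } Node;

static Node *_Atomic Head;

/* One atomic load: completes in a constant number of its own steps and
   writes nothing shared, so it can never prevent a Push or Pop from
   completing. */
bool IsEmpty(void) {
    return atomic_load(&Head) == NULL;
}
```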
The choice of data structures is one of the most important decisions in designing a nonblocking environment. This step is often overlooked because data-structure choices may be inherited from sequential or blocking designs. In choosing data structures, designers need to consider the minimum requirements and the range of desirable characteristics.
For example, if lock-free FIFO lists are considered, one should ask whether FIFO order is indeed necessary or if a more relaxed order is acceptable. If it is the latter, then the design may be simplified and performance under certain conditions may be improved by using lock-free LIFO lists instead. The classic LIFO list algorithm has simpler memory safety requirements (as discussed later) and in general has shorter path length than FIFO algorithms.
Another example is that a search structure that may be a good choice in sequential code (for example, balanced trees) may not be the best choice in nonblocking systems compared with hash structures. Also, static structures such as hash tables with open addressing can be simpler to manage than dynamic structures such as hash tables with chaining.
Safe memory reclamation issues. Nonblocking operations, by definition, do not wait for actions by other nonblocking operations and cannot expect to prevent other nonblocking operations from taking actions. This poses a problem for nonblocking operations that dereference pointers to dynamic structures that could be removed and freed by other operations. Without proper precautions, a nonblocking operation may be about to access a dynamic block when the block gets removed from the structure and freed by another operation. This can lead to problematic outcomes: an access violation if the block was freed back to the operating system; corruption of a different structure that happens to have allocated and used the memory of the freed block; or an incorrect result.
Using the LIFO free list (in Figure 1) as an example, one can see that a Pop operation may lead to accessing freed memory. For example, a thread P reads a pointer to address A from the variable Header in line 1 of Pop. Then another thread Q pops the block at A from the list and frees it back to the operating system. Now P proceeds to line 2 of Pop and dereferences the pointer to A, potentially suffering an access violation.
Paul McKenney [13] offers a detailed discussion of safe memory reclamation issues and solutions. The main categories of memory-safety solutions are automatic garbage collection (GC), epoch/RCU-based reclamation, reference counting, and hazard pointers. Under GC, for example, as long as a thread P holds a reference to block A, A is guaranteed not to be freed. Of course, this raises the question of whether the GC itself is nonblocking.
Hardware transactional memory (HTM) may simplify nonblocking safe memory reclamation [4]. As discussed later, however, most upcoming mainstream HTM implementations are best effort and require an alternate non-HTM path.
The options for safe memory reclamation span a range of flexibility, from never reusing removed blocks' memory to reusing it arbitrarily and ultimately returning it to the operating system; the more flexible the reuse, the more difficult the memory-safety solution becomes.
Note that for some algorithms (for example, the classic LIFO list), memory safety might not be a problem if the operating system supports nonfaulting loads. In the scenario just mentioned of thread P reading address A in line 2 of Pop after the block at A was freed to the operating system, a system with nonfaulting loads would allow such a read.
ABA prevention. The ABA problem is common in optimistic concurrency algorithms, including nonblocking ones. The term ABA refers to the change of the value of a shared variable from A to B and back to A again. Using the LIFO-list Pop operation as an example and ignoring the integer packed with Header for now, the following is an ABA problem scenario, starting with a list that includes blocks A and B:
P reads the value A from Header in line 1 and B from *A in line 2.
Other threads pop A and B and then push blocks C and A, leaving Header once again holding the value A.
P performs a CAS operation on Header with parameters A and B in line 3, and the CAS succeeds.
Now the list is corrupted: Header points to B, which is not in the list anymore. What went wrong is that (without the packed integer) P's Pop algorithm cannot differentiate between the case where Header never changed between lines 1 and 3 and this scenario, where the list changed but Header returned to its old value before the CAS in line 3.
The classic LIFO algorithm packs an integer with the pointer in the Header variable and is designed such that the integer will change if the list changes between lines 1 and 3 of Pop (assuming that the integer never overflows).
Packing an integer with the main content of variables vulnerable to the ABA problem is the classic ABA-prevention solution [2]. It remained the only solution for decades, but it has its limitations. It requires wide atomic operations to allow space for an integer large enough not to overflow (at least not within the lifetime of one operation, such as Pop in the example). Another disadvantage of this solution is that it requires the integer subfield to retain its semantics indefinitely. This can make it very difficult to free the memory of dynamic blocks that include variables packed with ABA-prevention integers, which means the memory cannot be coalesced or reused for different purposes.
Although the ABA problem can occur in algorithms that do not use dynamic memory, its solutions are intertwined with safe memory reclamation solutions. First, as already mentioned, the classic ABA solution can hinder memory reclamation. Second, any memory-safety solution (GC, RCU, reference counting, hazard pointers) can be adapted to construct an ABA solution, possibly with an added level of indirection.
An advantage of the RCU type of solution is that traversal operations have almost no overhead, while reference counting and hazard pointers have nontrivial overhead for traversal operations. On the other hand, unlike RCU, reference counting and hazard-pointer methods guarantee bounds on the number of removed blocks that are not ready for reuse.
A disadvantage of reference counting that can be significant in some cases is that it can cause a supposedly read-only operation (such as the check operation mentioned earlier) to actually write to shared data (reference counters) and prevent update operations from ever completing.
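To make this concrete, here is a schematic C11 sketch of why a reference-counted read path is not actually read-only (the Obj type and field names are illustrative):

```c
#include <stdatomic.h>

typedef struct {
    _Atomic int refs;   /* shared reference counter */
    int payload;
} Obj;

/* A "read-only" access under reference counting: acquiring and releasing
   the reference are atomic read-modify-writes to shared data. An update
   operation that must wait for the count to drain before reclaiming or
   reusing the object can be held up indefinitely by a stream of readers. */
int read_payload(Obj *o) {
    atomic_fetch_add(&o->refs, 1);   /* acquire a reference: a shared write */
    int v = o->payload;
    atomic_fetch_sub(&o->refs, 1);   /* release it: another shared write */
    return v;
}
```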
The operations LL (load-linked), SC (store-conditional), and VL (validate) are inherently immune to the ABA problem.
LL(location) returns the value in a shared location.
VL(location) returns true if and only if the shared location has not been written by another thread since the current thread's last LL to it.
SC(location, value) writes a new value to the location if and only if the location has not been written by another thread since the current thread's last LL to it, and it returns a Boolean indicator of whether the write took place. If the read in line 1 of Pop is replaced with LL(&Header) and the CAS in line 3 of Pop is replaced with SC(&Header, n), then the Pop operation would be immune to the ABA problem, without the need for a packed integer.
Actual architectures that support LL/SC (for example, the IBM PowerPC) do not support the full semantics of ideal LL/SC/VL. None supports the VL operation, and all impose restrictions on what can be executed between LL and SC and prohibit the nesting or interleaving of LL/SC pairs on different locations. So, while actual LL/SC support can help the lock-free LIFO algorithm avoid the ABA problem, it is of limited use in preventing the ABA problem in general.
While implementations of ideal LL/SC/VL present an absolute ABA-prevention solution [18], their strong semantics disallow many correct scenarios. For example, all ABA scenarios in the lock-free LIFO-list Push operation are benign. Therefore, it is advisable to consider algorithms expressed in terms of reads and CAS operations, and to address only the harmful cases of ABA, such as Pop but not Push in the lock-free LIFO-list algorithm.
It is advisable to consider ABA prevention and safe memory reclamation solutions together to avoid unnecessary duplication of overhead or solutions that are contradictory or contrary to the overall system requirements.
Portability of required atomic operations. The range of hardware support for atomic operations needed for nonblocking algorithms and methods varies significantly. If portability is an important issue, designers need to take that into account in selecting data structures, algorithms, and supporting methods for safe memory reclamation and management, and ABA prevention.
Hardware support requirements include:
LL/SC (for example, larx/stcx on the IBM Power architecture). These were also shown by Herlihy to be universal operations. As already discussed, they are immune to the ABA problem in some cases in which CAS is susceptible. Therefore, where double-width CAS would otherwise be needed, pointer-size LL/SC may suffice or entail simpler code.
Note that if the number of threads that can concurrently execute certain operations is limited, nonuniversal atomic operations may suffice to design wait-free and lock-free implementations. For example, wait-free single-producer or single-consumer FIFO queue operations [15] (obtained by skipping the appropriate locks in the two-lock algorithm) and single-updater sets with lock-free lookup by multiple threads [16] can be implemented with just atomic loads and stores.
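As a sketch of the kind of algorithm this enables, the following Lamport-style single-producer/single-consumer bounded queue (names and capacity are illustrative) uses only atomic loads and stores, and each operation completes in a constant number of its own steps:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define QSIZE 8  /* usable capacity is QSIZE - 1 */

/* With exactly one producer and one consumer, plain atomic loads and
   stores suffice: head is written only by the consumer, tail only by
   the producer, so neither thread's progress depends on the other. */
typedef struct {
    int buf[QSIZE];
    _Atomic unsigned head;  /* next slot to dequeue; consumer-owned */
    _Atomic unsigned tail;  /* next slot to enqueue; producer-owned */
} SpscQueue;

bool enqueue(SpscQueue *q, int v) {
    unsigned t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    unsigned next = (t + 1) % QSIZE;
    if (next == atomic_load_explicit(&q->head, memory_order_acquire))
        return false;  /* full */
    q->buf[t] = v;
    atomic_store_explicit(&q->tail, next, memory_order_release);
    return true;
}

bool dequeue(SpscQueue *q, int *out) {
    unsigned h = atomic_load_explicit(&q->head, memory_order_relaxed);
    if (h == atomic_load_explicit(&q->tail, memory_order_acquire))
        return false;  /* empty */
    *out = q->buf[h];
    atomic_store_explicit(&q->head, (h + 1) % QSIZE, memory_order_release);
    return true;
}
```

Allowing multiple producers or multiple consumers would immediately reintroduce the need for universal operations such as CAS.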
Choice of language and memory ordering. In addition to the variety of requirements of atomic operations, nonblocking algorithms vary in their requirements of memory-ordering primitives. For example, in the Push operation of the running LIFO-list example (in Figure 1), the write in line 2 must be ordered before the write (in the case of a successful CAS) in line 3. Hardware platforms and high-level programming languages offer explicit primitives, as well as implicit guarantees of ordering among memory accesses, usually specified as the architecture or language memory consistency model.
Prior to Java 5 (2004) and C11/C++11, these languages could not be reliably used (as specified by their standards) to fully express the required memory ordering. Programmers and custom library writers relied on assembly language and machine binary codes to express such ordering.
Now with C11/C++11 and Java (5 and later), programmers can designate some variables as subject to special ordering (volatiles in Java and atomics in C11/C++11). In some cases, these designations can be heavy handed in imposing order around such variables even when not needed by the algorithms. Standard C11/C++11 offers finer levels of ordering that allow the programmer to specify the desired order.
The implementations of such languages have varying overheads on different hardware platforms. Designers of nonblocking implementations should take into account the choice of high-level languages and their memory-ordering performance on the targeted hardware platforms.
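For instance, the ordering requirement in Push noted earlier (the line-2 write before the line-3 CAS) can be expressed with C11's finer-grained orderings instead of the default sequentially consistent ones. This sketch uses a single-width head, since ABA is benign for Push:

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct Node { struct Node *next; } Node;

static Node *_Atomic Head;

void Push(Node *b) {
    Node *h = atomic_load_explicit(&Head, memory_order_relaxed);
    do {
        b->next = h;  /* "line 2": must be visible before b is published */
    } while (!atomic_compare_exchange_weak_explicit(
                 &Head, &h, b,
                 memory_order_release,    /* success: orders the write above
                                             before the publishing CAS */
                 memory_order_relaxed));  /* failure: h was reloaded; retry */
}
```

A thread traversing the list would pair this with memory_order_acquire on its load of Head; using plain atomics (implicitly sequentially consistent) would also be correct but imposes more ordering than the algorithm needs.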
Choice of algorithms. The combined choice of data structures and algorithms is one issue where big compromises can be made to design a system that meets its overall requirements.
Nonblocking algorithms vary in their portability (for example, requirement of special hardware support), reliability (for example, whether or not they are widely used), complexity (for example, reasonable ease of implementation, maintenance, and modification), progress guarantees (for example, wait-free and lock-free), and memory safety and ABA-prevention features (for example, compatibility or incompatibility with certain methods). The choice of algorithms is intertwined with most of the issues discussed here.
The purpose of the following example is to demonstrate the interactions among nonblocking features and issues discussed in this article.
Consider a simplified example of a kill-safe system that requires processes to share large potentially variable-size records (each with a unique key value) that can be deleted or moved arbitrarily among various data structures. Operations on the records and data structures are search, add, delete, and update of certain fields. Figure 2 shows dynamic variable-size records that can be freed and moved among structures.
Considering that the records can be moved back and forth among data structures, any nonblocking algorithm will have to deal with the ABA problem. Since the records can be large and variably sized, limiting the reuse of their memory is not an acceptable option. The classic ABA solution (used in the LIFO-list Pop example) can be excluded, as it would limit arbitrary reuse of memory. Epoch-based solutions should also be excluded because they might become extremely complex under arbitrary termination. But then the remaining solutions, reference counting and hazard pointers, are not acceptable options either: they depend on preventing the records from being reinserted in the same structures (which is a requirement) as long as there are active references to them. There is no acceptable solution.
This may be a good time to consider a compromise. How about adding a separate descriptor to each record? The descriptor holds the key value and any fields needed for traversing the data structures; the fields then may be changed in place without removing and adding the records to the structure. With this compromise the memory of the full records can still be freed, but it might be acceptable to keep descriptors from being freed for arbitrary reuse.
Now let's reconsider the ABA problem. Maybe the classic packed-integer solution can work. This has the advantage of allowing the descriptors (and the associated records) to be moved freely among data structures, but it adds a dependence on hardware support for wide atomic operations (for example, 128-bit CAS). The other options (hazard pointers and reference counting) do not require wide atomic operations but will require the use of a fresh descriptor (that is, one that does not have any old references lingering from prior operations) on every pair of moves of the records between data structures. Furthermore, both hazard pointers and reference counting add overhead (memory ordering and extra atomic operations, respectively). Finally, to deal with the actual allocation and deallocation of the variable-size records, the designer can use a lock-free allocation algorithm with coalescing and the ability to return free memory to the operating system [17].
Now the designer is left with the option of balancing sequential performance and portability.
This case study goes through a few iterations over the various choices, starting with what is absolutely necessary, making a compromise (using descriptors and adding a level of indirection), and finally reaching a reasonable set of options for further consideration of their pros and cons.
This article has presented the key issues that can help designers of nonblocking systems make oft-needed compromises and balance trade-offs to reach a feasible design that satisfies all the requirements. In considering the totality of these issues, designers should go through a few passes on this list of features and issues. First, they should identify and separate the features that are absolutely necessary from those that are desirable but open to compromise. After that, they can start to consider the implications of using these various features.
Thanks to Samy Al Bahra and Paul McKenney for their insightful comments on earlier versions of this article.
Related articles:
Structured Deferral: Synchronization via Procrastination, by Paul E. McKenney
Proving the Correctness of Nonblocking Data Structures
Nonblocking Algorithms and Scalable Multicore Programming, by Samy Al Bahra
3. Cain, H.W., Frey, B., Williams, D., Michael, M.M., May, C. and Le, H. Robust architectural support for transactional memory in the Power architecture. In Proceedings of the 40th ACM/IEEE International Symposium on Computer Architecture (2013).
4. Dragojevic, A., Herlihy, M., Lev, Y. and Moir, M. On the power of hardware transactional memory to simplify memory management. In Proceedings of the 30th Annual ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing (2011), 99–108.
6. Heller, S., Herlihy, M., Luchangco, V., Moir, M., Scherer III, W.N. and Shavit, N. A lazy concurrent list-based set algorithm. In Proceedings of the Ninth International Conference on Principles of Distributed Systems (2005), 3–16.
9. Herlihy, M., Luchangco, V. and Moir, M. Obstruction-free synchronization: double-ended queues as an example. In Proceedings of the 23rd International Conference on Distributed Computing Systems (2003), 522–529.
12. Jacobi, C., Slegel, T.J. and Greiner, D.F. Transactional memory architecture and implementation for IBM System z. In Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture (2012), 25–36.
13. McKenney, P. Structured deferral: synchronization via procrastination. ACM Queue (2013); http://queue.acm.org/detail.cfm?id=2488549
14. McKenney, P.E. and Slingwine, J.D. Read-copy update: using execution history to solve concurrency problems. In Proceedings of the 10th IASTED International Conference on Parallel and Distributed Computing and Systems (1998).
15. Michael, M.M. and Scott, M.L. Simple, fast, and practical nonblocking and blocking concurrent queue algorithms. In Proceedings of the 15th Annual ACM Symposium on Principles of Distributed Computing (1996), 267–275.
21. Wang, A., Gaudet, M., Wu, P., Amaral, J.N., Ohmacht, M., Barton, C., Silvera, R. and Michael, M.M. Evaluation of Blue Gene/Q hardware support for transactional memories. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques (2012), 127–136.
©2013 ACM 0001-0782/13/09