What is nonblocking progress? Consider the simple example of incrementing a counter C shared among multiple threads. One way to do so is by protecting the steps of incrementing C by a mutual exclusion lock L (such as acquire(L); old := C; C := old+1; release(L)). If a thread P is holding L, then a different thread Q must wait for P to release L before Q can proceed to operate on C. That is, Q is blocked by P.
Now consider an implementation of the increment operation using the Compare-and-Swap (CAS) atomic primitive. CAS atomically reads a shared location and compares the read value with an expected value. If they are equal, it writes a new value to the location, and it returns an indicator of whether the read value equaled the expected value. In the following implementation, &C is the address of C:
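The implementation itself is not reproduced in this text. The following is a minimal C++ sketch of such a CAS-based increment loop, using std::atomic's compare_exchange_weak in place of CAS and an atomic counter in place of the raw address &C:

```cpp
#include <atomic>

std::atomic<int> C{0};   // the shared counter

// Nonblocking increment: retry the CAS until it succeeds. A failed CAS only
// means that some other increment completed in the meantime; no thread ever
// waits for action by another thread.
void increment() {
    int old = C.load();
    while (!C.compare_exchange_weak(old, old + 1)) {
        // on failure, 'old' is refreshed with the current value of C; retry
    }
}
```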
This implementation is nonblocking because no thread can be blocked by the inaction of other threads (caused by, for example, preemption, a page fault, or even termination), regardless of where the other threads stop.
There are three established levels of nonblocking progress: obstruction-free, lock-free, and wait-free. The weakest is obstruction freedom. An operation is obstruction-free if it is guaranteed to complete in a finite number of steps when running alone.9 An obstruction-free operation does not need to wait for actions by other threads, regardless of where they have stopped. An obstruction-free operation, however, may starve or end up in a livelock scenario with one or more concurrent operations where the actions of the threads prevent them all from making progress.
Lock-free progress combines obstruction-free progress with livelock freedom. That is, in a system with only lock-free operations, whenever an operation takes a finite number of steps, some operation (maybe a different one) must have completed during that period.
Finally, wait-free progress7 combines lock-free progress with starvation-freedom. That is, a wait-free operation is guaranteed to complete in a finite number of its own steps, regardless of the actions or inaction of other operations.
It is worth noting that these levels of progress guarantee require primitive memory-access steps to have at least the same level of progress guarantee at the hardware-arbitration and cache-coherence levels. For example, a cache-coherence protocol that may cause some primitive memory accesses to be retried indefinitely cannot support a wait-free algorithm that uses such primitives.
Uses of Nonblocking Operations
Nonblocking operations are often used for systems or interthread interactions where having threads wait for actions by other threads makes the system vulnerable to deadlock, livelock, and/or prolonged delays. Examples of such uses are:
- Async signal safety. This allows signal handlers of asynchronous signals to share data or services with the interrupted thread, with a guarantee that the signal handler never needs to wait for actions by the interrupted thread—which, as a result of being interrupted, cannot run until the signal handler completes. Nonblocking operations used in particular by the signal handlers can guarantee the absence of such waiting and provide async signal safety.
- Kill-safe systems. Such systems guarantee availability even if processes may be terminated at arbitrary points. This requirement arises in server systems, where the ability to terminate processes representing client requests allows high server throughput. Such systems must guarantee availability even when processes may be terminated at arbitrary points while operating on shared structures or services. When using nonblocking operations in such a system, the remaining processes never have to wait for action—that will never happen—by a terminated process.
- Soft real-time applications on commodity systems. Using nonblocking operations on such systems (for example, media players) helps avoid priority inversion and provides preemption tolerance. That is, it eliminates the possibility that an active thread—potentially executing a high-priority task—is blocked awaiting action by other threads that are delayed for a long time (say, because of page faults). This helps make long noticeable delays highly unlikely in such systems.
Selecting Nonblocking Features
Unlike the example of the shared counter, almost any nonblocking operation that involves more than one location presents multiple considerations that are often at odds. This article examines trade-offs and compromises that have to be considered in selecting features of nonblocking operations. These trade-offs are seldom straightforward and sometimes require compromises in order to achieve the bare minimum of system requirements. Balancing these trade-offs can be daunting if you start unaware of all the potential pitfalls.
The goals of this article are to walk the reader through the various issues and features of nonblocking operations and the various choices to be made; to understand the interactions among these options; and to develop a sense of the best path to take to disentangle the interdependencies among these choices and quickly prune options that are incompatible with the main objectives of using nonblocking operations in the first place.
For some uses of nonblocking operations, such as kill-safety, the need for nonblocking progress likely involves operations on multiple data structures and services. In these cases, one has to consider the totality of the interdependence of the features of the various nonblocking components to reach an acceptable solution.
A Running Example
Figure 1 presents a variant of the classic lock-free IBM LIFO (last-in first-out) free-list algorithm10 as a running example. The variable Header consists of a packed pointer-integer pair that can be read and operated on atomically by a double-width CAS. The integer size is assumed to be large enough that it never overflows. The notation *b is used to indicate the location pointed to by a pointer b. Depending on the context, reads and CAS operations on Header are either single- or double-width.
Ignoring for now why an integer is packed with the pointer in Header (explained later), the reader can notice that the inaction of any number of threads stopped anywhere in the Push and Pop operations cannot block other threads from operating on the list. In fact, since Push and Pop are not only nonblocking but also lock-free, as long as there are active threads attempting these operations, some operation will complete. Whenever any thread takes five steps in one of these operations, some operation must have completed (by the same or a different thread) during that period.
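Figure 1 is not reproduced here. The following C++ sketch is an illustrative reconstruction of such a list, not the figure itself: it packs a pointer and a counter into Header, and the comments mark the steps that the rest of this article refers to as lines 1-3 of Push and Pop. For simplicity the sketch uses double-width operations on Header throughout, and whether std::atomic of the pointer-counter pair is actually lock-free depends on the platform and compiler options.

```cpp
#include <atomic>
#include <cstdint>

struct Node { Node* next; /* payload ... */ };

// Packed pointer-integer pair, operated on by a double-width CAS.
struct HeaderT { Node* ptr; std::uintptr_t count; };

std::atomic<HeaderT> Header{HeaderT{nullptr, 0}};

void Push(Node* b) {
    HeaderT h = Header.load();                                    // line 1: read Header
    do {
        b->next = h.ptr;                                          // line 2: link *b to the current top
    } while (!Header.compare_exchange_weak(h, HeaderT{b, h.count}));  // line 3: publish b
}

Node* Pop() {
    HeaderT h = Header.load();                                    // line 1: read pointer and counter
    while (h.ptr != nullptr) {
        Node* n = h.ptr->next;                                    // line 2: dereference the top block
        if (Header.compare_exchange_weak(h, HeaderT{n, h.count + 1}))  // line 3: swing Header to n
            return h.ptr;                                         // the popped block
    }
    return nullptr;                                               // list was empty
}
```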
Key Issues in Selecting Nonblocking Features
The following are the key issues to consider in selecting the features of nonblocking operations.
Levels of progress guarantee. The appropriate level of nonblocking progress (for example, obstruction-free vs. lock-free) and the extent of use of nonblocking operations depend on the progress guarantees required by the system.
To achieve async signal safety using nonblocking operations, only the signal handler’s operations need to be nonblocking, while operations by the main threads can be blocking. For example, where signal handlers search hash tables that may be updated by interruptible threads, a fully nonblocking algorithm is not necessary; an algorithm with nonblocking search operations and blocking add and remove operations (as described by Heller et al.6) can be sufficient.
If the only concern is the interaction between the signal handler and the interrupted thread, then using obstruction-free operations is sufficient. Kill-safety tends to require broader use of nonblocking operations to access objects shared by processes that are subject to arbitrary termination. Similar to async signal safety, the use of obstruction-free operations suffices to provide availability. Using lock-free or wait-free operations guarantees livelock freedom or starvation freedom, respectively.
For avoiding priority inversion in soft real-time applications, it may suffice to make the high-priority operation obstruction-free; however, stronger progress guarantees (lock-free or wait-free) are clearly desirable in this context.
Wait-free algorithms tend to entail space at least linear in the number of threads that may operate concurrently.1 Aside from the space overheads, some recent wait-free algorithms show reasonably competitive performance.5,19 Lock-free algorithms can achieve competitive performance with blocking algorithms and do not have inherently high space requirements. There are hardly any custom obstruction-free algorithms for specific data structures that are not lock-free; however, obstruction-free algorithms for arbitrary atomic sections (or transactional memory) are evidently simpler than corresponding lock-free algorithms.
Accordingly, in choosing the appropriate level of nonblocking progress to use, you must take into account what is absolutely needed to achieve the primary requirement (for example, kill-safety) vs. what is desirable (for example, wait-free progress) and what this entails for the other issues.
If the number of threads that can concurrently execute certain operations is limited, then strong progress guarantees can be achieved by using simpler algorithms than when higher levels of concurrency are allowed (for example, single-producer or single-consumer queues vs. multiproducer multiconsumer queues).
An issue that is sometimes missed is the effect of read operations on update operations. Consider two buffer implementations. Both support check (if empty) and add operations. In both implementations, operations are lock-free. The first implementation guarantees that check operations never prevent an add operation from completing. In the other, check operations may interfere with and prevent add operations from completing (for example, as a result of using reference counting, as discussed later).
You can see how problematic the latter implementation is when the act of checking if the buffer is not empty can prevent items from ever being added to the buffer. Therefore, in selecting the appropriate level of progress for a nonblocking operation, it is not enough just to require the implementation, for example, to be lock-free. It is also important to decide which operations must be immune from interference by others. In the buffer example, while add operations are lock-free with respect to each other in both implementations, it is desirable that an individual add operation is wait-free with respect to any number of concurrent check operations.
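To make the distinction concrete, the following sketch (using an atomic head pointer in the spirit of the running LIFO example, not either of the buffer implementations above) shows a check operation that is wait-free and writes nothing, so it can never prevent a concurrent add from completing:

```cpp
#include <atomic>

struct Node { Node* next; };
std::atomic<Node*> Top{nullptr};   // head of a LIFO list

// A check that only loads the shared head is wait-free with respect to any
// number of concurrent operations, and because it writes nothing it cannot
// invalidate an adder's CAS. A reference-counted check, by contrast, writes
// to shared counters and can interfere with updates.
bool is_empty() {
    return Top.load() == nullptr;
}
```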
Choice of data structures. The choice of data structures is one of the most important decisions in designing a nonblocking environment. This step is often overlooked because data-structure choices may be inherited from sequential or blocking designs. In choosing data structures, designers need to consider the minimum requirements and the range of desirable characteristics.
For example, if lock-free FIFO lists are considered, one should ask whether FIFO order is indeed necessary or if a more relaxed order is acceptable. If it is the latter, then the design may be simplified and performance under certain conditions may be improved by using lock-free LIFO lists instead. The classic LIFO list algorithm has simpler memory safety requirements (as discussed later) and in general has shorter path length than FIFO algorithms.
Another example is that a search structure that may be a good choice in sequential code (for example, balanced trees) may not be the best choice in nonblocking systems compared with hash structures. Also, static structures such as hash tables with open addressing can be simpler to manage than dynamic structures such as hash tables with chaining.
Safe memory reclamation issues. Nonblocking operations, by definition, do not wait for actions by other nonblocking operations and cannot expect to prevent other nonblocking operations from taking actions. This poses a problem for nonblocking operations that dereference pointers to dynamic structures that could be removed and freed by other operations. Without proper precautions a nonblocking operation may be about to access a dynamic block when the block gets removed from the structure and freed by another operation. This could lead to problematic outcomes such as an access violation if the block were freed back to the operating system, corrupting a different structure that happens to allocate and use the memory of the freed block, or returning an incorrect result.
Using the LIFO free list (in Figure 1) as an example, one can see that a Pop operation may lead to accessing freed memory. For example, a thread P reads a pointer to address A from the variable Header in line 1 of Pop. Then, another thread Q pops the block at A from the list and frees it back to the operating system. Now P proceeds to line 2 in Pop and dereferences the pointer to A, potentially suffering an access violation.
Paul McKenney13 offers a detailed discussion of safe memory reclamation issues and solutions. The following is a brief list of categories of memory-safety solutions and their implications:
- Automatic GC (garbage collection). On systems with GC, such as Java applications, memory safety is implicitly guaranteed. In the preceding example, as long as thread P holds a reference to block A, block A is guaranteed not to be freed. Of course, this raises the question of whether GC itself is nonblocking.
- RCU (read-copy-update) and epoch-based collectors. Briefly, RCU-like solutions guarantee that blocks removed from certain data structures are not freed until it can be established that all threads have reached quiescence points where it is impossible that any threads can still be holding references to such blocks.14
- Reference counting. Blocks are associated with counters that are incremented and decremented as threads acquire and release references to such blocks. Typically, a block is freed only when its reference count goes to zero and it is guaranteed that no new references to it can be created.20
- Hazard pointers. Briefly, in this method each thread that may traverse blocks that may be removed and freed owns a number of hazard pointers. Before dereferencing a pointer, a thread sets one of its hazard pointers to the pointer value. Other threads that may remove the block guarantee that they will not free the block until no hazard pointers point to it.8,16
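As an illustration of the hazard-pointer pattern, here is a simplified C++ sketch of the read side of a Pop on a plain LIFO list. It is not the algorithm of the cited papers: slot management, retirement lists, and the scan a reclaimer performs before freeing are all omitted, and ABA prevention (discussed below) is a separate concern.

```cpp
#include <atomic>
#include <cstddef>

struct Node { Node* next; };

constexpr std::size_t kMaxThreads = 64;    // assumed bound, for the sketch only
std::atomic<Node*> Top{nullptr};
std::atomic<Node*> hazard[kMaxThreads];    // one slot per thread; reclaimers scan these

Node* Pop(std::size_t tid) {
    for (;;) {
        Node* t = Top.load();
        if (t == nullptr) return nullptr;
        hazard[tid].store(t);              // announce the intent to dereference t
        if (Top.load() != t) continue;     // t may already have been retired; retry
        Node* n = t->next;                 // safe: t cannot be freed while hazard[tid] points to it
        bool popped = Top.compare_exchange_strong(t, n);
        hazard[tid].store(nullptr);
        if (popped) return t;              // the caller retires t; it is freed only when
    }                                      // no thread's hazard slot points to it
}
```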
Hardware transactional memory may simplify nonblocking safe memory reclamation.4 As discussed later, however, most upcoming mainstream HTM implementations are best effort and require an alternate non-HTM path.
The options for safe memory reclamation in order of increasing flexibility (and difficulty of memory-safety solutions) are:
- Freed blocks will be reused but never freed for different uses.
- Freed blocks can be coalesced and reused for different types and sizes but not returned to the operating system.
- Freed blocks may be coalesced, reused arbitrarily, or returned to the operating system.
Note that for some algorithms (for example, the classic LIFO list), memory safety might not be a problem if the operating system supports nonfaulting loads. In the scenario just mentioned, of thread P reading address A in line 2 of Pop after the block at A was freed to the operating system, a system with nonfaulting loads would allow such a read.
ABA prevention. The ABA problem is common in optimistic concurrency algorithms, including nonblocking ones. The term ABA refers to the change of the value of a shared variable from A to B and back to A again. Using the LIFO-list Pop operation as an example and ignoring the integer packed with Header for now, the following is an ABA problem scenario, starting with a list that includes blocks A and B:
- A thread P reads the value A from Header in line 1 and B from *A in line 2.
- Other threads pop blocks A and B and push blocks C and A, leaving Header holding the value A again.
- P performs a CAS operation on Header with parameters A and B in line 3, and the CAS succeeds.
Now the list is corrupted: Header points to B, which is not in the list anymore. What went wrong is that (without the packed integer) P's Pop algorithm cannot differentiate between the case where Header never changed between lines 1 and 3 and this scenario, where the list changed but Header returned to its old value before the CAS in line 3.
The classic LIFO algorithm packs an integer with the pointer in the Header variable and is designed such that the counter will change if the list changes between lines 1 and 3 of Pop (assuming that the counter never overflows).
Packing an integer with the main content of variables vulnerable to the ABA problem is the classic ABA prevention solution.2 It remained the only solution for decades, but it has its limitations. It requires wide atomic operations to allow space for a large enough integer that cannot overflow (at least cannot overflow in the lifetime of one operation, such as Pop in the example). Another disadvantage of this solution is that it requires the integer subfield to retain its semantics indefinitely. This can make it very difficult to free the memory of dynamic blocks that include variables packed with ABA-prevention counters, thus meaning memory cannot be coalesced or reused for different purposes.
Although the ABA problem can occur in algorithms that do not use dynamic memory, its solutions are intertwined with safe memory reclamation solutions. First, as already mentioned, the classic ABA solution can hinder memory reclamation. Second, any memory-safety solution (GC, RCU, reference counting, hazard pointers) can be adapted to construct an ABA solution, possibly with an added level of indirection.
An advantage of the RCU type of solution is that traversal operations have almost no overhead, while reference counting and hazard pointers have nontrivial overhead for traversal operations. On the other hand, unlike RCU, reference counting and hazard-pointer methods guarantee bounds on the number of removed blocks that are not ready for reuse.
A disadvantage of reference counting that can be significant in some cases is that it can cause a supposedly read-only operation (such as the check operation mentioned earlier) to actually write to shared data (reference counters) and prevent update operations from ever completing.
The operations LL (load-linked), SC (store-conditional), and VL (validate) are inherently immune to the ABA problem. LL(location) returns the value in a shared location. VL(location) returns a Boolean indicating whether the shared location has remained unwritten by other threads since the current thread's last LL to it. SC(location, value) writes a new value to the location if and only if it has not been written by another thread since the current thread's last LL to it, and returns a Boolean indicating whether the write took place. If the read in line 1 of Pop is replaced with LL(&Header) and the CAS in line 3 of Pop is replaced with SC(&Header, n), then the Pop operation is immune to the ABA problem, without the need for a packed integer.
Actual architectures that support LL/SC (for example, IBM PowerPC) do not support the full semantics of ideal LL/SC/VL. None supports the VL operation, and all impose restrictions on what can be executed between LL and SC and prohibit the nesting or interleaving of LL/SC pairs on different locations. So, while actual LL/SC support can help the lock-free LIFO algorithm avoid the ABA problem, it is limited in preventing the ABA problem in general.
While implementations of ideal LL/SC/VL present an absolute ABA prevention solution,18 their strong semantics disallow many correct scenarios. For example, all ABA scenarios in the lock-free LIFO-list Push operation are benign. Therefore, it is advisable to consider algorithms expressed in terms of reads and CAS operations, and address only the harmful cases of ABA, such as Pop but not Push in the lock-free LIFO-list algorithm.
It is advisable to consider ABA prevention and safe memory reclamation solutions together to avoid unnecessary duplication of overhead or solutions that are contradictory or contrary to the overall system requirements.
Portability of required atomic operations. The range of hardware support for atomic operations needed for nonblocking algorithms and methods varies significantly. If portability is an important issue, designers need to take that into account in selecting data structures, algorithms, and supporting methods for safe memory reclamation and management, and ABA prevention.
Hardware support requirements include:
- No support. For example, the reading and writing of hazard pointers can be nonatomic.16
- Nonuniversal atomic operations (such as fetch-and-add, test-and-set, and atomic swap). Maurice Herlihy showed it is impossible to design wait-free (and lock-free) implementations of certain data types that can be operated on by an arbitrary number of concurrent threads using only such (nonuniversal) operations.7
- Compare-and-swap. Herlihy showed that CAS is a universal operation and can be used to design implementations of any data type with wait-free operations without limitation on the maximum number of threads that operate on it concurrently. Pointer-size CAS may suffer from the ABA problem. The classic solution to that problem requires the use of wider atomic operations (for example, double-width load and CAS primitives).
- The pair LL and SC (for example, larx and stcx on the IBM Power architecture). These were also shown by Herlihy to be universal operations. As already discussed, they are immune to the ABA problem in some cases in which CAS is susceptible. Therefore, pointer-size LL/SC may suffice, or entail simpler code, where double-width CAS would be needed in the absence of LL/SC.
- Hardware transactional memory. Recently, IBM (Blue Gene/Q,21 System Z,12 and Power3) and Intel11 architectures have offered hardware support for arbitrary memory transactions. However, most of these HTM architectures (except IBM System Z) are best effort; that is, they require programmers to provide a nontransactional path in case the hardware transactions never succeed.
Note that if the number of threads that can concurrently execute certain operations is limited, nonuniversal atomic operations may suffice to design wait-free and lock-free implementations. For example, wait-free single-producer or single-consumer FIFO queue operations15 (by skipping the appropriate locks in the two-lock algorithm) and single-updater sets with lock-free lookup by multiple threads16 can be implemented with just atomic loads and stores.
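As an illustration, the following sketch (an assumption about one reasonable shape, not the algorithm of reference 15) shows a bounded single-producer/single-consumer FIFO queue whose operations are wait-free and use only atomic loads and stores:

```cpp
#include <atomic>
#include <cstddef>

// Single-producer/single-consumer ring buffer: push is called by one thread,
// pop by one other thread. Each operation completes in a bounded number of
// its own steps and uses no CAS or other universal atomic operation.
template <typename T, std::size_t N>
class SpscQueue {
    T buf[N];
    std::atomic<std::size_t> head{0};   // written only by the consumer
    std::atomic<std::size_t> tail{0};   // written only by the producer
public:
    bool push(const T& v) {
        std::size_t t = tail.load(std::memory_order_relaxed);
        std::size_t next = (t + 1) % N;
        if (next == head.load(std::memory_order_acquire)) return false;   // full
        buf[t] = v;
        tail.store(next, std::memory_order_release);                      // publish the element
        return true;
    }
    bool pop(T& out) {
        std::size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire)) return false;      // empty
        out = buf[h];
        head.store((h + 1) % N, std::memory_order_release);               // free the slot
        return true;
    }
};
```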
Choice of language and memory ordering. In addition to the variety of requirements of atomic operations, nonblocking algorithms vary in their requirements of memory-ordering primitives. For example, in the Push operation of the running LIFO-list example (in Figure 1), the write in line 2 must be ordered before the write (in the case of a successful CAS) in line 3. Hardware platforms and high-level programming languages offer explicit primitives, as well as implicit guarantees of ordering among memory accesses, usually specified as the architecture or language memory consistency model.
Prior to Java 5 (2004) and C11/C++11, these languages could not be reliably used (as specified by their standards) to fully express the required memory ordering. Programmers and custom library writers relied on assembly language and machine binary codes to express such ordering.
Now with C11/C++11 and Java (5 and later), programmers can designate some variables as subject to special ordering (volatiles in Java and atomics in C11/C++11). In some cases, these designations can be heavy-handed in imposing order around such variables even when it is not needed by the algorithms. Standard C11/C++11 offers finer levels of ordering that allow the programmer to specify the desired order.
The implementations of such languages have varying overheads on different hardware platforms. Designers of nonblocking implementations should take into account the choice of high-level languages and their memory-ordering performance on the targeted hardware platforms.
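As an example, the ordering requirement in Push can be expressed with the C11/C++11 finer-grained orderings rather than default sequentially consistent accesses. The sketch below is one plausible way to write it (a single-width variant of Push, assumed for illustration), using release ordering on the publishing CAS; readers that later follow the pointer would pair it with acquire loads.

```cpp
#include <atomic>

struct Node { Node* next; };
std::atomic<Node*> Top{nullptr};

void Push(Node* b) {
    Node* t = Top.load(std::memory_order_relaxed);
    do {
        b->next = t;                               // "line 2": must be visible before the publish
    } while (!Top.compare_exchange_weak(
                 t, b,
                 std::memory_order_release,        // "line 3": publish b and its next link
                 std::memory_order_relaxed));      // ordering on failure can stay relaxed
}
```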
Choice of algorithms. The combined choice of data structures and algorithms is one issue where big compromises can be made to design a system that meets its overall requirements.
Nonblocking algorithms vary in their portability (for example, requirement of special hardware support), reliability (for example, whether or not they are widely used), complexity (for example, reasonable ease of implementation, maintenance, and modification), progress guarantees (for example, wait-free and lock-free), and memory safety and ABA-prevention features (for example, compatibility or incompatibility with certain methods). The choice of algorithms is intertwined with most of the issues discussed here.
Case Study
The purpose of the following example is to demonstrate the interactions among nonblocking features and issues discussed in this article.
Consider a simplified example of a kill-safe system that requires processes to share large, potentially variable-size records (each with a unique key value) that can be deleted or moved arbitrarily among various data structures. Operations on the records and data structures are search, add, delete, and update of certain fields. Figure 2 shows dynamic variable-size records that can be freed and moved among structures.
Considering that the records can be moved back and forth among data structures, any nonblocking algorithm will have to deal with the ABA problem. Since the records can be large and variably sized, limiting the reuse of their memory is not an acceptable option. The classic ABA solution (used in the LIFO-list Pop example) can be excluded, as it will limit arbitrary reuse of memory. Epoch-based solutions should also be excluded because they might get extremely complex under arbitrary termination. But then, the remaining solutions—reference counting and hazard pointers—are not acceptable options either. They depend on preventing the records from being reinserted in the same structures (which is a requirement) as long as there are active references to them. There is no acceptable solution.
This may be a good time to consider a compromise. How about adding a separate descriptor to each record? The descriptor holds the key value and any fields needed for traversing the data structures; the fields then may be changed in place without removing and adding the records to the structure. With this compromise the memory of the full records can still be freed, but it might be acceptable to keep descriptors from being freed for arbitrary reuse.
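A minimal sketch of what such a record/descriptor split might look like follows; the names and fields are illustrative assumptions, not taken from the article.

```cpp
#include <atomic>
#include <cstdint>

struct Record;                            // large, variable-size payload; freed and reused arbitrarily

// Small fixed-size descriptor: the data structures link descriptors, not
// records, so records can move between structures by updating descriptor
// fields in place. Descriptors may be kept out of arbitrary reuse.
struct Descriptor {
    std::uint64_t key;                    // unique key value of the record
    std::atomic<Descriptor*> next;        // traversal field used by the containing structure
    std::atomic<Record*> record;          // the associated record (may be replaced or freed)
};
```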
Now let’s reconsider the ABA problem. Maybe the classic packed-integer solution can work. This has the advantage of allowing the descriptors (and the associated records) to be moved freely among data structures, but it adds a dependence on hardware support for wide atomic operations (for example, 128-bit CAS). The other options (hazard pointers and reference counting) do not require wide atomic operations but will require the use of a fresh descriptor (that is, one that does not have any old references lingering from prior operations) on every pair of moves of the records between data structures. Furthermore, both hazard pointers and reference counting add overhead (memory ordering and extra atomic operations, respectively). Finally, to deal with the actual allocation and deallocation of the variable-size records, the designer can use a lock-free allocation algorithm with coalescing and the ability to return free memory to the operating system.17
Now the designer is left with the option of balancing sequential performance and portability.
This case study goes through a few iterations over the various choices, starting with what is absolutely necessary, making a compromise (using descriptors and adding a level of indirection), and finally reaching a reasonable set of options for further consideration of their pros and cons.
Summary
This article has presented the key issues that can help designers of nonblocking systems make oft-needed compromises and balance trade-offs to reach a feasible design that satisfies all the requirements. In considering the totality of these issues, designers should go through a few passes on this list of features and issues. First, they should identify and separate the features that are absolutely necessary from those that are desirable but open to compromise. After that, they can start to consider the implications of using these various features.
Acknowledgments
Thanks to Samy Al Bahra and Paul McKenney for their insightful comments on earlier versions of this article.
Related articles
on queue.acm.org
Structured Deferral: Synchronization via Procrastination
Paul E. McKenney
http://queue.acm.org/detail.cfm?id=2488549
Proving the Correctness of Nonblocking Data Structures
Mathieu Desnoyers
http://queue.acm.org/detail.cfm?id=2490873
Nonblocking Algorithms and Scalable Multicore Programming
Samy Al Bahra
http://queue.acm.org/detail.cfm?id=2492433