Shared-memory concurrent data structures are pervasive. They are used explicitly in server software such as in-memory databases and key-value stores. Even when software is built with a programming model based on shared-nothing computation or side-effect-free functional programming, we find concurrent data structures at the heart of the implementation in the operating system, the language runtime system, or the garbage collector.
Designing these high-performance concurrent data structures has long been recognized as a difficult task. Not only is it challenging to develop an implementation that is correct, but the underlying hardware is a moving target; techniques that work well on one system may work poorly on another, and techniques that work on today’s systems may work poorly on tomorrow’s.
What better way to simplify this task than to have an automatic technique that generates a concurrent data structure from existing sequential code? That is the goal set by Calciu et al. in their work on Node Replication (NR). In the following article, they show not only that a concurrent data structure can be built automatically, but that its performance is competitive with state-of-the-art designs across a range of important workloads.
NR achieves this good performance by recognizing that the bottleneck for many concurrent data structures lies in their memory accesses, particularly when threads on different sockets are "fighting" over the same cache lines, or when threads on one socket are accessing data stored on a different socket. NR reduces these costs by maintaining a synchronized replica of the data structure on each socket and routing each thread's requests to its own local replica.
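To make the mechanism concrete, the following is a minimal, single-file sketch of the idea rather than the authors' implementation: update operations are appended to a shared operation log, each per-socket replica replays log entries into its own copy, and reads are served from the caller's local replica once it has caught up with the log. The names (OpLog, Replica, Op) and the use of a simple counter are illustrative assumptions.

```rust
use std::sync::{Arc, Mutex};

// Sequential data structure supplied by the data-structure implementer.
#[derive(Default)]
struct Counter {
    value: u64,
}

// Mutating operations; the log records these rather than raw pointers.
#[derive(Clone, Copy)]
enum Op {
    Add(u64),
}

// Shared, append-only operation log (a Mutex stands in for the real,
// more scalable log machinery).
struct OpLog {
    ops: Mutex<Vec<Op>>,
}

// One replica per socket: a local copy plus a cursor into the shared log.
struct Replica {
    log: Arc<OpLog>,
    local: Mutex<(Counter, usize)>, // (local copy, log entries applied so far)
}

impl Replica {
    fn new(log: Arc<OpLog>) -> Self {
        Replica { log, local: Mutex::new((Counter::default(), 0)) }
    }

    // Updates are appended to the shared log, then applied locally.
    fn execute_update(&self, op: Op) {
        self.log.ops.lock().unwrap().push(op);
        self.sync();
    }

    // Reads touch only the local replica once it has caught up with the log.
    fn execute_read(&self) -> u64 {
        self.sync();
        self.local.lock().unwrap().0.value
    }

    // Replay any log entries this replica has not yet applied.
    fn sync(&self) {
        let ops = self.log.ops.lock().unwrap();
        let mut local = self.local.lock().unwrap();
        while local.1 < ops.len() {
            let Op::Add(n) = ops[local.1];
            local.0.value += n;
            local.1 += 1;
        }
    }
}

fn main() {
    let log = Arc::new(OpLog { ops: Mutex::new(Vec::new()) });
    let socket0 = Replica::new(Arc::clone(&log));
    let socket1 = Replica::new(Arc::clone(&log));

    socket0.execute_update(Op::Add(2)); // performed on socket 0
    socket1.execute_update(Op::Add(3)); // performed on socket 1

    // Both replicas observe the same history after syncing with the log.
    assert_eq!(socket0.execute_read(), 5);
    assert_eq!(socket1.execute_read(), 5);
    println!("both replicas read 5");
}
```

In the actual design the log and the per-socket batching of operations are engineered to avoid cross-socket cache-line contention; the Mutex above merely stands in for that machinery.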
This is a surprising and inspiring result, particularly given that building "universal" constructions for concurrent data structures has been an active research field for over 30 years. Prior work has made important theoretical contributions, not least concerning the importance of instruction-set support for atomic operations such as compare-and-swap. However, NR takes that exploration further, both in reaching the point where practical implementations can perform well on some workloads and in illustrating the benefits of "mechanical sympathy" between the techniques in the implementation and the physical structure of the underlying machine.
An exciting implication of this paper is that it provides a clean division of responsibility: the implementer of a data structure is responsible for its correctness and for making it efficient in the absence of concurrency, while the implementer of NR is responsible for building the replication and routing mechanisms efficiently for a particular machine. Improvements to these mechanisms help every data structure that uses them. For example, if a new machine has specialized hardware for sending messages between sockets, that hardware could be exploited within NR without changing the data structures themselves.
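As a rough illustration of that split, consider the sketch below (names such as SequentialDS and NodeReplicated are assumptions, not the paper's API): the data-structure author supplies only sequential code behind a small interface, while the NR author owns the generic replication and routing layer and can retarget it to new hardware without touching any data structure.

```rust
use std::collections::HashMap;

// --- Written by the data-structure implementer: sequential, no concurrency. ---
trait SequentialDS: Default {
    type Op: Clone;
    type Resp;
    fn apply(&mut self, op: Self::Op) -> Self::Resp;
}

#[derive(Default)]
struct KvStore {
    map: HashMap<String, String>,
}

#[derive(Clone)]
enum KvOp {
    Put(String, String),
    Get(String),
}

impl SequentialDS for KvStore {
    type Op = KvOp;
    type Resp = Option<String>;
    fn apply(&mut self, op: KvOp) -> Option<String> {
        match op {
            KvOp::Put(k, v) => self.map.insert(k, v),
            KvOp::Get(k) => self.map.get(&k).cloned(),
        }
    }
}

// --- Written by the NR implementer: generic over any SequentialDS. ---
// For brevity, every operation is naively applied to each replica in turn;
// the real mechanism would route through a shared log instead. The point
// is the interface boundary, not the replication strategy.
struct NodeReplicated<D: SequentialDS> {
    replicas: Vec<D>, // one replica per socket
}

impl<D: SequentialDS> NodeReplicated<D> {
    fn new(sockets: usize) -> Self {
        NodeReplicated { replicas: (0..sockets).map(|_| D::default()).collect() }
    }

    fn execute(&mut self, socket: usize, op: D::Op) -> D::Resp {
        // Keep all replicas consistent, then return the response observed
        // by the caller's local replica (all responses are identical here).
        let mut responses: Vec<D::Resp> = self
            .replicas
            .iter_mut()
            .map(|r| r.apply(op.clone()))
            .collect();
        responses.swap_remove(socket)
    }
}

fn main() {
    let mut nr: NodeReplicated<KvStore> = NodeReplicated::new(2);
    nr.execute(0, KvOp::Put("k".into(), "v".into()));
    assert_eq!(nr.execute(1, KvOp::Get("k".into())), Some("v".into()));
    println!("replicated KvStore works");
}
```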
In the longer term, it is interesting to think about broader applications of the ideas used in NR. One is supporting shared data structures on machines without hardware cache coherence, both at the scale of multiple processors combined in a system-on-chip and in a distributed setting between separate machines connected by a high-performance interconnect.