Multithreaded programs that communicate through shared memory are pervasive. They originally provided a convenient way for an application to perform, for example, a long compute task while remaining responsive to an interactive user. Today they are the most obvious route to using multiple available processor cores for a single task, so the user can benefit from the increased number of available cores.
Unfortunately, a surprising amount of confusion surrounds the basic rules obeyed by shared memory. If a variable is updated by one thread, when will the new value become visible to another thread? What does it mean, if anything, to have two threads updating the same variable at the same time? Do all threads have to see updates in a consistent order?
The confusion surrounding these issues has resulted in many intermittent software bugs, often in low-level libraries that affect large numbers of applications. On at least one occasion, it has resulted in a pervasively used, but usually incorrect, programming idiom. (Try searching for "double-checked locking").
This problem arises at different levels. At the programming language level, there must be clear rules for the programmer’s use of shared variables. Compilers and low-level libraries must enforce these rules by relying on corresponding hardware properties for memory-access instructions—the subject of the following paper.
Most of the early work on sharedmemory semantics focused on the instruction set level and trade-offs with hardware optimization. Roughly concurrently, some older programming language designs, notably Ada 83, made very credible attempts to address the language-level issue. Unfortunately none of this prevented major problems in the most widely used languages from escaping attention until very recently. The Java specification was drastically revised around 2005,2 and has still not completely settled.1,3 Similarly, the C and C++ specifications are being revised to finally address similar issues.1 As a result, hardware architects often could not have a clear picture of the programming language semantics they needed to support, making a fully satisfactory resolution of the hardware-level issues more difficult or impossible.
The recent work on shared variables in programming languages highlighted some remaining questions about hardware memory models. For example, Java’s
volatile requires that updates of all such fields must become visible in the same order to all other threads. Is it even possible to enforce that on common hardware at reasonable cost? What instruction sequences do compilers need to generate? Until 2007 or so, the answers to such questions remained unclear, and sometimes unraised, on several major architectures. X86 is probably the most visible of these architectures.
Raising these issues resulted in extended discussions involving the machine architects. As a result of this process, Intel and AMD have released a sequence of specifications for x86 shared-memory accesses. These are far more precise than they were at the start of the process, and directly address many interesting test cases that arose during the discussions. However, they are still not a precise mathematical description that could be used to, for example, prove that an implementation of Java
volatile is correct.
The x86-TSO model fills that gap, by providing precise mathematical and empirically accurate models of x86 shared memory as it is visible to user programs. These include an operational model presented here in a very intuitive fashion. In the process of making the model precise and directly testing it against existing implementations, Sewell et al. expose new issues not currently addressed by the manufacturers’ specifications, while also confirming their model is compatible with existing implementations.
The examples here are interesting, not just because it may be surprising that there are tiny program fragments (often with only four instructions) for your desktop computer whose meaning is still open to debate, but also because these same small examples are often at the core of important algorithms or software. The first example is an abstraction of Dekker’s mutual exclusion algorithm first described in 1965. The same property is important to many modern lock-free algorithms. We already mentioned the importance of a consistent write visibility ordering for Java
volatiles. The upcoming C and C++ standards introduce an approximate analog,
atomic variables, aspects of which also rely on the issues surrounding single-variable coherence (examples n5 and n4b in the paper) being resolved exactly as suggested by x86-TSO. The paper itself discusses the impact of these issues on Linux spin-lock code.
X86-TSO describes the behavior of x86 memory accesses, which depends on the behavior of processors produced by multiple vendors, and chipsets produced by an even larger number. This helped to put academic researchers into the best position to write such a specification.