Computing Applications Research highlights

Efficient System-Enforced Deterministic Parallelism

By Amittai Aviram, Shu-Chun Weng, Sen Hu, and Bryan Ford

Posted May 1 2012

Abstract
1. Introduction
2. A Deterministic Programming Model
3. The Determinator Kernel
4. High-Level Parallel Abstractions
5. Evaluation
6. Conclusion
Acknowledgments
References
Authors
Footnotes
Figures
Tables

Deterministic execution offers many benefits for debugging, fault tolerance, and security. Current methods of executing parallel programs deterministically, however, often incur high costs, allow misbehaved software to defeat repeatability, and transform time-dependent races into input-or path-dependent races without eliminating them. We introduce a new parallel programming model addressing these issues, and use Determinator, a proof-of-concept OS, to demonstrate the model’s practicality. Determinator’s microkernel application programming interface (API) provides only “shared-nothing” address spaces and deterministic interprocess communication primitives to make execution of all unprivileged code—well-behaved or not—precisely repeatable. Atop this microkernel, Determinator’s user-level runtime offers a private workspace model for both thread-level and process-level parallel programming. This model avoids the introduction of read/write data races, and converts write/write races into reliably detected conflicts. Coarse-grained parallel benchmarks perform and scale comparably to non-deterministic systems, both on multicore PCs and across nodes in a distributed cluster.

1. Introduction

The multicore revolution has made parallel programming both ubiquitous and unavoidable, but building reliable parallel software is difficult and error-prone. Mainstream parallel programming models introduce pervasive risks for timing-dependent races, leading to nondeterministic “heisenbugs.” Slight synchronization bugs often cause nondeterministic races between threads accessing shared memory.^{2, 20, 21} Concurrent processes can similarly race when passing messages or accessing files in a shared file system, yielding not just bugs but also security vulnerabilities.²⁵

We would often prefer to run software deterministically, so that from a given input it always produces the same output.²⁰ Beyond mere convenience considerations, deterministic execution is also a basic virtualization tool: a foundation required to implement replay debugging,¹⁹ fault tolerance,¹¹ and accountability mechanisms.¹⁶ Methods of intrusion analysis¹² and timing channel control⁴ further require the system to enforce determinism, even on malicious code that might be designed to evade analysis.

User-space techniques for parallel deterministic execution^{8, 22} show promise but have limitations. First, by relying on a deterministic scheduler residing in the application process, they permit buggy or malicious applications to compromise determinism by interfering with the scheduler. Second, deterministic schedulers emulate conventional APIs by synthesizing a repeatable—but arbitrary—schedule of inter-thread interactions, often using an instruction counter as an artificial time metric. Therefore, data races remain, but their manifestation depends subtly on inputs and code path lengths instead of on “real” time. Third, the user-level instrumentation required to isolate and schedule threads’ memory accesses can incur considerable overhead, even on coarse-grained code that synchronizes rarely.

To meet the challenges presented by ubiquitous parallelism, we may need to rethink standard nondeterministic parallel programming models instead of merely shoehorning them into artificial execution schedules. To this end, we present Determinator, a proof-of-concept OS designed to make determinism the norm,²⁰ avoiding data races at either memory access²¹ or higher semantic levels.² Owing to its OS-level approach, Determinator supports existing languages, can enforce deterministic execution not only on a single process but on groups of interacting processes, and can prevent malicious user-level code from subverting the kernel’s guarantee of determinism. In order to explore the design space freely, Determinator takes a “clean-slate” approach, offering only limited compatibility with basic Unix process, thread, and file APIs. To improve backward compatibility, Determinator’s environment could be implemented in a “sandbox,” extending a legacy kernel.⁹

Determinator’s kernel denies user code direct access to hardware resources whose use can yield nondeterministic behavior, including real-time clocks, cycle counters, and writable shared memory. User code runs within a hierarchy of single-threaded, process-like spaces, each running in a private virtual address space and capable of communicating only with its immediate parent and children. Potentially useful sources of nondeterminism, such as timers, which the kernel represents as I/O devices, can be accessed by unprivileged spaces only via explicit communication with more privileged spaces. A supervisory space can thus mediate all nondeterministic inputs affecting a subtree of unprivileged spaces, logging true nondeterministic events for future replay or synthesizing artificial events, for example.

Atop this kernel API, Determinator’s user-level runtime emulates familiar shared-resource programming abstractions. The runtime employs file replication and versioning²⁴ to offer applications a logically shared file system accessed via the Unix file API, and adapts distributed shared memory techniques¹ to emulate shared memory for multithreaded applications. Since this emulation is implemented in user space, applications can freely customize it, and runtime bugs cannot compromise the kernel’s guarantee of determinism.

Rather than emulating conventional parallel memory models, Determinator explores a private workspace model.³ In this model, each thread keeps a private virtual replica of all shared memory and file system state; normal reads and writes access and modify this working copy. Threads reconcile their changes only at program-defined synchronization points, much as developers use version control systems. This model eliminates read/write data races, because reads see only causally prior writes in the explicit synchronization graph, and write/write races become conflicts, which the runtime reliably detects and reports independently of any (real or synthetic) execution schedule.

Experiments with common parallel benchmarks suggest that Determinator can run coarse-grained parallel applications deterministically with both performance and scalability comparable to nondeterministic environments. Determinism incurs a high cost on fine-grained parallel applications, however, because of Determinator’s use of virtual memory to isolate threads. For “embarrassingly parallel” applications requiring little inter-thread communication, Determinator can distribute the computation across nodes in a cluster mostly transparently to the application, maintaining usable performance and scalability. The prototype still has many limitations to be addressed in future work, such as a restrictive space hierarchy, limited file system size, no persistent storage, and inefficient cross-node communication.

2. A Deterministic Programming Model

Determinator’s goal is to offer a programming model that is naturally and pervasively deterministic. To be naturally deterministic, the model’s basic abstractions should avoid introducing data races or other nondeterministic behavior in the first place, not merely provide ways to control, detect, or reproduce races. To be pervasively deterministic, the model should behave deterministically at all levels of abstraction: for example, for shared memory access, inter-thread synchronization, file system access, interprocess communication, external device or network access, and thread/process scheduling.

Many intermediate design points may yield useful tradeoffs: enforcing determinism only on synchronization and not on memory accesses can improve efficiency, for example.²² For now, however, we explore the feasibility of a “purist” approach to pervasive determinism. To achieve this goal, we found we had to address at least four sources of nondeterminism in current systems: nondeterministic inputs that applications may require for operation; shared state such as memory and file systems; synchronization APIs that threads and processes use to coordinate; and namespaces through which applications use and manage system resources.

2.1. Explicit nondeterministic inputs

Many applications use nondeterministic inputs, such as incoming messages for a Web server, timers for interactive applications, and random numbers for cryptographic algorithms. We seek not to eliminate such nondeterministic inputs but to make relevant inputs explicit and controllable.

Mechanisms for parallel debugging,¹⁹ fault tolerance,¹¹ accountability,¹⁶ and intrusion analysis¹² all rely on the ability to replay a computation instruction-for-instruction, in order to replicate, verify, or analyze a program’s execution history. Replay can be efficient when only I/O need be logged, as for a uniprocessor virtual machine,¹² but becomes much more costly if internal sources of nondeterminism due to parallelism must also be replayed.¹³

Determinator transforms useful sources of nondeterminism into explicit I/O, which applications access via controllable channels, and eliminates only internal non-determinism resulting from parallelism. If an application calls gettime-ofday(), for example, a supervising process can intercept this I/O to log, replay, or synthesize these time “inputs.”

2.2. A race-free model for shared state

Conventional systems give threads concurrent access to many forms of shared state, such as shared memory and file systems, yielding data races and heisenbugs if the threads are improperly synchronized.^{20, 21} Replay debuggers¹⁹ and deterministic schedulers^{8, 22} make data races reproducible once they manifest, but do not change the inherently race-prone model in which developers write applications.

Determinator replaces the standard concurrent access model with a private workspace model,³ which avoids data races in the first place. This model gives each thread a private virtual replica of all logically shared states, including shared memory and file system state. A thread’s normal reads and writes affect only its private working state, and do not interact directly with other threads. Instead, Determinator accumulates each thread’s changes to the shared state, and then reconciles these changes among threads only at program-defined synchronization points. This model is related to early parallel Fortran systems,⁶ replicated file systems,²⁴ distributed version control systems,¹⁵ and snapshot isolation in databases.⁷ To our knowledge, however, Determinator is the first OS to introduce such a model for pervasive thread- and process-level determinism.

If one thread executes the assignment “x = y” while another concurrently executes “y = x,” for example, these assignments race in the conventional model, but are race-free under Determinator and always swap x with y. Each thread’s read of x or y always sees the “old” version of that variable, saved in the thread’s private workspace at the last explicit synchronization point.

Figure 1 illustrates a more realistic example of a game or simulator, which uses an array of “actors” (players, particles, etc.) to represent some logical “universe,” and updates all actors in parallel at each time step. To update the actors, the main thread forks a child thread per actor, then synchronizes by joining all child threads. The child thread code to update each actor is “inline” within the main() function, a convenience of Unix fork() that Determinator extends to shared memory threads.

Each child thread reads the “prior” state of any or all actors in the array, then updates the state of its assigned actor in-place, without explicit copying or additional synchronization. With standard threads this code has a read/write race: each child thread may see an arbitrary mix of “old” and “new” states as it examines other actors in the array. Under Determinator, however, this code is correct and race-free. Each child thread reads only its private working copy of the actors array, which is untouched except by the child thread itself, since the main thread forked that child. As the main thread rejoins all its child threads, Determinator merges each child’s actor array updates back into the main thread’s working copy, for use in the next time step.

Traditional write/write races become conflicts in this model. If two child threads concurrently write to the same actor array element, for example, Determinator detects this conflict and signals a runtime exception when the main thread attempts to join the second conflicting child. In the conventional model, by contrast, either thread might “win” this timing-dependent race, silently propagating a likely erroneous value throughout the computation. Running this code under a conventional deterministic scheduler causes the “winner” to be decided based on a synthetic, reproducible time metric (e.g., instruction count) rather than real time, but the race remains and may still manifest or vanish because of slight changes in inputs or instruction path lengths.

2.3. A race-free synchronization API

Standard threads can behave nondeterministically even in a correctly locked program with no memory access races. Two threads might acquire a lock in any order, for example, leading to high-level data races.² This nondeterminism is inherent in the lock abstraction: we can record and replay or synthesize a lock acquisition schedule,²² but such a schedule is still arbitrary and unpredictable to the developer.

Fortunately, many synchronization abstractions are naturally deterministic, such as fork/join, barriers, and futures. Deterministic abstractions have the key property that when threads synchronize, program logic alone determines the points in the threads’ execution paths at which the synchronization occurs, and the threads that are involved. In fork/join synchronization, for example, the parent’s thread_join(t) operation and the child’s thread_exit() determine the respective synchronization points, and the parent explicitly indicates the thread t to join. Locks fail this test because one thread’s unlock() passes the lock to an arbitrary successor thread’s lock(). Queue abstractions such as semaphores and pipes are deterministic if only one thread can access each end of the queue¹⁸ but nondeterministic if several threads can race to insert or remove elements at either end.

Since the multicore revolution is young and most application code is yet to be parallelized, we may still have a choice of what synchronization abstractions to use. Determinator therefore supports only race-free synchronization primitives natively, although it can emulate nondeterministic primitives via deterministic scheduling for compatibility.

2.4. Race-free system namespaces

Current operating system APIs often introduce nondeterminism unintentionally by exposing shared namespaces implicitly synchronized by locks. Execution timing affects the pointers returned by malloc() or mmap() or the file numbers returned by open() in multithreaded Unix processes, and the process IDs returned by fork() or the file names returned by mktemp() in single-threaded processes. Even if only one thread actually uses a given memory block, file, process ID, or temporary file, the assignment of these names from a shared namespace is inherently nondeterministic.

Determinator’s API avoids creating shared namespaces with system-chosen names, instead favoring thread-private namespaces with application-chosen names. Applications, not the system, choose virtual addresses for allocated memory and process IDs for children. This principle ensures that naming a resource reveals no shared state information other than what the application itself provided. Since implicitly shared namespaces often cause multiprocessor contention, designing system APIs to avoid this implicit sharing may be synergistic with recent multi-core scalability work.¹⁰

3. The Determinator Kernel

We now briefly outline the Determinator kernel’s design and API. Applications normally do not use this API directly, but rather use the user-level runtime described in Section 4.

3.1. Spaces

Determinator executes application code within a hierarchy of spaces, illustrated in Figure 2. Each space consists of CPU register state for a single control flow, a virtual address space containing code and working data, and optionally a snapshot of a prior version of this address space. A Determinator space is analogous to a single-threaded Unix process, with important differences; we use the term “space” to highlight these differences and avoid confusion with the “process” and “thread” abstractions that Determinator emulates at the user level.

A Determinator space cannot outlive its parent, and a space can directly interact only with its immediate parent and children via three system calls described below. The kernel provides no file systems, writable shared memory, or other abstractions that imply globally shared state.

Only the root space has direct access to nondeterministic inputs via I/O devices, such as console input or clocks. Other spaces access I/O devices only indirectly via parent/child interactions or via privileges delegated by the root space. A parent space can thus control all nondeterministic inputs into any unprivileged space subtree, for example, logging inputs for future replay. This space hierarchy also creates a performance bottleneck for I/O-bound applications, a design limitation we leave to address in future work.

3.2. System call API

Determinator spaces interact only as a result of processor traps and three system calls—Put, Get, and Ret, shown in Table 1. Arguments allow multiple related operations to be combined in one system call. A Put call can, for example, initialize a child’s registers, copy a virtual memory range into the child, set permissions on the target range, snapshot the child’s address space, and start the child executing.

Each space has a namespace of child spaces. User-level code manages this namespace, specifying any child number in a Get or Put; the kernel creates this child if it does not exist. If the child did exist and was running, the kernel blocks the parent until the child stops via Ret or a trap. These cooperative “rendezvous” semantics ensure that spaces synchronize only at well-defined points in both spaces’ execution.

User-level code can specify a Copy option in a Get or Put call to copy a range of virtual memory between the invoking space and a child. The kernel uses copy-on-write to optimize large copies and avoid physically copying read-only pages. A Snap option similarly uses copy-on-write to save a reference snapshot of a child space’s entire virtual address space.

The Put call also supports a Merge option, which is like Copy, except that the kernel copies only bytes that differ between the child’s current and reference snapshots into the parent space, leaving other bytes in the parent untouched. The kernel also detects conflicts: if a byte changed in both the child’s and parent’s spaces since the snapshot, the kernel generates an exception, treating a conflict as a programming error like an illegal memory access or divide-by-zero. Determinator’s user-level runtime uses Merge to give multithreaded processes the illusion of shared memory, as described later.

Finally, the Ret system call stops the calling space, returning control to the space’s parent. Exceptions such as divide-by-zero also cause an implicit Ret, providing the parent a status code indicating why the child stopped.

To facilitate debugging or protect against looping children, a parent can start a child with an instruction limit, forcing the child to trap back to the parent after executing a specific number of instructions. Counting instructions instead of “real time” preserves determinism, while enabling spaces to “quantize” a child’s execution to implement deterministic scheduling at user level.⁸

3.3. Distribution via space migration

Space hierarchies can span not only multiple CPUs in a multiprocessor/multicore system but also multiple nodes in a homogeneous cluster. While distribution is semantically transparent to applications, an application may have to be designed with distribution in mind to perform well.

To support distribution, the kernel interprets the higher order bits in each space’s child namespace as a node number. When a space invokes Put or Get, the kernel first migrates the calling space’s state and control flow to the node specified in the child number argument, before creating or interacting with some child on that node indicated in the remaining child number bits. Figure 3 illustrates a space migrating between two nodes and managing children on each.

When the kernel migrates a space, it first transfers to the receiving kernel only the space’s register state and address space metadata. The receiving kernel requests the space’s memory pages on demand as the space accesses them on the new node. Each node’s kernel avoids redundant cross-node page copying in the common case when a space repeatedly migrates among several nodes—for example, when a space starts children on each of several nodes, then returns later to collect their results. For read-only pages such as program code, each node’s kernel reuses cached copies of these pages whenever the space returns to that node.

4. High-Level Parallel Abstractions

Determinator’s kernel API eliminates many convenient and familiar abstractions, but we can reproduce many of them deterministically in user space, with important trade-offs. This section describes how Determinator’s user-level run-time infrastructure emulates traditional Unix processes, file systems, threads, and synchronization.

4.1. Processes and fork/exec/wait

We make no attempt to replicate Unix process semantics exactly, but wish to emulate traditional fork/exec/wait enough to support common uses in scriptable shells, build tools such as make, and multiprocess applications such as compilers.

Fork: A basic Unix fork() requires only one Put system call, to copy the parent’s entire memory state into a child space, set up the child’s registers, and start the child. The difficulty arises from Unix’s global process ID (PID) namespace, a source of nondeterminism as discussed in Section 2.4. Since most applications use PIDs returned by fork() merely as an opaque argument to a subsequent waitpid(), our runtime makes PIDs local to each process. Each process allocates its own PIDs, which are unrelated to and may numerically conflict with PIDs in other processes. This change breaks Unix applications that pass PIDs among processes, and means that commands like "ps" must be built into shells for the same reason that "cd" already is. This approach works for compute-oriented applications following the typical fork/wait pattern, however.

Exec: A user-level implementation of Unix exec() must construct the new program’s memory image, intended to replace the old program, while still executing the old program’s runtime library code. Our runtime loads the new program into a “reserved” child space never used by fork(), then calls Get to copy that child’s entire memory atop that of the (running) parent: this Get thus “returns” into the new program. The runtime also carries over some Unix process state, such as the PID namespace and file system state described later, from the old to the new program.

Wait: When an application calls waitpid() to wait for a specific child, the runtime calls Get to synchronize with the child’s Ret and obtain the child’s exit status. The child may Ret back to the parent to make I/O requests before terminating; the parent’s runtime services the I/O request and resumes the waitpid() transparently to the application.

Unix’s wait() is problematic as it waits for any (“the first”) child to terminate, violating determinism as discussed in Section 2.3. The kernel cannot offer a “wait for any child” system call without compromising determinism, so our runtime simply waits for the child that was forked earliest.

This behavior does not affect applications that fork several children and then wait for all of them to complete, but affects two common uses of wait(). First, interactive Unix shells use wait() to report when background processes complete; an interactive shell running under Determinator requires special “nondeterministic I/O privileges” to provide this functionality (and related functions such as interactive job control). Second, our runtime’s behavior may adversely affect the performance of programs that use wait() to implement dynamic scheduling or load balancing in user space.

Consider a parallel make run with or without limiting the number of concurrent children. A plain "make -j," allowing unlimited children, leaves scheduling to the system. Under Unix or Determinator, the kernel’s scheduler assigns tasks to available CPUs, as in Figure 4 (a) and (b). If the user runs "make -j2," however, then make initially starts only tasks 1 and 2, then waits for one to complete before starting task 3. Under Unix, wait() returns when the short task 2 completes, enabling make to start task 3 promptly as in Figure 4 (c). On Determinator, the wait() returns only when (deterministically chosen) task 1 completes, resulting in nonoptimal schedule in Figure 4 (d): determinism prevents the runtime from learning which of tasks 1 and 2 completed first. The unavailability of timing information with which to inform user-level scheduling decisions thus suggests a practice of leaving scheduling to the system in a deterministic environment (e.g., "make -j" instead of "-j2").

4.2. A shared file system

Unix’s file system provides a convenient namespace and repository for staging program inputs, storing outputs, and holding intermediate results such as temporary files. Since our kernel permits no physical state sharing, user-level code must emulate shared state abstractions. Determinator’s “shared-nothing” space hierarchy is similar to a distributed system consisting only of uniprocessor machines, so our user-level runtime borrows distributed file system principles to offer applications a shared file system abstraction.

Since our current focus is on emulating familiar abstractions and not on developing storage systems, Determinator’s file system currently provides no persistence: it effectively serves only as a temporary file system.

Our runtime mimics a weakly consistent, replicated file system,²⁴ by maintaining a complete file system replica in each process’s address space, as shown in Figure 5. When a process creates a child via fork(), the child inherits a copy of the parent’s file system in addition to the parent’s open file descriptors. Individual open/close/read/write operations in a process use only that process’s file system replica, so different processes’ replicas may diverge as they modify files concurrently. When a child terminates and its parent collects its state via wait(), the parent’s runtime uses file versioning to propagate the child’s changes into the parent.

If a parallel make forks several compiler processes, for example, each child writes its output .o file to its own file system replica. The parent’s runtime merges these .o files into the parent’s file system as the parent wait()s on each child. This copying and reconciliation is not as inefficient as it may appear, due to the kernel’s copy-on-write optimizations. Replicating a file system image among many spaces copies no physical pages until user-level code modifies them, so all copies of identical files consume only one set of pages.

As in any weakly consistent file system, processes that perform unsynchronized, concurrent writes to the same file may cause conflicts. When our runtime detects a conflict, it simply discards one copy and sets a conflict flag on the file; subsequent attempts to open() the file result in errors. This behavior is intended for batch compute applications for which conflicts indicate a bug, whose solution is to fix the bug and rerun the job. Interactive use would demand a policy that avoids losing data. The user-level runtime could implement stronger consistency and avoid unsynchronized concurrent writes, at higher synchronization costs.

4.3. Input/output and logging

Since unprivileged spaces can access external I/O devices only indirectly via parent/child interaction within the space hierarchy, our user-level runtime treats I/O as a special case of file system synchronization. In addition to regular files, a process’s file system contains special I/O files, such as console input and output files. Unlike Unix device special files, Determinator’s I/O files actually hold data in the process’ file system image: for example, the console input file accumulates all characters that the process has received from the console, and the console output file holds all characters written to the console. In the current prototype, console or log files can eventually “fill up” and become unusable, though a suitable garbage-collection mechanism could address this flaw.

When a process does a console read(), the C library first returns unread data already in the process’s local console input file. When no more data is available, instead of returning an end-of-file condition, the process calls Ret to synchronize with its parent and wait for more console input (or in principle any other form of new input) to become available. When the parent does a wait() or otherwise synchronizes with the child, it propagates any new input to the child. When the parent has no new input for any waiting children, it forwards all their input requests to its parent, and ultimately to the kernel via the root process.

When a process does a console write(), the runtime appends the new data to its internal console output file as it would append to a regular file. The next time the process synchronizes with its parent, file system reconciliation propagates these writes toward the root process, which forwards them to the kernel’s I/O devices. A process can request immediate output propagation by explicitly calling fsync().

Reconciliation handles “append-only” writes differently from other writes, enabling concurrent writes to console or log files without conflict. During reconciliation, if both the parent and child have made append-only writes to the same file, reconciliation appends the child’s latest writes to the parent’s copy of the file, and vice versa. Each process’s replica thus accumulates all processes’ concurrent writes, though different processes may observe these writes in a different order. Unlike Unix, rerunning a parallel computation from the same inputs with and without output redirection yields byte-for-byte identical console and log file output.

4.4. Shared memory multithreading

Shared memory multithreading is popular despite its non-determinism, in part because parallel code need not pack and unpack messages: threads simply compute “in-place” on shared variables and structures. Since Determinator gives user spaces no physically shared memory other than read-only sharing via copy-on-write, emulating shared memory involves distributed shared memory (DSM) techniques. Adapting the private workspace model of Section 2.2 to thread-level shared memory reuses ideas from early parallel Fortran machines⁶ and release-consistent DSM systems,¹ although these prior designs did not offer determinism.

Our runtime uses the kernel’s Snap and Merge operations (Section 3.2) to emulate shared memory in the private workspace model, using fork/join synchronization. To fork a child, the parent thread calls Put with the Copy, Snap, Regs, and Start options to copy the shared part of its memory into a child space, save a snapshot of that memory state in the child, and start the child running, as illustrated in Figure 6. The master thread may fork multiple children this way. To synchronize with a child and collect its results, the parent calls Get with the Merge option, which merges all changes that the child made to shared memory, since its snapshot was taken, back into the parent. If both parent and child—or the child and other children whose changes the parent has collected—have concurrently modified the same byte since the snapshot, the kernel detects and reports this conflict.

Our runtime also supports barriers, the foundation of data-parallel programming models like OpenMP.²³ When each thread in a group arrives at a barrier, it calls Ret to stop and wait for the parent thread managing the group. The parent calls Get with Merge to collect each child’s changes before the barrier, then calls Put with Copy and Snap to resume each child with a new shared memory snap-shot containing all threads’ prior results. While the private workspace model conceptually extends to nonhierarchical synchronization,³ our prototype’s strict space hierarchy currently limits synchronization flexibility, an issue we intend to address in the future. Any synchronization abstraction may be emulated at some cost as described in the next section, however.

An application can choose which parts of its address space to share and which to keep thread-private. By placing thread stacks outside the shared region, all threads can reuse the same stack area, and the kernel wastes no effort merging stack data. Thread-private stacks enable a child thread to inherit its parent’s stack and run “inline” in the same C/C++ function as its parent, as in Figure 1. If threads wish to pass pointers to stack-allocated structures, however, then they may locate their stacks in disjoint shared regions. Similarly, if the file system area is shared, then the threads share a common file descriptor namespace as in Unix. Excluding the file system area from shared space and using normal file system reconciliation (Section 4.2) to synchronize it yields thread-private file tables.

4.5. Emulating legacy thread APIs

For legacy code already parallelized using nondeterministic synchronization, Determinator’s runtime can emulate the standard pthreads API via deterministic scheduling.⁸ In this case, the process’s initial master space never runs application code directly, but acts as a scheduler supervising one child space per application thread. The scheduler uses the kernel’s instruction limit feature (Section 3.2) to quantize each thread’s execution. After allowing each thread to run independently for one quantum, the scheduler uses Merge to collect the thread’s shared memory changes, then restarts the thread from the merged state for another quantum.

To emulate pthreads synchronization, a thread can end its quantum prematurely to interact with the scheduler. Each mutex, for example, always has an “owner” thread, whether or not the mutex is locked. The owner can lock and unlock the mutex without scheduler interactions, but another thread needing the mutex must invoke the scheduler to obtain ownership. At the current owner’s next quantum, the scheduler “steals” the mutex’s ownership if the mutex is unlocked, otherwise placing the locking thread on the mutex’s queue to be awoken once the mutex becomes available.

While deterministic scheduling offers compatibility with legacy code, it has drawbacks. The master space, required to serialize synchronization operations, may limit scalability unless execution quanta are large. Large quanta can increase the time threads waste waiting to interact with another, however, to steal an unlocked mutex, for example.

Further, since the deterministic scheduler may preempt a thread and propagate shared memory changes at any point in application code, the programming model remains nondeterministic. In contrast to our private workspace model, if one thread runs “x = y” while another runs “y = x” under the deterministic scheduler, the result may be repeatable but is no more predictable to the programmer than before. While rerunning a program with exactly identical inputs will yield identical results, if an input change or code recompilation affects the length of any instruction sequence, this change may cascade into a different execution schedule and trigger schedule-dependent if not timing-dependent bugs.

5. Evaluation

This section evaluates the Determinator prototype, first informally, and then examines single-node and distributed parallel processing performance and, finally, code size.

Determinator is written in C with small assembly fragments, and currently runs on the 32-bit x86 architecture. Since our focus is on parallel compute-bound applications, Determinator’s I/O capabilities are currently limited to text-based console I/O and a simple Ethernet-based protocol for space migration (Section 3.3). The prototype has no persistent disk storage: the runtime’s shared file system abstraction currently operates in physical memory only.

5.1. Experience using the system

We find that a deterministic programming model simplifies debugging of both applications and user-level runtime code, since user-space bugs are always reproducible. Conversely, when we do observe nondeterministic behavior, it can result only from a kernel (or hardware) bug, immediately limiting the search space.

Because Determinator’s file system holds a process’s output until the next synchronization event (often the process’s termination), each process’s output appears as a unit even if the process executes in parallel with other output-generating processes. Further, different processes’ outputs appear in a consistent order across runs, as if run sequentially.

While race detection tools exist,²¹ we find it convenient that Determinator detects conflicts under “normal-case” execution, without requiring a special tool. Determinator’s model makes conflict detection as standard as detecting a division by zero or illegal memory access.

A subset of Determinator doubles as PIOS, “Parallel Instructional Operating System,” used in Yale’s OS course for the past 2 years.¹⁴ While determinism is not a primary course topic, we found Determinator/PIOS to be a useful tool for teaching operating systems in the multicore era, due to its simple design and minimal kernel API. Partly derived from and inspired by JOS,¹⁷ PIOS includes an instructional framework where students fill in missing pieces of a “skeleton.” Students work in small groups to reimplement all of Determinator’s core features: multi-processor scheduling, virtual memory with copy-on-write and Snap/Merge, user-level threads with fork/join, the user-space file system with versioning and reconciliation, and cross-node space migration. In their final projects, students extend the OS with features such as graphics, pipes, and remote shells. The course is challenging and incurs a heavy programming load, but anonymous student reviews have been highly positive due to its educational value and perceived relevance.

5.2. Single-node multicore performance

Since Determinator runs user-level code “natively” on the hardware instead of rewriting user code,⁸ its performance is comparable to that of existing systems when running coarse-grained parallel code, but incurs higher costs on fine-grained parallel code because of the system calls, context switches, and virtual memory operations required at synchronization events.

Figure 7 shows the performance of several shared-memory parallel benchmarks on Determinator, relative to the same benchmarks using standard pthreads on 32-bit Ubuntu Linux 9.10. The coarse-grained benchmarks md5, matmult, qsort, blackscholes, and fft perform comparably to nondeterministic execution under Linux, while the fine-grained lu benchmarks show a higher performance cost.

Figure 8 shows each benchmark’s speedup relative to single-threaded execution on Determinator. The “embarrassingly parallel” md5 and blackscholes scale well, matmult and fft level off after four processors (but still perform comparably to Linux), and the remaining benchmarks scale poorly.

5.3. Distributed computing performance

To evaluate cross-node space migration (Section 3.3), we changed the md5 and matmult benchmarks to distribute workloads across up to 32 uniprocessor nodes, while preserving Determinator’s shared memory programming model. Lacking a cluster suitable for native testing, we ran Determinator under QEMU,⁵ on a shared-use Linux cluster.

Figure 9 shows parallel speedup versus single-node execution in Determinator, on a log-log scale. The md5-circuit benchmark serially migrates to each node to fork worker threads and collect results, while md5-tree and matmult-tree fork and join workers recursively in a binary tree structure. The “embarrassingly parallel” md5-tree performs and scales well with recursive work distribution, while matmult-tree suffers from the cost of large cross-node data transfers using Determinator’s unoptimized page copying protocol.

As Figure 10 shows, the shared-memory benchmarks on Determinator perform comparably to nondeterministic, distributed-memory equivalents running on Linux. Illustrating the benefits of its shared memory API, the Determinator version of md5 is 63% of the size of the Linux version (62 lines versus 99), which uses remote shells to coordinate workers. Determinator’s matmult is 34% of the size of the Linux version (90 lines versus 263), which passes data using TCP.

5.4. Implementation complexity

To provide a feel for implementation complexity, Table 2 shows source code line counts for Determinator, as well as its PIOS instructional subset, counting only lines containing semicolons. The entire system is less than 15,000 lines, about half of which is generic C and math library code needed mainly for porting Unix applications easily.

6. Conclusion

While Determinator is only a proof of concept, it shows that operating systems can offer a pervasively, naturally deterministic environment. Determinator avoids introducing data races in shared memory and file system access, thread and process synchronization, and throughout its API. Experiments suggest that such an environment can efficiently run coarse-grained parallel applications, both on multicore machines and across clusters, though running fine-grained applications efficiently may require hardware evolution.

Acknowledgments

We thank Zhong Shao, Ramakrishna Gummadi, Frans Kaashoek, Nickolai Zeldovich, Sam King, and the OSDI reviewers for their valuable feedback. We thank ONR and NSF for their support under grants N00014-09-10757 and CNS-1017206.

Figures

Figure 1. C pseudocode for lock-step time simulation, which contains a data race in standard concurrency models but is bug-free under Determinator.

Figure 2. The kernel’s hierarchy of spaces, each containing private register and virtual memory state.

Figure 3. A space migrating between two nodes and starting child spaces on each node.

Figure 4. Parallel make under unix and Determinator: (a) and (b) with unlimited parallelism; (c) and (d) with a 2-worker quota imposed at user level.

Figure 5. Each process maintains a private replica of the shared file system, using file versioning to reconcile replicas at synchronization points.

Figure 6. A multithreaded process built from one space per thread, with a master space managing synchronization and memory reconciliation.

Figure 7. Determinator performance relative to pthreads under Linux using parallel benchmarks.

Figure 8. Determinator parallel speedup relative to its own single-CPU performance.

Figure 9. Speedup of deterministic shared-memory benchmarks on varying-size distributed clusters.

Figure 10. Deterministic shared-memory benchmarks versus distributed-memory equivalents for Linux.

Tables

Table 1. System calls comprising Determinator’s kernel API.

Table 2. Implementation code size of the Determinator OS and of PIOS, its instructional subset

Submit an Article to CACM

CACM welcomes unsolicited submissions on topics of relevance and value to the computing community.

You Just Read

Efficient System-Enforced Deterministic Parallelism

View in the ACM Digital Library

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from permissions@acm.org or fax (212) 869-0481.

DOI

10.1145/2160718.2160742

May 2012 Issue

Published: May 1, 2012

Vol. 55 No. 5

Pages: 111-119

Table of Contents

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

The Latest from CACM

Explore More

BLOG@CACM Apr 26 2024

Optimizing Energy Efficiency in Datacenters with Advanced Cooling Technologies

Alex Williams

Architecture and Hardware

Credit: Getty Images Servers in snowy setting.

News Apr 23 2024

Maximizing Power Grid Security

R. Colin Johnson

Security and Privacy

News Apr 18 2024

Keeping AI Out of Elections

Bennie Mols

Artificial Intelligence and Machine Learning

Shape the Future of Computing

ACM encourages its members to take a direct hand in shaping the future of the association. There are more ways than ever to get involved.

Get Involved

Communications of the ACM (CACM) is now a fully Open Access publication.

By opening CACM to the world, we hope to increase engagement among the broader computer science community and encourage non-members to discover the rich resources ACM has to offer.

Learn More

1. Introduction

2. A Deterministic Programming Model

3. The Determinator Kernel

4. High-Level Parallel Abstractions

5. Evaluation

6. Conclusion

Acknowledgments

Figures

Tables

Efficient System-Enforced Deterministic Parallelism

DOI

May 2012 Issue

Related Reading

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

Shape the Future of Computing

Communications of the ACM (CACM) is now a fully Open Access publication.