Architecture and Hardware Contributed articles

Formal Analysis of MPI-Based Parallel Programs

The goal is reliable parallel simulations, helping scientists understand nature, from how foams compress to how ribosomes construct proteins.

By Ganesh Gopalakrishnan, Robert M. Kirby, Stephen Siegel, Rajeev Thakur, William Gropp, Ewing Lusk, Bronis R. De Supinski, Martin Schulz, and Greg Bronevetsky

Posted Dec 1 2011

Introduction
Key Insights
Full-Scale Debugging
Symbolic Analysis of MPI
Static Analysis of MPI
Conclusion
Acknowledgments
References
Authors
Figures

Intrepid Blue Gene/P supercomputer — The Intrepid Blue Gene/P Open Science machine at Argonne National Laboratory, with 163,840 cores and 557 TFlops peak performance.

Most parallel computing applications in high-performance computing use the Message Passing Interface (MPI) API. Given the fundamental importance of parallel computing to science and engineering research, application correctness is paramount. MPI was originally developed around 1993 by the MPI Forum, a group of vendors, parallel programming researchers, and computational scientists. However, the document defining the standard is not issued by an official standards organization but has become a de facto standard through near-universal adoption. MPI development continues, with MPI-2.2 released in 2009 and MPI 3.0 expected in 2012. The standard is published on the Web²¹ and as a book, along with several other books based on it; see, for example, Gropp et al.¹³ and Pacheco.²⁶ Implementations are available in open source from MPICH2²² and Open MPI²⁵ from software vendors and from every vendor of HPC systems. MPI is widely cited; Google Scholar recently returned 39,600 hits for the term “+MPI +Message Passing Interface.”

Key Insights

Addressing the challenges of distributed systems, debugging necessitates collaboration between HPC and formal verification.
Along with HPC, distributed computing based on communication libraries is going mainstream in the commodity world, two communities that must look to learn and benefit from one another.
Catastrophic disruption of programmer productivity can be avoided through formal verification tools that handle problems of scale, enhance coverage by avoiding redundant searches, and decrease false-alarm rates through more precise analysis.

MPI is designed to support highly scalable computing applications using more than 100,000 cores on, say, the IBM Blue Gene/P (see Figure 1) and Cray XT5. Many MPI programs represent dozens, if not hundreds, of person-years of development, including calibration for accuracy and performance tuning. Scientists and engineers worldwide use MPI in thousands of applications, including in investigations of alternate-energy sources and in weather simulation. For HPC computing, MPI is by far the dominant programming model; most (at some centers, all) applications running on supercomputers use MPI. Many application developers for exascale systems¹⁵ regard support for MPI as a requirement.

Still, the MPI debugging methods available to these developers are typically wasteful and ultimately unreliable. Existing MPI testing tools seldom provide coverage guarantees, examining essentially equivalent execution sequences, thus reducing testing efficiency. These methods fare even worse at large problem scales. Consider the costs of HPC bugs. A high-end HPC center costs hundreds of millions of dollars to commission, and the machines become obsolete within six years; in many centers, the annual electricity bill can run more than $3 million, and research teams apply for computer time through competitive proposals, spending years planning their experiments. In addition to these costs, one must add the costs to society of relying on potentially defective software to inform decisions involving issues of great public importance (such as climate change).

Formal methods can play an important role in debugging and verifying MPI applications. Here, we describe existing techniques, including their pros and cons, and why they have value beyond MPI, addressing the general needs of future concurrency application developers who will inevitably use low-level concurrency APIs.

Historically, parallel systems have used either message passing or shared memory for communication. Compared to other message-passing systems noted for their parsimony, MPI supports a large number of cohesively engineered features essential for designing large-scale simulations; for example, MPI-2.2²¹ specifies more than 300 functions, though most developers use only a few dozen in any given application.

MPI programs consist of one or more threads of execution with private memories (called “MPI processes”) and communicate through message exchanges. The two most common are point-to-point messages (such as sends and receives) and collective operations (such as broadcasts and reductions). MPI also supports non-blocking operations that help overlap computation and communication and persistent operations that make repeated sends/receives efficient. In addition, MPI allows processes and communication spaces to be structured using topologies and communicators. MPI’s derived datatypes further enhance the portability and efficiency of MPI codes by enabling the architect to communicate noncontiguous data with a single MPI function call. MPI also supports a limited form of shared-memory communication based on one-sided communication. A majority of MPI programs are still written using the “two-sided” (message-passing-oriented) constructs we focus on through the rest of the article. Finally, MPI-IO addresses portable access to high-performance input/output systems.

MPI applications and libraries are written predominantly in C, C++, and/or Fortran. Languages that use garbage collection or managed runtimes (such as Java and C#) are rarely used in HPC; preexisting libraries, compiler support, and memory locality management drive these choices. Memory is a precious resource in large-scale systems; a rule of thumb is an application cannot afford to consume more than one byte per FLOP. Computer memory is expensive and increases cluster energy consumption. Even when developing traditional shared-memory applications, system architects must work with low amounts of cache-coherent memory per core and manage data locality, something done routinely by MPI programmers. Computer scientists are also realizing that future uses of MPI will be in conjunction with shared-memory libraries (such as Pthreads⁷) to reduce message-copy proliferation. While some MPI applications are written from scratch, many are built atop user libraries, including ParMETIS¹⁶ for parallel hypergraph partitioning, ScaLAPACK⁵ for high-performance linear algebra, and PETSc³ for solving partial differential equations.

MPI processes execute in disjoint address spaces, interacting through communication commands involving deterministic, nondeterministic, collective, and non-blocking modes. Existing (shared-memory concurrent program) debugging techniques do not directly carry over to MPI, where operations typically match and complete out-of-program order according to an MPI-specific matches-before order.^30,33 The overall behavior of an MPI program is also heavily influenced by how specific MPI library implementations take advantage of the latitude provided by the MPI standard.

An MPI program bug is often introduced when modeling the problem and approximating the numerical methods or while coding, including whole classes of floating-point challenges.¹¹ While lower-level bugs (such as deadlocks and data races) are serious concerns, detecting them requires specialized techniques of the kind described here. Since many MPI programs are poorly parameterized, it is not easy for HPC developers to down-scale a program to a smaller instance and locate the bug. For these reasons, HPC developers need a variety of verification methods, each narrowly focused on subsets of correctness issues and making specific trade-offs. Our main focus here is formal analysis methods for smaller-scale MPI applications and semiformal analysis methods for the very large scale. For detecting MPI bugs in practice, formal analysis tools must be coupled with runtime instrumentation methods found in tools like Umpire,³² Marmot,¹⁹ and MUST,¹⁴ though much more research is needed in tool integration.

Dynamic analysis. MPI provides many nondeterministic constructs that free the runtime system to choose the most efficient way to carry out an operation but also mean a program can exhibit multiple behaviors when run on the same input, posing verification challenges; an example is a communication race arising from a “wildcard” receive, an operation that does not specify the source process of the message to be received, leaving the decision to the runtime system. Many subtle program defects are revealed only for a specific sequence of choices. Though random testing might happen on one such sequence, it is hardly a reliable approach.

In contrast, dynamic verification approaches control the exact choices made by the MPI runtime, using this control to methodically explore a carefully constructed subset of behaviors. For each such behavior, a number of properties may be verified, including absence of deadlocks, assertion violations, incompatible data payloads between senders and receivers, and MPI resource leaks. Using a formal model of the MPI semantics, a dynamic verifier can conclude that if no violations occur on the subset of executions, then there can be no violation on an execution. If even this reduced subset cannot be explored exhaustively, the developer can specify precise coverage criteria and obtain a lesser (but still quantifiable) degree of assurance. This approach was originally demonstrated in the VeriSoft¹⁰ tool and has the advantage of not requiring modifications to the program source code, compiler, or libraries.

Full-scale debugging. Traditional “step-by-step” debugging techniques are untenable for traces involving millions of threads. Later, in an expanded description of full-scale debugging, we describe a new debugging approach called Stack Trace Analysis that analyzes an execution trace and partitions the threads into equivalence classes based on their behavior. Experience on real large-scale systems shows that only a small number of classes typically emerge, and the information provided can help a developer isolate defects. While this approach is not comparable to the others covered here, in that the focus is on the analysis of one trace rather than reasoning about all executions, it provides a clear advantage in terms of scalability.

Symbolic analysis. The techniques discussed earlier are only as good as the set of inputs chosen during analysis. Defects revealed for very specific input or parameter values may be difficult to discover with these techniques alone. Symbolic execution¹⁸ is a well-known technique for identifying defects, described later in the section on symbolic analysis of MPI, including how it is applied to MPI programs. The TASS toolkit²⁷ uses symbolic execution and state-enumeration techniques to verify properties of MPI programs, not only for all possible behaviors of the runtime system, but for all possible inputs as well. It can even be used to establish that two versions of a program are functionally equivalent, at least within specified bounds. On the other hand, implementing the symbolic execution technique requires sophisticated theorem-proving technology and a symbolic interpreter for all program constructs and library functions; for this reason, TASS supports only C and a subset of MPI. Moreover, it generally does not scale beyond a relatively small number of processes, though, as we show, defects that usually appear only in large configurations can often be detected in much smaller configurations through symbolic execution.

Static analysis. Compilers use static analyses to verify a variety of simple safety properties of sequential programs, working on a formal structure that abstractly represents some aspect of the program (such as a control-flow graph, or CFG). Extending these techniques to verify concurrency properties of MPI programs (such as deadlock freedom) requires new abstractions and techniques. Later, in the section on static analysis of MPI, we outline a new analysis framework targeting this problem that introduces the notion of a parallel CFG. The framework has the advantage that the pCFG is independent of the number of processes, essentially making it infinitely scalable. However, because automating these analyses is so difficult they may require user-provided program annotation to guide them.

Dynamic Verification of MPI

Here, we explore two dynamic analysis approaches: The first, implemented by the tool ISP^31,35 (see Figure 2), delivers a formal coverage guarantee with respect to deadlocks and local safety assertions³⁰; ISP has been demonstrated on MPI applications of up to 15,000 lines of code. Running on modern laptop computers, ISP can verify such applications for up to 32 MPI processes on mostly deterministic MPI programs.

ISP’s scheduler, as outlined in the figure, exerts centralized control over every MPI action. It limits ISP scalability to at most a few dozen MPI processes and does not help programmers encountering difficulty at higher ends of the scale where user applications and library codes often use different algorithms. What if a designer has optimized an HPC computation to work efficiently on 1,000 processors and suddenly finds an inexplicable bug? Traditional HPC debugging support is severely lacking in terms of ensuring coverage goals. To address this limitation, some of the authors have built a tool called Distributed Analyzer of MPI, or DAMPI,³⁴ which uses a distributed scheduler while still ensuring nondeterminism coverage. DAMPI scales demonstrably far more than ISP.

Dynamic verification using ISP. For programs with nondeterministic MPI calls, simply modulating the absolute times at which MPI calls are issued (such as by inserting nondeterministic sleep durations, as performed by stress-testing tools) is ineffective because most often it does not alter the way racing MPI sends match with MPI nondeterministic receives deep inside the MPI runtime. Also, such delays slow the entire testing process unnecessarily.

ISP’s active testing approach (see Figure 3) means if P2’s MPI_Isend can match P₁‘s MPI_Irecv, the test encounters a bug. But can such a match occur? Yes, and here’s how; first, let P₀ issue its non-blocking MPI_Isend call and P1 its non-blocking MPI_Irecv call; then allow the execution to cross the MPI_Barrier calls; after that, P₂ can issue its MPI_Isend. The MPI runtime then faces a nondeterministic choice of matching either MPI_Isend. The system achieves this particular execution sequence only if the MPI_Barrier calls are allowed to match before the MPI_Irecv matches. Existing MPI testing tools cannot exert such fine control over MPI executions. By interposing a scheduler, as outlined in Figure 2, ISP can reorder, at runtime, MPI calls issued by the program. In the example, ISP’s scheduler intercepts all MPI calls coming to it in program order and dynamically reorders the calls going into the MPI runtime (ISP’s scheduler sends Barriers first, an order allowed by the MPI semantics), at which point it discovers the nondeterminism.

When ISP determines two matches could occur, it re-executes (replays from the beginning) the program in Figure 3 twice, once with the Isend from P₀ matching the receive, the second Isend from P₂ matching it. To ensure these matches occur, ISP dynamically rewrites Irecv(from:*) into Irecv(from:0) and into Irecv(from:2) in these replays. If the algorithm does not do this but instead issues Irecv(from:*) into the MPI runtime, coverage of both process sends is no longer guaranteed. ISP discovers the maximal extent of non-determinism through dynamic MPI call reordering and achieves scheduling control of relevant interleavings through dynamic API call rewriting. While pursuing relevant interleavings, ISP additionally detects three basic error conditions: deadlocks, resource leaks (such as MPI object leaks), and violations of C assertions in the code.

Developers should bear in mind that MPI programmers often use non-blocking MPI calls to enhance computation/communication overlap and nondeterministic MPI calls in master/worker patterns to detect which MPI process finishes first, so more work can be assigned to it. When these operations, together with “collective” operations (such as Barriers), are all employed in the same example, a developer can obtain situations like the one in Figure 3. The safety net provided by ISP and other such tools is therefore essential for efficiency-oriented MPI programming.

ISP guarantees MPI communication nondeterminism coverage under the given test harness and helps avoid exponential interleaving explosion primarily by avoiding redundantly examining equivalent behaviors (such as by not examining the n! different orders in which an MPI barrier call might be invoked); testing tools typically fall victim to this explosion. ISP also includes execution-space sampling options.

ISP has examined many large MPI programs, including those making millions of MPI calls. Some of the authors have also built the Graphical Explorer of Message passing (GEM) tool,¹² which hosts the ISP verification engine. GEM is an official component of the Eclipse Parallel Tools Platform, or PTP,⁹ (PTP version 3.0 onward), making dynamic formal verification of MPI available seamlessly within a popular integrated development environment. GEM also serves as a formal-methods supplement to a popular MPI textbook²⁶ by providing chapter examples as readily available MPI C projects.

Dynamic verification using DAMPI. A widely used complexity-reduction approach is to debug a given program after first suitably downscaling it. However, a practical difficulty in carrying out such debugging is that many programs are poorly parameterized. For them, if a problem parameter is reduced, it is often unclear whether another parameter should be reduced proportionally, logarithmically, or through some other relationship. A more serious difficulty is that some bugs are manifest only when a problem is run at scale. The algorithms employed by applications and/or the MPI library itself can change depending on problem scale. Also, resource bugs (such as buffer overflows) often show up only at scale.

While user-level dynamic verification supported by ISP resolves significant nondeterminism, testing at larger scales requires a decentralized approach where supercomputing power aids verification, an idea the authors implemented in their tool framework DAMPI³⁴ (see Figure 4).

The key insight that allowed them to design the decentralized scheduling algorithm of DAMPI is that a non-deterministic operation, as in MPI_Irecv(MPI_ANY_SOURCE) and MPI_Iprobe(MPI_ANY_SOURCE), represents a point on the timeline of the issuing process when the operation commits to a match decision. It is natural for an HPC programmer to view each such event as starting an epoch, an interval stretching from the current nondeterministic event up to (but not including) the next nondeterministic event. All deterministic receives can be assigned the same epoch as the one in which they occur. Even though the epoch is defined by a non-deterministic receive matching another process’s send, how can the tool determine all other sends that match it? The solution is to pick all the sends that are not causally after the nondeterministic receive (and subject to MPI’s “non-overtaking” rules). DAMPI determines these sends through an MPI-specific version of Lamport clocks,²⁰ striking a good compromise between scalability and omissions.

Experimental results show DAMPI effectively tests realistic problems running on more than 1,000 CPUs by exploiting the parallelism and memory capacity of clusters. It has examined all benchmarks from the Fortran NAS Parallel Benchmark suite,²⁴ with instrumentation overhead less than 10% compared to ordinary testing, but able to provide nondeterminism coverage not provided by ordinary testing.

Recent experiments by some of the authors found a surprising fact: None of the MPI programs in the NAS Parallel Benchmarks employing MPI_Irecv(MPI_ANY_SOURCE) calls actually exhibit nondeterminism under DAMPI. This means these benchmarks were “determinized,” perhaps through additional MPI call arguments and is further confirmation of the value of dynamic analysis in providing precise answers.

Full-Scale Debugging

The approach described here targets the large-scale systems that will emerge over the next few years; current estimates anticipate half a billion to four billion threads in exascale systems. With such concurrency, developers of verification tools must target debugging techniques able to handle these scales, as bugs are often not manifest until a program is run at its largest scale. Bugs often depend on input, which can differ significantly across full-scale runs. Furthermore, certain types of errors (such as integer overflows) often depend directly on the number of processors.

However, most debugging techniques do not translate well to full-scale runs. The traditional paradigm of stepping through code has significant performance limitations with large processor counts, as well as being impractical with thousands of processes or threads, let alone billions. Dynamic-verification techniques offer paradigmatic scaling but have even more performance limitations, particularly when the number of interleavings depends on process count.

Faced with scaling requirements, HPC developers require new techniques to limit the scope of their debugging efforts. Some of the authors developed mechanisms for identifying behavioral-equivalence classes based on the observation that when errors occur in large-scale programs, they do not exhibit thousands or millions of different behaviors. Rather, they exhibit a limited set of behaviors in which all processes follow the same erroneous path (a single common behavior) along which one or a few processes follow an erroneous path that can then lead to changes in the behavior of a few related processes (two or three behaviors). While the effect may trickle further out, developers rarely observe more than a half-dozen behaviors, regardless of the total number of processes in an MPI program.

Given the limited behaviors that are exhibited, developers can then focus on only debugging representative processes from each behavioral class, rather than all processes at once, thereby enabling the debugging of problems previously not debuggable.

The Stack Trace Analysis Tool (STAT)² achieves this debugging goal by attaching to all processes in a large-scale job and gathering stack traces sampled over time in a low overhead and distributed manner. It then merges these stack traces to identify which processes are executing similar code. The tool considers a variety of equivalence relations; for example, for any n ≥ 1, it considers two processes as equivalent if they agree on the first n function calls issued. Increasing n refines this equivalence relation, giving the developer control of the precision-accuracy trade-off.

The resulting tree readily identifies different execution behaviors. For example, Figure 5 shows the top levels of the tree obtained from a run of the Community Climate System Model (CCSM), an application that uses five separate modules to model land (CSM), ice, ocean (POP), and atmosphere (CAM) and couple the four models. In it, the developer can quickly identify that MPI processes 2439 are executing the land model, 823 the ice model, 40135 the ocean model, and 136471 the atmospheric model, while 07 are executing the coupler. If a problem should be observed in one of them, the developer can then concentrate on this subset of tasks; in the case of a broader error, the developer can pick representatives from the five classes, thereby reducing the initial debugging problem to five processes. The STAT tool has been used to debug several codes with significantly shortened turnaround time, including an Algebraic Multigrid (AMG) package, which is fundamental for many HPC application codes.

Tools like STAT also detect outliers that can directly point to erroneous behavior without further debugging; for example, the STAT tool was used on the CCSM code when it hung on more than 4,096 processes. The stack trace tree showed one task executing in an abnormally deep stack, and, on closer examination of the stack trace, not only that a mutex lock operation within the MPI implementation was called multiple times, creating the deadlock, but also exactly where in the code the respective erroneous mutex lock call occurred. This led to a quick fix of the MPI implementation.

The STAT developer group’s efforts now include extensions that provide better identification of the behavior equivalence classes, as well as techniques to discern relationships among the classes.¹ Additional directions include using the classes to guide dynamic verification techniques.

Symbolic Analysis of MPI

The basic idea of symbolic execution is to execute the program using symbolic expressions in place of the usual (concrete) values held by program variables.¹⁸ The inputs and initial values of the program are symbolic constants X₀;X₁,…, so-called because they represent values that do not change during execution. Numerical operations are replaced by operations on symbolic expressions; for example, if program variables u and v hold values X₀ and X₁, respectively, then u+v will evaluate to the symbolic expression X₀ + X₁.

The situation is more complicated at a branch point. Suppose a branch is governed by condition u+v>0. Since the values are symbolic, it is not necessarily possible to say whether the condition evaluates to true or false; both possibilities must be explored. Symbolic execution handles this problem by introducing a hidden Boolean-valued symbolic variable, the path condition pc, to record the choices made at branch points. This variable is initialized to true. At a branch, a non-deterministic choice is made between the two branches, and pc is updated accordingly. To execute the branch on u+v > 0, pc would be assigned the symbolic value pc u+v > 0 if the true branch is selected; if this is the first branch encountered, pc will now hold the symbolic expression X₀ + X₁ > 0. If the false branch is chosen instead, pc will hold X₀ + X₁ ≤ 0. Hence the path condition records the condition the inputs must satisfy for a particular path to be followed. Model-checking techniques can then be used to explore all nondeterministic choices and verify a property holds on all executions¹⁷ or generate a test set. An automated theorem prover (such as CVC3⁴) can be used to determine if pc becomes unsatisfiable, in which case the current path is infeasible and pruned from the search.

One advantage of symbolic techniques is they map naturally to message-passing-based parallel programs. The Verified Software Lab’s Toolkit for Accurate Scientific Software (TASS),²⁷ based on CVC3, uses symbolic execution and state-exploration techniques to verify properties of such programs. The TASS verifier takes as input the MPI/C source program and a specified number of processes and instantiates a symbolic model of the program with that process count. TASS maintains a model of the state of the MPI implementation, including that of the message buffers. Like all other program variables, the buffered message data is represented as symbolic expressions. The TASS user may also specify bounds on input variables in order to make the model finite or sufficiently small. An MPI-specific partial-order-reduction scheme restricts the set of states explored while still guaranteeing that if a counterexample to one of the properties exists (within the specified bounds), a violation is reported. Examples are included in the TASS distribution, including where TASS reveals defects in the MPI code (such as a diffusion simulation code from the Functional Equivalence Verification Suite at http://vsl.cis.udel.edu/fevs/).

TASS can verify the standard safety properties, but its most important feature is the ability to verify that two programs are functionally equivalent; that is, if given the same input, they always return the same output. This is especially useful in scientific computing where developers often begin with a simple sequential version of an algorithm, then gradually add optimizations and parallelism. The production code is typically much more complex but intended to be functionally equivalent of the original. The symbolic technique used to compare two programs for functional equivalence is known as “comparative symbolic execution.”²⁸

To illustrate the comparative symbolic technique, see Figure 6, where the sequential program reads n floating-point numbers from a file, sums the positive elements, and returns the result. A parallel version divides the file into approximately equal-size blocks. Each process reads one block into a local array and sums the positive elements in its block. On all processes other than process 0, this partial sum is sent to process 0, which receives the numbers, adds them to its partial sum, and outputs the final result.

Ignoring round-off error, the two programs are functionally equivalent; given the same file, they output the same result. To see how the comparative symbolic technique establishes equivalence, consider the case n = nprocs = 2 and call the elements of the file X₀ and X₁. There are four paths through the sequential program, due to the two binary branches if a[i]>0.0. One of these paths, arising when both elements are positive, yields the path condition X₀ > 0 X₁ > 0 and output X₀ + X₁. The comparative technique now explores all possible executions of adder _ par in which the initial path condition is X₀ > 0 Λ X₁ > 0; there are many such executions due to the various ways the statements from the two processes can be interleaved. In each, the output is X₀ + X₁. A similar fact can be established for the other three paths through the sequential program. Taken together, these facts imply the programs will produce the same result on any input (for n = nprocs = 2).

The ability to uncover defects at small scales is an important advantage of symbolic approaches. Isolating and repairing a defect that manifests only in tests with thousands of processes and huge inputs is difficult. Several research projects have focused on making traditional debuggers scale to thousands of processes for just this reason. However, it would be more practical to force the same defect to manifest itself at smaller scales and then isolate the defect at those scales.

We propose a continuum of tools based on static analysis, dynamic analysis, symbolic analysis, and full-scale debugging, complemented by more traditional error-checking tools.

A real-life example illustrates this point: In 2008, a user reported a failure in the MPICH2 MPI implementation when calling the broadcast function MPI _ Bcast, which used 256 processes and a message of just over count = 3,200 integers. Investigation revealed the defect was in a function used to implement broadcasts in specific situations (see Figure 7a). For certain inputs, the “size” argument (nbytes-recv _ offset) to an MPI point-to-point operation—an argument that should always be nonnegative—could in fact be negative. For 256 processes and integer data (type _ size = 4), this fault occurs if and only if 3,201 ≤ count ≤ 3,251.

The problematic function is guarded by the code in Figure 7b, referring to three compile-time constants—MPIR _ BCAST _ SHORT _ MSG, MPIR _ BCAST _ LONG _ MSG, and MPIR _ BCAST _ MIN _ PROCS—defined elsewhere as 12,288, 524,288, and 8, respectively. Essentially, the function is called for “medium-size” messages only when the number of processes is a power of 2 and above a certain threshold. With these settings, the smallest configuration that would reveal the defect is 128 processes, with count = 3,073.

A symbolic execution technique that checks that the “size” arguments to MPI functions are always non-negative would readily detect the defect. If the tool also treats the three compile-time constants as symbolic constants, the defect can be manifest at the much smaller configuration of eight processes and count = 1 (in which case nbytes-recv _ offset = −1). Such an approach would likely have detected this defect earlier and with much less effort.

Arithmetic. In our analysis of the adder example, we interpreted the values manipulated by the program as the mathematical real numbers and the numerical operations as the (infinite precision) real operations. If instead these values are interpreted as (finite-precision) floating-point values and operations, the two programs are not functionally equivalent, since floating-point addition is not associative.¹¹ Which is right? The answer is it depends on what the user is trying to verify. For functional equivalence, the specification and implementation are rarely expected to be “bit-level” equivalent (recall the adder example), so real equivalence is probably more useful for the task. TASS uses a number of techniques specialized for real arithmetic, as when all real-valued expressions are put into a canonical form that is the quotient of two polynomials to facilitate the matching of expressions. For other tasks (such as detecting the defect in Figure 7), bit-level reasoning is more appropriate. Klee⁸ is another symbolic execution tool for (sequential) C programs that uses bit-precise reasoning. There is no reason why these techniques could not be extended to parallel MPI-based programs.

Static Analysis of MPI

In the sequential arena, compiler techniques have been successful at analyzing programs and transforming them to improve performance. However, analyzing MPI applications is difficult for four main reasons: the number of MPI processes is both unknown at compile time and unbounded; since MPI processes are identified by numeric ranks, applications use complex arithmetic expressions to define the processes involved in communications; the meaning of ranks depends closely on the MPI communicators used by the MPI calls; and MPI provides several nondeterministic primitives (such as MPI _ ANY _ SOURCE and MPI _ Waitsome). While some prior research (such as Strout et al.²⁹) explored analysis of MPI applications, none successfully addressed this challenge.

Some approaches treat MPI applications as sequential codes, making it possible to determine simple application behaviors (such as the relationship between writing to a buffer and sending the buffer). However, these approaches cannot represent or analyze the application’s communication topology. Other techniques require knowledge of the number of processes to be used at runtime, analyzing one copy of the application for each process. While this analysis can capture the application’s full parallel structure, it is inflexible and non-scalable.

The shortage of formal-methods researchers interested in HPC problems is perhaps the result of the severe historical disconnect between “traditional computer scientists” and HPC researchers.

Over the past few years, we have developed a novel compiler-analysis framework that extends traditional dataflow analyses to MPI applications, extracting the application’s communication topology and matching the send and receive operations that may communicate at runtime.⁶ The framework requires no runtime bound on number of processes and is formulated as a dataflow analysis over the Cartesian product of control flow graphs (CFGs) from all processes we refer to as a parallel CFG, or pCFG. The analysis procedure symbolically represents the execution of multiple sets of processes, keeping track of any send and receive. Process sets are represented through abstractions (such as lower and upper bounds on process ranks) and predicates (such as “ranks divisible by 4”). Sends and receives are periodically matched to each other, establishing the application’s communication topology. Tool users can instantiate the analysis framework through a variety of “client analyses” that leverage the communication-structure information derived by the framework to propagate dataflow information, as they do with sequential applications. Analyses and transformations include optimizations, error detection and verification, and information-flow detection.

Finally, since topological information is key to a variety of compiler transformations and optimizations, our ongoing work focuses on source-code annotations that can be used to describe a given MPI application’s communication topology and other properties. The techniques will then exploit this information to implement novel scalable analyses and transformations to enable valuable optimizations in complex applications.

Conclusion

This article’s main objective is to highlight the fact that both formal and semi-formal methods are crucial for ensuring the reliability of message-passing programs across the vast scale of application sizes. Unfortunately, discussion of these techniques and approaches is rare in the literature. To address this lacuna, we presented the perspectives of academic researchers, as well as HPC researchers in U.S. national laboratories, engaged in cutting-edge HPC deployment. We propose a continuum of tools based on static analysis, dynamic analysis, symbolic analysis, and full-scale debugging, complemented by more traditional error-checking tools.

Unfortunately, we only barely scratched the surface of a vast problem area. The shortage of formal-methods researchers interested in HPC problems is perhaps the result of the severe historical disconnect between “traditional computer scientists” and HPC researchers. This is especially unfortunate considering the disruptive technologies on the horizon, including many hybrid concurrency models to program many-core systems. There are also emerging message-passing-based standards for embedded multicores (such as MCAPI²³), with designs and tool support that would benefit from lessons learned in the MPI arena.

We propose two approaches to accelerate use of formal methods in HPC: First and foremost, researchers in formal methods must develop verification techniques that are applicable to programs employing established APIs. This would help sway today’s HPC practitioners toward being true believers and eventually promoters of formal methods. Moreover, funding agencies must begin tempering the hoopla around performance goals (such as “ExaFLOPs in this decade”) by also setting formal correctness goals that lend essential credibility to the HPC applications on which science and engineering depend.

Acknowledgments

This work is supported in part by Microsoft, National Science Foundation grants CNS-0509379, CCF-0811429, CCF-0903408, CCF-0953210, and CCF-0733035, and Department of Energy grant ASCR DE-AC0206CH11357. Part of this work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344.

Figures

Figure 1. The Intrepid Blue Gene/P Open Science machine at Argonne National Laboratory, with 163,840 cores and 557TFlops/sec (peak).

Figure 2. overview of ISP.

Figure 3. Bug manifests on some runtimes.

Figure 4. Distributed MPI analyzer.

Figure 5. STAT process equivalence classes.

Figure 6. Programs that read an array from a file, sum positive elements, and output the result.

Figure 7. Excerpts from MPICH2 broadcast code; the fault occurs when the highlighted expression is negative.

Figure. Component of Blue Gene/P supercomputer at Argonne National Laboratory, Argonne, IL.

Submit an Article to CACM

CACM welcomes unsolicited submissions on topics of relevance and value to the computing community.

You Just Read

Formal Analysis of MPI-Based Parallel Programs

View in the ACM Digital Library

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from permissions@acm.org or fax (212) 869-0481.

DOI

10.1145/2043174.2043194

December 2011 Issue

Published: December 1, 2011

Vol. 54 No. 12

Pages: 82-91

Table of Contents

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

The Latest from CACM

Explore More

BLOG@CACM Apr 26 2024

Optimizing Energy Efficiency in Datacenters with Advanced Cooling Technologies

Alex Williams

Architecture and Hardware

Credit: Getty Images Servers in snowy setting.

News Apr 23 2024

Maximizing Power Grid Security

R. Colin Johnson

Security and Privacy

News Apr 18 2024

Keeping AI Out of Elections

Bennie Mols

Artificial Intelligence and Machine Learning

Shape the Future of Computing

ACM encourages its members to take a direct hand in shaping the future of the association. There are more ways than ever to get involved.

Get Involved

Communications of the ACM (CACM) is now a fully Open Access publication.

By opening CACM to the world, we hope to increase engagement among the broader computer science community and encourage non-members to discover the rich resources ACM has to offer.

Learn More

Key Insights

Full-Scale Debugging

Symbolic Analysis of MPI

Static Analysis of MPI

Conclusion

Acknowledgments

Figures

Formal Analysis of MPI-Based Parallel Programs

DOI

December 2011 Issue

Related Reading

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

Shape the Future of Computing

Communications of the ACM (CACM) is now a fully Open Access publication.