Breakthroughs in science and engineering are increasingly made with the help of high-performance computing (HPC) applications. From understanding the process of protein folding to estimating short- and long-term climate patterns, large-scale parallel HPC simulations are the tools of choice. These applications run detailed numerical simulations that model real-world phenomena. Given the great public importance of such scientific advances, the numerical correctness and software reliability of these applications are a major concern for scientists.
Key Insights
- Bugs in parallel HPC applications are difficult to debug because errors propagate quickly among compute nodes, thousands of nodes or more must be debugged at once, and bugs might manifest only at large scale.
- Although conventional approaches such as testing, verification, and formal analysis can detect a variety of bugs, they struggle at massive scales and do not always account for important dynamic properties of program execution.
- Dynamic analysis tools and algorithms that scale with an application’s input and number of compute nodes can help programmers pinpoint the root cause of bugs.
Debugging parallel programs is significantly more difficult than debugging serial programs; human cognitive abilities are overwhelmed when dealing with more than a few concurrent events.12 When debugging a parallel program, programmers must check the state of multiple parallel processes and reason about many different execution paths. The problem is exacerbated at large scale when applications run on top supercomputers with millions of concurrently executing processes. Traditional debugging tools scale poorly with massive parallelism, as they must orchestrate execution of a large number of processes and collect data from them efficiently. The push toward exascale computing has increased the need for scalable debugging techniques.
This article describes the fundamental advances being made in designing scalable debugging techniques, sharing the authors’ experience applying their solutions to current petascale supercomputers. It outlines a set of techniques that build on one another to accomplish large-scale debugging: discovery of scaling bugs, or those that manifest themselves when the application is deployed at large scale; behavioral debugging by modeling the control-flow behavior of tasks; and software-defect detection at the communication layer. It also describes support tools critical to higher-level debugging tools and how they may evolve into the exascale era. Finally, it discusses three broad open problems in this domain: programmability, performance bugs, and silent data corruptions.
Background
Most large-scale HPC applications use the Message Passing Interface (MPI) to communicate among parallel processes in a distributed-memory model. The MPI standard provides more than 300 functions,19 including data-transfer operations (such as send and receive for point-to-point transfer and broadcasts for collective transfer) and process-synchronization operations (such as barriers). Many functions have subtle semantics (such as pairing calls in different processes); for example, a bug can be introduced by incorrectly copying a buffer from the application to the MPI library or by sending a message on one process without properly calling a receive function on a peer.
Software bugs originate not only in the application but also in the MPI library itself; for example, the MPI library might corrupt data while moving it between buffers in the application and buffers in the MPI library or misuse a lower-level communication API.21 These bugs can be devastating in a production supercomputing environment, as the majority of parallel applications depend on the MPI library. In addition, many MPI library implementations exploit the standard’s own flexibility to make aggressive optimizations to improve performance, possibly leading to unexpected numerical non-deterministic behavior; for example, a computation might produce different results each time it executes. Although they are not bugs per se, such unexpected non-determinism hinders debugging MPI applications, especially at large scale.
Scaling bugs. An insidious class of bugs specific to large-scale parallel programs, or “scaling bugs,” manifests only when an application is run at large scale. Scientific simulations are generally modeled and tested at small scale first. Then, when scientists believe the application is correct, they submit campaigns of large-scale production runs (such as with millions of parallel processes). Since resource constraints limit testing, and parallel program behavior is scale- and input-dependent, scaling bugs are often not encountered before these campaigns. Most debugging tools in HPC, including traditional breakpoint-based debugging and relative debugging tools, struggle at such massive scales. Although advances in relative debugging allow scalable comparisons of program runs,6 the process of finding bugs remains fundamentally manual. Most HPC centers provide access to commercial debuggers like TotalView and DDT, but, while they work at large scale to some degree, they are optimized for small- to medium-scale use. Further research is required to develop more-effective, more-automated scalable debugging methods like those explored in this article.
Scaling bugs appear for various reasons (overflow errors, resource exhaustion, and incorrect execution flow are common) and can lead to correctness and performance errors. Figure 1 is a sample bug that appeared in the implementation of the MPI_Allgather routine in the MPICH10 MPI implementation. Each parallel process gathers information from its peers through different communication topologies (such as ring and balanced binary tree), depending on the total amount of data to be exchanged. At a large-enough scale, an overflow occurs in the (internal) variable that stores the total size of data exchanged (see the line with the if statement). Consequently, the routine selects a non-optimal communication topology, which slows down the target application. Determining that the root cause of this bug is a conditional overflow in a single MPI call proved challenging and time consuming.
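The following C sketch illustrates this class of bug; it is not the actual MPICH code, and the variable names and threshold are hypothetical. A 32-bit size computation overflows at large process counts and silently steers the library onto a slower algorithm:

```c
#include <stdio.h>

/* Illustrative only: a topology-selection branch guarded by a 32-bit
 * total-size computation, in the spirit of the Figure 1 bug. */
static void allgather_select(int comm_size, int per_rank_bytes)
{
    /* BUG: comm_size * per_rank_bytes overflows 'int' at large scale,
     * so total_bytes may wrap to a small (or negative) value. */
    int total_bytes = comm_size * per_rank_bytes;

    if (total_bytes < 524288)
        printf("using ring algorithm (intended only for small totals)\n");
    else
        printf("using recursive-doubling algorithm\n");

    /* FIX: compute the total in a wider type, e.g.
     *   long long total = (long long)comm_size * per_rank_bytes; */
}

int main(void)
{
    allgather_select(1 << 20, 1 << 13);  /* 2^33 bytes total: overflows int */
    return 0;
}
```

At small test scales the product stays within range and the correct branch is taken, which is why this kind of defect tends to escape pre-production testing.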
Consequences of software defects. Software defects in HPC applications cause various levels of inconvenience. Hangs and crashes are common bug manifestations, but their causes can be difficult to pinpoint because tight communication dependencies between processes are characteristic of HPC applications; an error in one process spreads quickly to others. Race conditions and deadlocks are challenging because they often become visible at large scales, where messages are more likely to interleave in different, untested orderings. Performance degradation leads to suboptimal use of expensive supercomputing facilities, affecting the power budget at HPC facilities and the rate at which scientific questions can be answered. Numerical errors are among the most perplexing and deleterious, as they directly affect the validity of scientific discoveries. Scientists need effective, easy-to-use debugging tools to isolate these problems quickly.
Conventional approaches: software testing, verification, concurrency bugs. Software verification and testing are two conventional approaches for finding errors (such as by generating high-coverage tests). Such techniques have gained attention recently due to significant advances in constraint satisfiability that permit development of practical symbolic-execution techniques.4 Debugging techniques for both serial code13 and parallel code27 have existed for many years, allowing developers to find programming and runtime errors. The main limitation of these techniques, with respect to the challenges we present here, is that they focus on single-machine applications (the shared-memory programming model) or on small to medium scale (a few tens of concurrent execution threads). In this article, we describe techniques for distributed-memory MPI applications that can operate at massive scales (such as millions of processes).
Static versus dynamic analysis. We distinguish between two classes of program analysis for debugging: static and dynamic analysis. Static analysis attempts to find bugs in a program without running it (such as by analyzing the source code), whereas dynamic analysis does the same by running the program (such as by analyzing traces that are obtained at runtime). Static-analysis debugging tools can catch a variety of defects but are inadequate for debugging large-scale parallel computing applications. They lack runtime information (such as exchanged messages among processes), which is a major disadvantage, as this information can be crucial in understanding the propagation of errors.
To illustrate the problem, consider “static program slicing,” a widely used static-analysis technique in debugging and software testing.26 Slicing uses data-flow and control-flow analysis to identify a set of program statements, or the “slice,” that could affect the value of a variable at some point of interest in the program. Given an erroneous statement, developers use slicing to identify code that could have (potentially) caused the statement to fail. Consider Figure 2, where an MPI program has an error in statement 10 when saving the results of some computation. A static slice includes lines 8–10, because the result variable, used in line 10, depends on statements 8 and 9. However, due to the single-program-multiple-data nature of MPI applications, the programmer must also account for data propagation among processes and include lines 3–6 as possible sources of the bug. The programmer could statically identify which send operation could possibly match each receive operation and use the match for static slicing, but this would be too inaccurate (and, as a consequence, inefficient) since matching parameters (such as source, destination, and tag) are often variables. Matching can be done accurately only through dynamic-analysis techniques, our focus here.
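Figure 2 is not reproduced here, but a hypothetical MPI fragment in its spirit (the code and the statement numbers in the comments are our own illustration, not the figure’s actual program) shows why a purely static slice misses the cross-process dependence:

```c
/* Hypothetical fragment in the spirit of Figure 2. */
#include <mpi.h>
#include <stdio.h>

void example(int rank, int nprocs, const double *data, int n)
{
    if (rank != 0) {
        /* "lines 3-6": non-root ranks compute a partial value and send it */
        double partial = 0.0;
        for (int i = 0; i < n; i++)
            partial += data[i];
        MPI_Send(&partial, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else {
        /* "lines 8-10": rank 0 receives and accumulates into 'result' */
        double result = 0.0;
        for (int p = 1; p < nprocs; p++) {
            double partial;
            MPI_Recv(&partial, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            result += partial;
        }
        printf("result = %f\n", result);  /* "statement 10": error observed here */
    }
}
```

A static slice on result stops at the MPI_Recv call; because the receive uses MPI_ANY_SOURCE, only a dynamic analysis can tell which MPI_Send produced each received value, which is exactly the cross-process dependence described above.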
Most slicing techniques for debugging message-passing programs thus use dynamic slicing, which considers a particular program execution, including communicated data. However, dynamically slicing a message-passing program usually does not scale well; computing a dynamic slice for each MPI process takes at least O(p), where p is the number of processes. Further, dynamic slicing imposes a high computational cost to generate traces of each task (typically by code instrumentation) and aggregate those traces centrally to construct the slice. To avoid these limitations, the dynamic-analysis techniques presented here leverage lightweight tracing (such as stack tracing), as well as distributed and scalable trace merging.
Complementary approach: Formal methods for MPI. Formal analysis of MPI programs9 can detect a variety of bugs; for example, ISP24 provides formal coverage guarantees against deadlocks and local safety assertions. A major limitation of these tools is that rule checking incurs high computational overhead due to state explosion. In addition, in current implementations, MPI message scheduling is performed in a centralized engine, limiting scalability. The accuracy of the tools can be sacrificed to achieve better scalability for specific checks. However, these approaches have been shown to work up to only a few thousand processes; more research is needed to increase their capability for larger process counts. To detect a wider spectrum of bugs, formal analysis tools can be coupled with runtime MPI checkers (such as Umpire,25 Marmot,14 and MUST11). However, the debugging process is still predominantly manual, and the tools cover only bugs in the application, not in the MPI library.
Approach overview. Here, we focus on dynamic techniques for detecting bug manifestations, or software errors, in large-scale parallel applications that use MPI. Our techniques aid developers by pinpointing the root cause of an error at varying granularities.
Discovering scaling bugs. A challenge in developing large-scale applications is finding bugs that are latent at small scales of testing but manifest when the application is deployed at a large scale. A scaling bug can manifest under either strong or weak scaling. Statistical techniques do not work well because they require comparison with error-free runs at the same scale at which the problem manifests itself, and such error-free runs are often unavailable at large scale. We later describe recent developments for dealing with scaling bugs.
Behavior-based debugging. These techniques follow the observation that, although large-scale runs involve a large number of processes, the processes usually fit in only a few behavioral groups. Previous debuggers for MPI applications (such as the Prism debugger23) leveraged this notion (of behavioral grouping) to reduce the search space from thousands of processes to a few process groups to help the developer in the bug-hunting process. The techniques we present here extend this notion to build scalable and more automated debugging tools by considering the behavior across time of the processes, rather than an isolated snapshot in time.
A bug may manifest in such a way that the behavior of a few processes differs from that of the majority. Abnormal behavior can thus be detected by identifying the abnormal processes. Later, we present AutomaDeD, a tool that automatically finds abnormal processes in MPI applications by modeling the control-flow and timing behavior of each MPI process as a Markov model. These models are used to automatically pinpoint suspicious processes and code regions in a scalable manner.
Our experience is that, when errors occur at large scale, only a limited number of behaviors are observed across processes, with only a few processes following the erroneous path where the fault first manifests. Later, we present the Stack Trace Analysis Tool (STAT), which highlights these behaviors by attaching to all processes in a large-scale job, gathering stack traces, and merging the stack traces into a prefix tree to identify which processes are executing similar code.
Software defects in MPI. Although MPI is widely used in large-scale HPC applications, and its implementations are generally of high quality, MPI libraries continue to suffer from software bugs, especially when ported to new machines. Many bugs are subtle and difficult for the average programmer to detect.5 Because MPI libraries are so widely used, such bugs may affect many parallel jobs. We later present FlowChecker, a technique for detecting communication-related bugs in MPI libraries; it checks whether messages are correctly delivered from sources to destinations by comparing the flow of messages in the library against the program’s intentions.
Discovering Scaling Bugs
As discussed earlier, scaling bugs are difficult to handle through traditional debugging techniques and are becoming more prominent with the incessant drive to execute HPC applications at massive scales. The goal is to detect them early and narrow down the originating code.
Zhou et al.28 developed an early attempt to retool traditional statistical debugging techniques to scaling bugs, an approach that builds a statistical model from training runs expected by the programmer to not be affected by the bug. The model captures a program’s expected behavior with simple control flow statistics (such as number of times a particular branch is taken or number of times a particular calling context appears). Then, at deployment time, these statistics are measured at different points in the execution; if their values fall outside the model parameters, the deployed run is deemed to manifest a bug, since it is different from the training runs. Examination of the deviating program features localizes anomalies to find potential bugs.
Zhou et al.’s approach28 failed to find scaling bugs for a simple but fundamental reason: If the statistical model is trained on only small-scale runs, statistical techniques can result in numerous false positives. Program behavior naturally changes as the level of concurrency increases; for example, the number of times a branch in a loop is taken depends on the number of loop iterations that can depend on the scale. Small-scale models will incorrectly label correct behaviors at large scales as anomalous, and therefore erroneous. This effect is particularly insidious with strong scaling, where scaling up divides work among more and more processes, but each process does progressively less work.
The most successful techniques for handling scaling bugs use scale as a parameter of the model and rely on the availability of bug-free runs at small scales.28,29 Vrisha, a framework that uses many small-scale training runs, is an example of such a technique; it builds a statistical model (based on kernel canonical correlation analysis) from these runs to infer the relationship between scale and program behavior. The technique, in essence, builds a scaling model for the program that can extrapolate the aggregated behavioral trend as the input or system size scales up. Bugs can be detected automatically by identifying deviations from the trend. This detection can occur even if the program is run at a previously unmeasured scale; Figure 3 outlines this approach.
However, Vrisha identifies only that the scaling trend has been violated, not which program feature violated the trend or where in the program the bug manifests. The WuKong framework addresses this limitation. If all features are combined into a single model (as in Vrisha), they cannot be “reverse-mapped” to identify which feature caused the deviation. WuKong therefore uses per-feature regression models, built across multiple training scales, that can accurately predict the expected bug-free behavior of each feature at large scales. When presented with a large-scale execution, WuKong uses these models to infer what the value of each feature would have been in a bug-free run. If any value deviates from the prediction, WuKong reports a bug. WuKong identifies the lines of code that result in the unexpected behavior by ranking features based on their prediction error; program features are chosen carefully so they can be linked to particular regions of code. This ranked list provides a roadmap programmers can use to track down the bug.
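As a rough illustration of the per-feature idea (not WuKong’s actual model, which is built with more sophisticated regression), the sketch below fits one least-squares line per feature against the logarithm of the scale using small training runs, then flags a large-scale observation that deviates from the extrapolation. The feature values, threshold, and scales are hypothetical:

```c
#include <math.h>
#include <stdio.h>

/* Fit y = a + b*log(scale) by ordinary least squares over training runs. */
static void fit(const int *scales, const double *values, int n,
                double *a, double *b)
{
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        double x = log((double)scales[i]);
        sx += x; sy += values[i]; sxx += x * x; sxy += x * values[i];
    }
    *b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    *a = (sy - *b * sx) / n;
}

int main(void)
{
    /* Hypothetical training data: one control-flow feature (say, a branch
     * count) measured at small scales. */
    int scales[]     = { 8, 16, 32, 64, 128 };
    double feature[] = { 3.1, 4.0, 5.2, 6.1, 7.0 };
    double a, b;
    fit(scales, feature, 5, &a, &b);

    /* Observation from a large-scale run. */
    int big_scale = 65536;
    double observed = 42.0;
    double predicted = a + b * log((double)big_scale);

    /* Flag a bug if the relative prediction error exceeds a (hypothetical)
     * threshold; a real tool would rank features by this error. */
    double rel_err = fabs(observed - predicted) / fabs(predicted);
    if (rel_err > 0.25)
        printf("feature deviates: predicted %.2f, observed %.2f\n",
               predicted, observed);
    return 0;
}
```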
Scaling models cannot predict program behaviors (or features) that do not correlate with scale. WuKong and related techniques therefore use cross-validation to prune, from the training runs, features that are difficult to model accurately, an approach that limits the behaviors they can predict.
Behavior-Based Debugging
An application bug can manifest in such a way that the behavior of a few buggy parallel tasks differs from that of the majority of the tasks. We leverage this observation to focus the developer’s attention on the few buggy tasks, rather than requiring an exhaustive investigation. Our approach builds a behavioral model for tasks that allows us to compare the behavior of one task with another. To isolate bugs on the largest supercomputers, programmers need scalable distributed algorithms that can isolate anomalous behaviors quickly. We have implemented all of these elements in our AutomaDeD framework.3,16
AutomaDeD uses a simple model of task behavior that captures timing information and patterns in each task’s control flow. The timing information allows programmers to detect performance problems, and the control-flow model allows them to isolate those problems to particular code regions; Figure 4 is an overview of AutomaDeD. On each concurrent task, a measurement tool builds a model of execution at the granularity of code blocks and the execution paths between them. To keep the model lightweight, we designed it to model not basic blocks but code regions between communication (MPI) calls. AutomaDeD intercepts MPI calls dynamically, building its model of each task from these instrumentation functions. Each task’s behavior is stored as a semi-Markov model (SMM). States in the model represent communication code regions (code within MPI routines) and computation code regions (code executed between two MPI communication routines). A state consists of a call-stack trace sampled at each MPI call entry; the stack trace gives the dynamic calling context of each communication call. SMM edges represent transfer of control between two states. AutomaDeD assigns two attributes to each edge: a transition probability and a distribution of execution times. The distribution models the time spent in the source state for each dynamic invocation in which control is transferred next to the destination state.
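MPI’s standard profiling interface (PMPI) is one common way to intercept calls of this kind. The sketch below is our own minimal illustration, not AutomaDeD’s actual implementation: it wraps MPI_Send, records a state transition (call site plus elapsed time since the previous MPI call), and forwards to the real library routine. The record_transition helper is a hypothetical stand-in for the model update:

```c
#include <mpi.h>

/* record_transition() stands in for the tool's model update: a real tool
 * would map the current call stack (e.g., via libunwind) to a state and add
 * an edge from the previous state, annotated with the elapsed compute time. */
extern void record_transition(const char *call_site, double elapsed);

/* A real tool would initialize this in an MPI_Init wrapper. */
static double last_event_time = 0.0;

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    double now = MPI_Wtime();
    /* Time spent in the computation region since the previous MPI call
     * feeds the edge's execution-time distribution. */
    record_transition("MPI_Send", now - last_event_time);

    int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);

    last_event_time = MPI_Wtime();
    return rc;
}
```

Because the wrapper only timestamps and records at MPI call boundaries, the per-call overhead stays small, which is what keeps this style of modeling lightweight compared with basic-block instrumentation.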
Given per-task models, AutomaDeD identifies anomalous behaviors through a clustering algorithm that takes as input a set of per-task models and a (dis)similarity metric for comparing two models. The algorithm outputs groups of tasks with similar models. Clusters of tasks are considered unusual if AutomaDeD does not expect them based on the clusters seen in error-free training runs or, in an unsupervised setting, in error-free iterations of the same run. For example, consider a master-slave parallel application that, in an error-free run, can be clustered into two behavioral groups: a master cluster and a slave cluster. If AutomaDeD finds a third cluster in a faulty run, it flags that cluster as unusual, focusing attention on the tasks it contains. A similar algorithm identifies the state (or code region) in which the error first manifests in those tasks. We have found other techniques for identifying task outliers useful as well, including finding tasks that lie abnormally far from their cluster centers and tasks that are, on average, far from their neighbors under nearest-neighbor algorithms.16
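One simple (dis)similarity metric between two such models, shown below as our own illustrative choice rather than the metric AutomaDeD actually uses, is the summed absolute difference of edge transition probabilities; a fuller metric would also compare the timing distributions:

```c
#include <math.h>

/* Illustrative dissimilarity between two per-task models whose transition
 * probabilities are stored as dense nstates x nstates matrices. */
double model_dissimilarity(const double *p_a, const double *p_b, int nstates)
{
    double d = 0.0;
    for (int i = 0; i < nstates * nstates; i++)
        d += fabs(p_a[i] - p_b[i]);
    return d;
}
```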
Hangs and deadlocks are common bug manifestations in HPC applications and are notably difficult to diagnose at massive scale. Large-scale HPC applications are tightly coupled (such as when an MPI collective operation involves participation by many tasks), so a fault in one task quickly propagates to others. We developed the concept of a progress-dependence graph (PDG) of tasks that captures the partial ordering of tasks with respect to their progress toward the final computation result.15,20 The PDG is then used to determine the least-progressed task, the likely root cause of the problem. AutomaDeD uses its Markov models to create the PDG. Progress dependencies are calculated based on how the MPI tasks visit different application states. For example, suppose two tasks, x and y, are blocked on different states, Sx and Sy, respectively. If y has visited Sx before (and no path exists from y to x in the model), y probably depends on x to make progress, and an edge is created in the PDG from y to x. Figure 5 is a snippet of a PDG with three classes of tasks (0–2, 3, 4–1,000). Task 3 has made the least progress according to the PDG and would then be examined further as the likely root cause of the fault.
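A minimal sketch of that edge rule follows, using our own data layout (per-task visit sets and current states) rather than AutomaDeD’s distributed implementation; the reachability check has_path is a stand-in for querying the Markov model, and we read the text’s “no path from y to x” as no path from y’s current state to x’s current state:

```c
#include <stdbool.h>

/* has_path(a, b): would return true if state b is reachable from state a
 * in the task's Markov model (stand-in, not implemented here). */
extern bool has_path(int from_state, int to_state);

/* visited[t * nstates + s]: has task t visited state s?
 * current[t]: state in which task t is currently blocked. */
void build_pdg_edges(int ntasks, int nstates, const bool *visited,
                     const int *current,
                     void (*add_edge)(int from_task, int to_task))
{
    for (int y = 0; y < ntasks; y++) {
        for (int x = 0; x < ntasks; x++) {
            if (x == y || current[x] == current[y])
                continue;
            /* y has already passed through x's blocked state and cannot
             * reach it again, so y likely waits on x: add edge y -> x. */
            if (visited[y * nstates + current[x]] &&
                !has_path(current[y], current[x]))
                add_edge(y, x);
        }
    }
}
```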
The combination of local per-task models and distributed outlier and progress analysis allows AutomaDeD to focus developers’ attention on the erroneous period of time, the parallel tasks involved, and the specific code regions affected by faults and performance problems in parallel applications. This focus substantially reduces the bug search space, relieving the developer of having to compare potentially millions of tasks and even more possible execution paths.
All detection and diagnosis steps in AutomaDeD (including outlier detection and progress-dependence analysis) are performed in a distributed fashion and thus scale to large numbers of parallel tasks.16 Outlier-task isolation uses scalable clustering,7 and progress dependence is calculated through a combination of binomial-tree-based reductions and per-task local analysis. The complexity of these procedures with respect to the number of parallel tasks is, in the worst case, logarithmic. AutomaDeD has been used to diagnose the origin of real-world bugs that appear only at large scales. Laguna et al.15 detailed how we used AutomaDeD to identify the offending task when a hang manifested in a molecular dynamics application with more than 8,000 tasks on the BlueGene/L supercomputer.
Stack Trace Analysis Tool
The Stack Trace Analysis Tool (STAT)2 is a lightweight, highly scalable debugging tool for identifying errors in code running on the world’s largest supercomputers. STAT gathers stack traces from a parallel application and merges them into a call-graph prefix tree. The merge operation groups together traces from different processes with the same calling sequence, labeling the edges of the merged tree with the count and set of MPI ranks that exhibited that calling sequence (see Figure 6). Further, nodes in the prefix tree visited by the same set of processes are given the same color, giving the programmer a quick means of identifying the various process equivalence classes.
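As a rough sketch of the merge step (a simplified, single-node data structure of our own, not STAT’s MRNet-based implementation), each trace is a list of frame names walked from the root of the call graph; identical prefixes share nodes, and each node accumulates the number of ranks that reached it (a real tool would also keep the rank set, for example as a bit vector):

```c
#include <stdlib.h>
#include <string.h>

/* Simplified call-graph prefix tree (illustrative only). */
typedef struct node {
    const char  *frame;     /* function name at this depth */
    int          nranks;    /* number of ranks whose traces include it */
    struct node *child, *sibling;
} node;

static node *get_child(node *parent, const char *frame)
{
    for (node *c = parent->child; c; c = c->sibling)
        if (strcmp(c->frame, frame) == 0)
            return c;
    node *c = calloc(1, sizeof *c);   /* zeroes nranks, child, sibling */
    c->frame = frame;
    c->sibling = parent->child;
    parent->child = c;
    return c;
}

/* Merge one rank's stack trace (outermost frame first) into the tree. */
void merge_trace(node *root, const char **frames, int depth)
{
    node *cur = root;
    cur->nranks++;
    for (int i = 0; i < depth; i++) {
        cur = get_child(cur, frames[i]);
        cur->nranks++;
    }
}
```

Ranks whose traces follow the same path end up counted on the same nodes, which is the grouping STAT exposes to the user as process equivalence classes.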
Process equivalence classes are STAT’s main debugging idiom for helping programmers focus on a manageable number of processes. Bulk-synchronous applications, particularly in a hang state, typically exhibit a small number of classes behaving in different ways: a few processes taking an anomalous call path, a similarly small number of processes waiting for point-to-point communication, and the remaining processes stuck in a collective operation. Programmers can thus effectively debug a large-scale application, even one with more than a million MPI processes,17 by focusing on a small subset of tasks, in particular a single representative of each equivalence class.
While conceptually simple, STAT is effective at isolating highly elusive errors that often emerge in production computing environments (see the sidebar “Real-World Debugging Case”). In addition, STAT’s development, deployment, and use give programmers a blueprint for scalable tools. On extreme-scale systems, we learned, an effective tool must scale in several dimensions. For one, it must be capable of presenting large amounts of debug information without overwhelming programmers. Equally important, it must itself become a highly scalable application, collecting, managing, and reducing debug data without suffering a scalability bottleneck.
To become a highly scalable application, STAT was designed atop scalable infrastructures, such as LaunchMON1 for tool start-up and bootstrapping and the MRNet22 tree-based overlay network for communication. Even so, as we have tested and deployed it on increasingly larger systems, scalability bottlenecks have continually surfaced. We have thus had to keep innovating to improve its key attributes, from internal data representations18 to file-access patterns and testing methods.18
Software Defects in MPI
Bugs in MPI libraries can be devastating in a production supercomputing environment, as most parallel supercomputing applications depend on the MPI library. MPI bugs can be especially frustrating for application developers since they could appear to be bugs in client code. Further, these bugs can be tedious for library developers to catch and fix.21 MPI users’ machines can have architectures or compilers that differ from library developers’ machines. Worse, the library developers may not have a system as large as the one on which the bug manifests. Certain bugs occur only on large-scale systems.2 Others manifest only in large MPI applications, and if the applications are sensitive or proprietary, users usually have to generate a small reproducing test program to send to developers, a possibly time-consuming process. In many cases, library developers may simply be unable to reproduce a scaling bug.
FlowChecker is a low-overhead method for detecting bugs in MPI libraries5 (see Figure 7). Its main function is to check whether the underlying MPI library correctly delivers messages from sources to destinations as specified by the MPI application. Like a delivery service’s package-tracking system, FlowChecker’s detection process includes extraction of message-passing intentions (source and destination addresses), message-flow tracking (package transmission and delivery), and message-delivery verification (user confirmation).
FlowChecker first extracts message-passing intentions (“MP intentions”) from MPI applications. Normally, MP intentions are implied by MPI function calls and their arguments; for example, FlowChecker can extract an MP intention from a matched pair of MPI_Send and MPI_Recv calls, collected by instrumenting the MPI application.
Next, for each MP intention, FlowChecker tracks the corresponding message flows by following the relevant data-movement operations, starting from the send buffers at the source process. Data-movement operations move data from one memory location to another, within one process or between two processes.5,8 Examples include memory copies and network sends/receives, collected by instrumenting the MPI library. This step allows FlowChecker to understand how the MPI library performs message transmission and delivery.
Finally, FlowChecker checks the message flows (from the second step) against the MP intentions (from the first step). If FlowChecker finds a mismatch, it reports a bug and provides further diagnostic information (such as the faulty MPI functions or incorrect data movements).
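A minimal sketch of this verification step follows, assuming an intention is recorded as a source/destination buffer pair and a flow as the endpoints the library was observed to use; the structure and field names are our own illustration of the idea, not FlowChecker’s code, and the two arrays are assumed to be index-aligned:

```c
#include <stdio.h>

/* An MP intention maps an application send buffer to the intended receive
 * buffer; a tracked flow records where the library actually delivered it. */
typedef struct { const void *src_buf; const void *intended_dst; } intention;
typedef struct { const void *src_buf; const void *observed_dst; } flow;

/* Report a bug for every tracked flow that ends somewhere other than the
 * destination the application asked for. Returns the number of mismatches. */
int verify_delivery(const intention *ints, const flow *flows, int n)
{
    int bugs = 0;
    for (int i = 0; i < n; i++) {
        if (ints[i].src_buf == flows[i].src_buf &&
            ints[i].intended_dst != flows[i].observed_dst) {
            printf("bug: data from %p delivered to %p, expected %p\n",
                   ints[i].src_buf, flows[i].observed_dst,
                   ints[i].intended_dst);
            bugs++;
        }
    }
    return bugs;
}
```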
In Figure 7, which reflects the main idea of FlowChecker, process 1 invokes MPI_Send to send a message stored in the buffer {A1, A2} to process 2, and process 2 invokes MPI_Recv to receive the message and store it in the buffer {D1, D2}. The MP intention is {A1→D1, A2→D2}, and the observed message flows are A1→B1→C1→D1 and A2→B2→C2→D2′. Comparing the MP intention with the corresponding message flows, FlowChecker detects that the data in A2 is delivered to an incorrect destination, D2′, instead of D2, as specified by the application. FlowChecker additionally provides further diagnostic information: the last broken message flow, {C2→D2′}.
Scalability and Performance
The scalability of these techniques—Vrisha, WuKong, AutomaDeD, STAT, and FlowChecker—is mainly a function of their algorithmic complexity with respect to scale, including the number of processes and the size of the input data. All are dynamic, so they first collect application traces online (as the application runs) and then process them, online or offline. A programmer would assess their performance by measuring application slowdown (caused by gathering traces) and the time to process the collected traces, or analysis time. Slowdown is the ratio of the execution time with the tool to the execution time without the tool.
We have had debugging sessions with STAT on up to 1,572,864 MPI processes on the IBM Sequoia Blue Gene/Q supercomputer, demonstrating that STAT’s slowdown and analysis time are small. We have used AutomaDeD on up to 100,000 MPI processes. Its slowdown, which depends on the number of MPI calls intercepted, is on average 1.2 (based on 10 tested applications), and its analysis time is less than 10 seconds for up to 100,000 tasks.
We have used Vrisha, WuKong, and FlowChecker at smaller scales (on the order of 1,000 processes); however, there is no fundamental limitation in their design that prevents running them at the largest scales (they have no centralized element or algorithmically costly procedures with respect to scale). Their performance is affected by the binary instrumentation needed to gather traces and by the storage overhead of saving them. The models Vrisha and WuKong use take less than a second to train, but application slowdown can be between 1.08 and 1.3. FlowChecker has a very small slowdown (less than 1.1), but the size of its traces can grow quickly, with application execution producing data at rates between 1.77 MB/min and 10.08 MB/min. For tools like AutomaDeD, this is not a problem, since traces are processed online.
Conclusion
We have argued that a robust set of dynamic debugging tools is required for deploying large-scale parallel applications, complementing ongoing work in static analysis, relative debugging, and formal methods for the development of correct parallel applications. We surveyed the state-of-the-art techniques for dynamic debugging. The first class of techniques we discussed targets software bugs that arise only when the application runs at large scale, using scale as a model parameter to extrapolate from the behavior of small-scale runs that are assumed correct. The second class is behavior-based detection via clustering, where behavior includes control flow and timing at a certain granularity of code regions. A sub-class within this technique collects stack traces from each individual process and then, using the insight that the stack traces exhibit great commonality, merges multiple traces into one, giving a developer a limited, manageable amount of information. The third class targets bugs in communication libraries by verifying that program intent matches observed behavior. These software packages build on robust, usable tools that include stack tracing, dynamic binary instrumentation, scalable data analytics (such as clustering), and runtime profiling.
Looking ahead, discoveries are needed in three main research directions. First, higher-level software-development environments must be created for parallel programming to enable faster development with fewer defects. These environments should support standard principles of software engineering (such as refactoring, code completion, quick-fix support, and intuitive user interfaces). Second, debugging support must be extended to performance bugs, in addition to the current focus on correctness bugs; application models must thus include performance measures and support extrapolation of performance characteristics in various dimensions (such as larger process counts, larger data sizes, and different datasets). Third, the programmer’s attention must be directed to data corruption resulting from software bugs that often fall outside the purview of existing detection techniques because they do not cause hangs, crashes, or performance slowdowns but “silently” corrupt the output data. Such bugs raise the specter of incorrect science, which all programmers surely wish to avoid.
Dynamic debugging techniques have spurred significant innovation and robust tool development. We will build on this promising base to address further challenges, some of which we have explored here.
Acknowledgments
The research and development related to this article was performed partially under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DEAC52-07NA27344 (LLNL-JRNL-652400) and support from National Science Foundation awards CNS-0916337, CCF-1337158, CCF-0953759, and CNS-0403342.
Figures
Figure 1. Sample scaling bug in the MPICH2 library caused by an overflow in the if statement when run at large process counts or with a large amount of data.
Figure 2. Debugging example using static analysis.
Figure 3. An overview of the Vrisha-based approach to detecting and diagnosing scaling bugs, relying on scale as a model parameter and training on error-free runs at small execution scales.
Figure 4. Overview of the AutomaDeD framework.
Figure 5. Progress-dependence graph.
Figure 6. View of STAT for a run on more than one million MPI processes.
Figure 7. An example of how FlowChecker is used to find bugs in MPI communication libraries.