acm-header
Sign In

Communications of the ACM


3,291 - 3,300 of 3,299 for bentley


Locality Transformations for Nested Recursive Iteration Spaces

There has been a significant amount of effort invested in designing scheduling transformations such as loop tiling and loop fusion that rearrange the execution of dynamic instances of loop nests to place operations that access the same data close together temporally. In recent years, there has been interest in designing similar transformations that operate on recursive programs, but until now these transformations have only considered simple scenarios: multiple recursions to be fused, or a recursion nested inside a simple loop. This paper develops the first set of scheduling transformations for nested recursions: recursive methods that call other recursive methods. These are the recursive analog to nested loops. We present a transformation called recursion twisting that automatically improves locality at all levels of the memory hierarchy, and show that this transformation can yield substantial performance improvements across several benchmarks that exhibit nested recursion.

https://dl.acm.org/ft_gateway.cfm?id=3037720&dwn=1

Optimistic parallelism requires abstractions

Irregular applications, which manipulate large, pointer-based data structures like graphs, are difficult to parallelize manually. Automatic tools and techniques such as restructuring compilers and run-time speculative execution have failed to uncover much parallelism in these applications, in spite of a lot of effort by the research community. These difficulties have even led some researchers to wonder if there is any coarse-grain parallelism worth exploiting in irregular applications.

In this paper, we describe two real-world irregular applications: a Delaunay mesh refinement application and a graphics application thatperforms agglomerative clustering. By studying the algorithms and data structures used in theseapplications, we show that there is substantial coarse-grain, data parallelism in these applications, but that this parallelism is very dependent on the input data and therefore cannot be uncoveredby compiler analysis. In principle, optimistic techniques such asthread-level speculation can be used to uncover this parallelism, but we argue that current implementations cannot accomplish thisbecause they do not use the proper abstractions for the data structuresin these programs.

These insights have informed our design of the Galois system, an object-based optimistic parallelization system for irregular applications. There are three main aspects to Galois: (1) a small number of syntactic constructs for packaging optimistic parallelism as iteration over ordered and unordered sets, (2)assertions about methods in class libraries, and (3) a runtime scheme for detecting and recovering from potentially unsafe accesses to shared memory made by an optimistic computation.

We show that Delaunay mesh generation and agglomerative clustering can be parallelized in a straight-forward way using the Galois approach, and we present experimental measurements to show that this approach is practical. These results suggest that Galois is a practical approach to exploiting data parallelismin irregular programs.

https://dl.acm.org/ft_gateway.cfm?id=1250759&dwn=1

Automatically enhancing locality for tree traversals with traversal splicing

Generally applicable techniques for improving temporal locality in irregular programs, which operate over pointer-based data structures such as trees and graphs, are scarce. Focusing on a subset of irregular programs, namely, tree traversal algorithms like Barnes-Hut and nearest neighbor, previous work has proposed point blocking, a technique analogous to loop tiling in regular programs, to improve locality. However point blocking is highly dependent on point sorting, a technique to reorder points so that consecutive points will have similar traversals. Performing this a priori sort requires an understanding of the semantics of the algorithm and hence highly application specific techniques. In this work, we propose traversal splicing, a new, general, automatic locality optimization for irregular tree traversal codes, that is less sensitive to point order, and hence can deliver substantially better performance, even in the absence of semantic information. For six benchmark algorithms, we show that traversal splicing can deliver single-thread speedups of up to 9.147 (geometric mean: 3.095) over baseline implementations, and up to 4.752 (geometric mean: 2.079) over point-blocked implementations. Further, we show that in many cases, automatically applying traversal splicing to a baseline implementation yields performance that is better than carefully hand-optimized implementations.

https://dl.acm.org/ft_gateway.cfm?id=2384643&dwn=1

PAM: parallel augmented maps

Ordered (key-value) maps are an important and widely-used data type for large-scale data processing frameworks. Beyond simple search, insertion and deletion, more advanced operations such as range extraction, filtering, and bulk updates form a critical part of these frameworks.

We describe an interface for ordered maps that is augmented to support fast range queries and sums, and introduce a parallel and concurrent library called PAM (Parallel Augmented Maps) that implements the interface. The interface includes a wide variety of functions on augmented maps ranging from basic insertion and deletion to more interesting functions such as union, intersection, filtering, extracting ranges, splitting, and range-sums. We describe algorithms for these functions that are efficient both in theory and practice.

As examples of the use of the interface and the performance of PAM we apply the library to four applications: simple range sums, interval trees, 2D range trees, and ranked word index searching. The interface greatly simplifies the implementation of these data structures over direct implementations. Sequentially the code achieves performance that matches or exceeds existing libraries designed specially for a single application, and in parallel our implementation gets speedups ranging from 40 to 90 on 72 cores with 2-way hyperthreading.

https://dl.acm.org/ft_gateway.cfm?id=3178509&dwn=1

Modeling program predictability

Basic properties of program predictability --- for both values and control --- are defined and studied. We take the view that program predictability originates at certain points during a program's execution, flows through subsequent instructions, and then ends at other points in the program. These key components of predictability: generation, propagation, and termination; are defined in terms of a model. The model is based on a graph derived from dynamic data dependences and a predictor.Using the SPEC95 benchmarks, we analyze the predictability phenomena both separately and in combination. Examples are provided to illustrate relationships between model-based characteristics and program constructs. It is shown that most predictability derives from program control structure and immediate values, not program input data. Furthermore, most predictability originates from a relatively small number of generate points. The analysis of obtained results suggests a number of ramifications regarding predictability and its use.

https://dl.acm.org/ft_gateway.cfm?id=279371&dwn=1

TSOtool: A Program for Verifying Memory Systems Using the Memory Consistency Model

In this paper, we describe TSOtool, a program to check thebehavior of the memory subsystem in a shared memorymultiprocessor. TSOtool runs pseudo-randomly generatedprograms with data races on a system compliant with theTotal Store Order (TSO) memory consistency model; it thenchecks the results of the program against the formal TSOspecification. Such analysis can expose subtle memory errorslike data corruption, atomicity violation and illegalinstruction ordering.While verifying TSO compliance completely is an NP-completeproblem, we describe a new polynomial timealgorithm which is incorporated in TSOtool. In spite of beingincomplete, it has been successful in detecting several bugs inthe design of commercial microprocessors and systems,during both pre-silicon and post-silicon phases of validation.

https://dl.acm.org/ft_gateway.cfm?id=1006710&dwn=1

Fractal++: closing the performance gap between fractal and conventional coherence

Cache coherence protocol bugs can cause multicores to fail. Existing coherence verification approaches incur state explosion at small scales or require considerable human effort. As protocols' complexity and multicores' core counts increase, verification continues to be a challenge. Recently, researchers proposed fractal coherence which achieves scalable verification by enforcing observational equivalence between sub-systems in the coherence protocol. A larger subsystem is verified implicitly if a smaller sub-system has been verified. Unfortunately, fractal protocols suffer from two fundamental limitations: (1) indirect-communication: sub-systems cannot directly communicate and (2) partially-serialinvalidations: cores must be invalidated in a specific, serial order. These limitations disallow common performance optimizations used by conventional directory protocols: replyforwarding where caches communicate directly and parallel invalidations. Therefore, fractal protocols lack performance scalability while directory protocols lack verification scalability. To enable both performance and verification scalability, we propose Fractal++ which employs a new class of protocol optimizations for verification-constrained architectures: decoupled-replies, contention-hints, and fully-parallel-fractal-invalidations. The first two optimizations allow reply-forwarding-like performance while the third optimization enables parallel invalidations in fractal protocols. Unlike conventional protocols, Fractal++ preserves observational equivalence and hence is scalably verifiable. In 32- core simulations of single- and four-socket systems, Fractal++ performs nearly as well as a directory protocol while providing scalable verifiability whereas the best-performing previous fractal protocol performs 8% on average and up to 26% worse with a single-socket and 12% on average and up to 34% worse with a longer-latency multi-socket system

https://dl.acm.org/ft_gateway.cfm?id=2665733&dwn=1