As computers become parallel, performance-challenged programs must also become parallel. For some algorithms, this is not a great challenge as the underlying problem divides naturally into independent pieces that can be computed concurrently. Other problems are not so lucky. Their constituent computations are tightly interdependent, and a parallel implementation requires considerable coordination and synchronization and may still perform poorly.

Recursive algorithms fall into this category. The recursive call is a dependence between the calculations in successive function invocations, which makes it difficult to overlap their executions significantly. There is no general formula for transforming a recursive function for parallel execution. It is necessary to understand the intrinsic structure of the underlying computation and to find a way to preserve the essential relationships while running in parallel.

The following paper shows how some instances of dynamic programming, an important recursive algorithmic paradigm, can be effectively parallelized by taking advantage of the algebraic properties of the underlying computations. Dynamic programming uses a table to store the values of subcomputations and consults these intermediate results to compute new values. Depending on which values are accessed, it is possible to execute some calculations concurrently along a column or diagonal of the table. However, each of these calculations is typically small. The overhead of communication and synchronization limits which computations can profitably execute in parallel. In addition, the amount of parallelism is limited by the size of the table.

The authors demonstrate another way to parallelize these computations by dividing the table into independent chunks, each of which can be computed independently, and then these intermediate results "patched" to correctly account for missing dependencies from calculations that should have been performed earlier. The benefits of this approach are clear—large independent computations execute well on multicore processors and incur less overhead than fine-grained computation.

A dynamic programming algorithm is a sequence of recursive calls, where each such call is a subcomputation on a prefix of the input that produces a result for the computation on the current input. The table memorizes subcomputations so they need be computed only once.

Suppose we want to execute this sequence of calls concurrently on *P* processors. If we divide the sequence into *P* chunks and run each independently, all computations except the first one are likely to produce a wrong answer, since they lack the results that should have been put in the table earlier. This paper shows how to fix these incorrect answers for an important class of problems.

The method applies to dynamic programming problems in which the computation can be described by a tropical semiring, an algebraic structure with two operators and a zero element. For dynamic programming, the semiring is defined over matrices, with the standard matrix product redefined by replacing multiplication with addition and addition with max. The key insight of this paper is that sequentially applying this matrix product operation never increases the rank of the result matrix and, in practice, a sequence of these operations often converges to a rank-1 matrix. At this point, the final result of the sequence of matrix products is parallel to the rank-1 intermediate results, differing only in magnitude.

This paper is a nice reminder of the value of looking beyond the natural formulation of a computation to its underlying structure.

This insight leads to an efficient coarse-grained parallelization. Break this sequence into *P* independent computations, each starting on a contiguous block of product computations. Each computation, except the first one, may be wrong since it ignores the earlier computations. However, they can be fixed by sequentially propagating correct results from the prior computation and redoing the product calculation until it produces a rank-1 matrix, at which point the rest of the calculations can be skipped since the final result differs only by an easily calculable offset.

In practice, for many problems, convergence to rank-1 is very quick; in others, it is slower or never occurs. But, in the cases where convergence is rapid (for example, Viterbi and Smith-Waterman) and the input is large, the resulting algorithm performs very well, even producing near-linear speedup on the latter problem for greater than 100 cores.

This paper is a nice reminder of the value of looking beyond the natural formulation of a computation to its underlying structure when a program does not naturally parallelize.

## Join the Discussion (0)

## Become a Member or Sign In to Post a Comment