Say you want to simulate the motion over time of the stars in a galaxy to learn about how galaxies formed and why the universe appears as it does to us. This seems like a simple computational problem—the motion is governed by Newton’s laws (let’s ignore relativity) and simply requires computing the force of every star on every other. For N stars, that is N² operations, and for billions of stars, that is a billion billion operations. Is it feasible on a single processor? Sadly, even with today’s fastest single processor, the answer is "no" for even a single moment in time. Even worse, this computation must be repeated millions or billions of times as the motion of each star is followed. (If you prefer biology or materials science, replace "stars" with "atoms.")
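To make the scale concrete, here is a minimal sketch of the direct all-pairs force computation described above. The Body struct, softening term, and units are illustrative choices, not taken from any particular code; the two nested loops are what produce the N² cost.

```cpp
// Minimal sketch of the direct O(N^2) force computation (illustrative only).
#include <cmath>
#include <cstddef>
#include <vector>

struct Body {
    double x, y, z;    // position
    double mass;
    double fx, fy, fz; // accumulated force
};

// Newtonian gravity: every body pulls on every other body.
// Two nested loops over N bodies -> on the order of N^2 pair interactions.
void compute_forces_direct(std::vector<Body>& bodies, double G = 6.674e-11,
                           double softening = 1e-9) {
    for (auto& b : bodies) { b.fx = b.fy = b.fz = 0.0; }
    const std::size_t n = bodies.size();
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            double dx = bodies[j].x - bodies[i].x;
            double dy = bodies[j].y - bodies[i].y;
            double dz = bodies[j].z - bodies[i].z;
            double r2 = dx * dx + dy * dy + dz * dz + softening;
            double inv_r3 = 1.0 / (r2 * std::sqrt(r2));
            double f = G * bodies[i].mass * bodies[j].mass * inv_r3;
            // Newton's third law: apply equal and opposite forces to the pair.
            bodies[i].fx += f * dx;  bodies[j].fx -= f * dx;
            bodies[i].fy += f * dy;  bodies[j].fy -= f * dy;
            bodies[i].fz += f * dz;  bodies[j].fz -= f * dz;
        }
    }
}
```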
So is parallelism the answer? Unfortunately not. Even the biggest systems today do not provide enough processing power to handle these problems (supercomputers may be fast, but they are not infinitely fast).
What can we do?
The following paper by Lashuk et al. shows the way. Solving such a difficult problem requires taking advantage of everything possible. The authors start with the algorithm. While the obvious way of computing the mutual interaction of N bodies takes time quadratic in the number of bodies, there is a clever method that, with rigorous error bounds, can compute those interactions to any specified accuracy in linear time! This algorithm, developed 25 years ago by Greengard and Rokhlin, exploits a hierarchical representation of the forces. But even a linear-time algorithm on a single processor is not fast enough for the problems that scientists want to solve. So, having picked an excellent algorithm, the authors proceed to parallelize it. However, the choice of algorithm does not make that easy. Because the algorithm uses a hierarchical representation, the natural data structures are trees, and because the distribution of bodies, be they stars or atoms, is non-uniform, the trees are unbalanced. This makes effective parallelization difficult, because achieving good load balance is essential to parallel efficiency. The paper then tackles parallelism at multiple levels, starting with distributed-memory parallelism, continuing to multicore processors, and finishing with GPUs. A very pragmatic argument is made for the use of hybrid parallel programming models (specifically, MPI combined with OpenMP and/or CUDA): a single programming model is not always the best solution for real applications.
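To give a feel for what a hierarchical representation of the forces looks like, here is a sketch of a Barnes-Hut-style treecode, a simpler relative of the fast multipole method: distant groups of bodies are replaced by their center of mass, giving O(N log N) work, whereas the FMM of Greengard and Rokhlin also expands the far field and reaches O(N). The octree layout, the theta acceptance parameter, the OpenMP pragma, and the omission of tree construction are all illustrative assumptions, not the paper's implementation.

```cpp
// Barnes-Hut-style treecode sketch (illustrative; tree construction omitted).
#include <array>
#include <cmath>
#include <cstddef>
#include <memory>
#include <vector>

struct Node {
    double cx, cy, cz;   // center of mass of all bodies in this cell
    double mass;         // total mass in this cell
    double size;         // edge length of the cubic cell
    std::vector<std::array<double, 4>> bodies;      // leaf: (x, y, z, mass)
    std::array<std::unique_ptr<Node>, 8> children;  // internal: octree children
    bool is_leaf() const {
        for (const auto& c : children) if (c) return false;
        return true;
    }
};

// Accumulate the acceleration on a point (px, py, pz) by walking the octree.
// A cell far enough away (size / distance < theta) is treated as a single
// point mass; otherwise we descend into its children.
void accumulate(const Node& node, double px, double py, double pz,
                double theta, double G, double acc[3]) {
    double dx = node.cx - px, dy = node.cy - py, dz = node.cz - pz;
    double r2 = dx * dx + dy * dy + dz * dz + 1e-12;  // softened distance
    double r = std::sqrt(r2);
    if (node.is_leaf()) {
        // Near or far, a leaf is evaluated directly over its few bodies.
        for (const auto& b : node.bodies) {
            double bx = b[0] - px, by = b[1] - py, bz = b[2] - pz;
            double br2 = bx * bx + by * by + bz * bz + 1e-12;
            double f = G * b[3] / (br2 * std::sqrt(br2));
            acc[0] += f * bx; acc[1] += f * by; acc[2] += f * bz;
        }
        return;
    }
    if (node.size / r < theta) {
        // Far away: approximate the whole cell by its center of mass.
        double f = G * node.mass / (r2 * r);
        acc[0] += f * dx; acc[1] += f * dy; acc[2] += f * dz;
        return;
    }
    for (const auto& child : node.children)
        if (child) accumulate(*child, px, py, pz, theta, G, acc);
}

// Shared-memory parallelism over target points; in a hybrid code a loop like
// this would run inside each MPI rank (MPI across ranks, OpenMP within one).
void accelerations(const Node& root,
                   const std::vector<std::array<double, 3>>& pts,
                   double theta, double G,
                   std::vector<std::array<double, 3>>& out) {
    out.assign(pts.size(), {0.0, 0.0, 0.0});
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(pts.size()); ++i)
        accumulate(root, pts[i][0], pts[i][1], pts[i][2], theta, G,
                   out[i].data());
}
```

Because the tree adapts to where the bodies actually are, a clustered (non-uniform) distribution produces a deep, lopsided tree, which is exactly what makes distributing the work evenly across processors hard.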
Why should you read the following paper? Because it provides a clear description of solving a difficult problem by effectively combining the best algorithms with the best computers, while making use of the most appropriate strategies for parallelization. Throughout this work you will find the use of simple yet effective complexity analysis to guide and explain the choices that are made. This is high-performance computing at its best—using the best algorithms on the biggest machines; using complexity analysis to engineer a high-performance solution; and using a hierarchy of programming models, each providing high performance for its part of a parallel computer. The result is a tool that scientists can use to solve problems that cannot be solved by any other means—an example of how computing has become critical to the advancement of science.