HPC: Computational Performance vs. Human Productivity

I have attended the Supercomputing Conference (SC) several times over the past decade, including SC14 in New Orleans last month. One of the themes that I heard at this year’s conference was a new variation on an old tradeoff: productivity vs. performance.

The old tradeoff dates back to the days of the first high-level languages. When creating a new piece of software, one could choose to write it in:

Assembly language, which let the programmer improve the run-time performance by using machine-specific optimizations; or
A high-level language, which let the programmer reduce the time it took to write the software, as well as the time required to port the software to a different architecture.

Because the assembly version of the software ran faster, it offered better performance than the high-level version. Because the high-level version took less time to develop and less work to port and maintain, it made the programmer more productive, provided the performance difference was not too great. If you were (say) a scientist working at a national laboratory where a new supercomputer was installed every few years, you would be willing to use a high-level language and pay a performance penalty, rather than use assembly language and have to rewrite your software for each new supercomputer. It was acceptable to trade some performance for higher productivity, so long as the performance penalty associated with the high-level language was not too large.

(Lest the reader think this is ancient history and that today's compilers can solve the problem for us, consider the work of Kazushige Goto, whose hand-optimized assembly versions of the Basic Linear Algebra Subprograms (BLAS) libraries for the Intel and AMD x86 architectures regularly ran at least 20% faster than compiled versions. 20% was a big enough performance difference that most supercomputers included Goto's version of BLAS until just a few years ago, when accelerators began to dominate the supercomputing scene.)

Jumping forward to today: The top supercomputers are heterogeneous systems of distributed nodes, each containing multicore processors and accelerators. These accelerators may be either general-purpose graphics processing units (GPUs) or coprocessors (e.g., Intel’s Xeon Phi). To use the hardware in these supercomputers efficiently, software developers most often use MPI+X, where:

MPI is the Message Passing Interface, a library interface that is the de facto standard for distributing processes to the nodes and letting those processes communicate with one another.
X is one or more of the following:
- OpenMP, a set of compiler directives and a supporting runtime library for writing multithreaded code for multicore CPUs, designed to make it relatively easy to add parallelism to legacy code.
- CUDA, a language and library that lets a programmer write and tune code that takes advantage of Nvidia GPUs.
- OpenCL, a kernel language and API that lets a programmer write and tune code that will run on any of a node’s cores: CPU cores, coprocessor cores, or GPU cores, regardless of vendor.
- OpenACC, a set of compiler directives similar to those of OpenMP, but aimed at an accelerator’s cores.

In choosing X, one must make a productivity-performance tradeoff decision. The following overview skips over many details, but it will hopefully provide the reader with a high-level understanding of some of the tradeoffs involved.

CUDA

CUDA lets a software developer write highly efficient code for Nvidia GPUs, which dominate the accelerator market. It is the most mature of the accelerator technologies, and there is a wealth of available documentation and examples on the Internet.

However, to maximize software performance on a given Nvidia GPU, one’s CUDA code must be tuned for that device. In particular, the programmer must explicitly organize the GPU’s threads into blocks, organize the blocks into a grid, move data from the computer’s main memory to the GPU’s memory and back, place data in either the GPU’s (larger but slower) global memory or its (smaller but faster) shared memory, and so on. The optimal way to specify all of these depends on the characteristics of a given GPU, which means one’s code may have to be retuned for each new generation of Nvidia device, at least to get maximum performance.
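To make the thread/block/grid bookkeeping concrete, here is a minimal sketch in plain C (not CUDA itself) that emulates how a one-dimensional launch covers an array. The names `blocks_needed` and `emulate_launch` are hypothetical, chosen only for illustration. In real CUDA the two loops would be replaced by a kernel launch, and each thread would compute its index as `blockIdx.x * blockDim.x + threadIdx.x`.

```c
#include <assert.h>
#include <stddef.h>

/* Number of blocks needed so that blocks * threads_per_block >= n
   (integer ceiling division). */
size_t blocks_needed(size_t n, size_t threads_per_block) {
    assert(threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

/* Emulates a 1-D launch on the CPU: every (block, thread) pair computes
   its global index and doubles one array element. Returns the number of
   emulated threads whose index fell past the end of the array and which
   therefore did no work. */
size_t emulate_launch(float *data, size_t n, size_t threads_per_block) {
    size_t grid = blocks_needed(n, threads_per_block);
    size_t idle = 0;
    for (size_t block = 0; block < grid; ++block) {
        for (size_t thread = 0; thread < threads_per_block; ++thread) {
            size_t i = block * threads_per_block + thread;  /* global index */
            if (i < n) {
                data[i] *= 2.0f;
            } else {
                ++idle;  /* bounds guard: this thread has nothing to do */
            }
        }
    }
    return idle;
}
```

With 1000 elements and 256 threads per block, `blocks_needed` gives 4 blocks, so 24 of the 1024 emulated threads idle. The block size is exactly the kind of value that may need retuning from one GPU generation to the next.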

People I talked to at SC14 loved the performance CUDA can deliver on Nvidia GPUs, but did not love the work required to create and tune their programs the first time around, or the work required to retune their programs when those programs were ported to a new machine. They view every minute of that kind of work as lost productivity.

OpenCL

OpenCL lets a software developer write highly efficient code for CPU cores, coprocessor cores, or GPU cores, not just Nvidia GPU cores. However, to achieve this flexibility, OpenCL programs are far more complex than their CUDA counterparts. They also require similar tuning, in that a programmer must explicitly organize work-items (threads) into work-groups (blocks), organize the work-groups into an index space (grid), move data from the computer’s main memory to an accelerator’s memory and back, place data in the (large but slow) global memory or the (small but fast) local memory, and so on.
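The global-versus-local memory decision can be illustrated with a common staging pattern: each work-group copies its slice of global memory into local memory, reduces it there, and writes back a single partial result. The following is a hedged sketch in plain C that only emulates that pattern on the CPU; it is not OpenCL code, and `WORK_GROUP_SIZE`, `num_groups`, and `group_partial_sums` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define WORK_GROUP_SIZE 4  /* hypothetical; a tunable, device-dependent value */

/* Number of work-groups needed to cover n work-items. */
size_t num_groups(size_t n) {
    return (n + WORK_GROUP_SIZE - 1) / WORK_GROUP_SIZE;
}

/* CPU emulation of work-groups that stage a slice of "global" memory in
   a small "local" buffer, sum it there, and write one partial sum per
   group back to global memory. partial must hold num_groups(n) floats. */
void group_partial_sums(const float *global_in, size_t n, float *partial) {
    assert(global_in != NULL && partial != NULL);
    for (size_t group = 0; group < num_groups(n); ++group) {
        float local_buf[WORK_GROUP_SIZE];  /* stands in for local memory */
        for (size_t lid = 0; lid < WORK_GROUP_SIZE; ++lid) {
            size_t gid = group * WORK_GROUP_SIZE + lid;  /* global id */
            /* Work-items past the end of the input contribute zero. */
            local_buf[lid] = (gid < n) ? global_in[gid] : 0.0f;
        }
        float sum = 0.0f;
        for (size_t lid = 0; lid < WORK_GROUP_SIZE; ++lid) {
            sum += local_buf[lid];
        }
        partial[group] = sum;
    }
}
```

On a real device the best work-group size, and whether staging through local memory pays off at all, depends on the hardware, which is why this kind of code tends to need retuning when it is ported.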

As with CUDA, one’s code may well need to be retuned each time it is ported to a new architecture. In head-to-head comparisons, CUDA is generally faster than OpenCL, at least on the Nvidia devices where the two can be directly compared, making tuning at least as important in OpenCL as it is in CUDA.

Because OpenCL programs are more complicated than their CUDA counterparts, they tend to take longer to develop and maintain, which reduces one's productivity even further.

OpenACC

OpenACC seeks to improve one’s productivity by (i) making it easier to write code that will run on an accelerator, and (ii) making the task of porting one’s code to a new architecture as simple as recompiling that code. For example, to exploit a GPU, one may assign code to workers (threads), organize workers into gangs (blocks), and designate chunks of worker-code that are amenable to SIMD execution as vector chunks. However, one may either specify values for each of these or let the compiler choose them. OpenACC also provides built-in support for some parallel patterns that are commonly needed on the accelerator, such as the reduction pattern.
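As a rough illustration of the reduction support, here is a minimal C sketch. The function name `sum_array` is hypothetical. The directive leaves the gang, worker, and vector sizes to the compiler; a compiler without OpenACC support will typically ignore the pragma (possibly with a warning) and run the loop sequentially, producing the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Sums the n elements of x. With an OpenACC compiler, the directive
   requests that the loop be parallelized on the accelerator and that
   the partial sums be combined with +. */
double sum_array(const double *x, size_t n) {
    assert(x != NULL || n == 0);
    double sum = 0.0;
    #pragma acc parallel loop reduction(+:sum) copyin(x[0:n])
    for (size_t i = 0; i < n; ++i) {
        sum += x[i];
    }
    return sum;
}
```

To take control of the mapping instead, one could add clauses such as `num_gangs(...)` or `vector_length(...)` to the directive, at the cost of reintroducing device-specific tuning.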

Researchers at SC14 reported that they had compared the performance of OpenACC against CUDA or OpenCL in the summers of 2013 and 2014. While OpenACC carried a substantial performance penalty in 2013, that penalty had been greatly reduced by summer 2014, though it remained significant. Anecdotally, I heard one researcher say that in 2013, he had written an OpenACC program in 50% of the time it took him to write a CUDA version, but that his CUDA version ran 50% faster than his OpenACC version. By 2014, the OpenACC compiler had improved to the point where his CUDA version was less than 25% faster than his OpenACC version. Compiler vendors like Cray and the Portland Group (PGI) have a commercial interest in improving the performance of OpenACC, so the performance gap between OpenACC and the other technologies seems likely to continue to shrink in the future.

OpenMP

OpenMP has traditionally been an easy-to-use, directive-based API for implicit multithreading on multicore CPUs. However, the OpenMP 4 specification adds directives for running code on target devices (accelerators), on which one may identify work to be done by threads, organize those threads into teams (an abstraction for blocks), and organize teams into a league (an abstraction for grids or other device-specific mappings). OpenMP also provides a simd directive for marking loop code that is amenable to vectorization. As in OpenACC, the exact mappings of these abstractions to the hardware may be left to the compiler or may be explicitly specified.
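A minimal sketch of what these directives look like in C follows. The function name `saxpy` is hypothetical, and the mapping of teams and threads to hardware is left entirely to the compiler. If the compiler does not support OpenMP 4 offloading (or OpenMP is not enabled), the pragma is typically ignored or the region runs on the host, and the result is the same.

```c
#include <assert.h>
#include <stddef.h>

/* Computes y[i] = a * x[i] + y[i] for i in [0, n). The combined
   directive offloads the loop to a target device, distributes its
   iterations across a league of teams, splits each team's share among
   its threads, and marks the loop body as vectorizable. */
void saxpy(size_t n, float a, const float *x, float *y) {
    assert(n == 0 || (x != NULL && y != NULL));
    #pragma omp target teams distribute parallel for simd \
            map(to: x[0:n]) map(tofrom: y[0:n])
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The `map` clauses describe the data movement between host and device memory that a CUDA or OpenCL programmer would have to write out by hand.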

As of this writing, compiler writers are still implementing support for the OpenMP 4 specification, so this technology is far less mature than OpenACC, which, as we saw above, trails CUDA and OpenCL in performance. (At one time, there was talk of OpenACC and OpenMP merging, but this seems unlikely, according to those at SC14.) When mature, OpenMP has the potential to provide everything one needs to exploit parallelism on CPU and accelerator cores, making it a strong candidate for balancing performance with productivity.

Closing Thoughts

All of these technologies bear watching in the future, as they continue to evolve and mature. While CUDA and OpenCL are a far cry from assembly, the need to be knowledgeable about the hardware details of one’s accelerator in order to achieve optimal (or even good) performance makes these technologies feel much closer to the hardware than a typical high-level language. In terms of productivity, OpenACC seems promising, but it still has work to do to close the performance gap. OpenMP seems even more promising, but has even further to go before it is ready for use.

In the meantime, CS educators face the interesting problem of deciding what technologies to use in the classroom to expose their students to accelerator-based computing. This brings me to these questions for those who have read this far:

Since accelerators are increasingly common and affordable, what software technology (or technologies) are you using to teach students about programming accelerators, and why did you choose it?
Where in your curriculum (i.e., which courses, at what level) are you teaching students about accelerators and their programming?
What technologies should we be using to best prepare our students for the future?

I look forward to hearing your views!