No, Maybe, Yes, Obviously: Telling the Future the Past

If you are in academia, you know that the semester’s end is drawing nigh, with examination preparation and determination of final grades as certain as the changing seasons. Whether a student or a faculty member, you also know the familiar ritual in which old examinations are made available to help students prepare for the upcoming ones.

When I was at the University of Illinois, I always shared these examinations with one important proviso — the types of questions do not change, but the answers do. Why, you might ask, did I offer a Zen koan as examination guidance? Though there is enlightenment to be found in meditations on scoreboarding or MESI protocols, my real motivation was to encourage students to think about the interplay of shifting technologies and system optimization.

At any given time, an effective computer system design is determined by a judicious combination of component performance, capacity and cost. The ratios among these quantities for computing, networking and storage components determine what is effective at any point. Often these ratios evolve smoothly, but in other cases there are discontinuities due to technology transitions (e.g., from core memory to DRAM). (See Simple HPC Wins — Usually.)

In high-performance computing, we have seen such transitions many times, as vector supercomputers were supplanted by symmetric multiprocessors (SMPs) and then by commodity clusters, which were more recently augmented by GPUs. Each of these disruptive discontinuities brings community challenges. Indeed, the Kübler-Ross model of the stages of grief is sometimes apt (denial, anger, bargaining, depression and acceptance), with many analogs in the culture of technology change.

I was reminded of this recently when sorting some old boxes of academic papers. I unearthed the reviews for a hardware proposal that we wrote in 2000 when I was director of NCSA, working with our Alliance Partnership for Advanced Computational Infrastructure (PACI).

Succinctly, in 2000, we proposed to transition from an array of SMPs to commodity clusters based on Linux, deploying a system with much higher peak performance than had been proposed before. By that time, commodity processors had become inexpensive enough and powerful enough that, when coupled with an interconnect such as Myrinet, they made high-performance computing at a different scale economically possible. However, this was a major culture shift, with new programming models based on MPI and a different model of community software support.

The proposal was rejected, and one of the reviews summarized why:

The main risk in this proposal is that commodity cluster computing on the scale proposed might not be able to deliver high-quality service to a user community. The actual usability of the machine for scientific breakthrough computing is somewhat open, as it probably depends very highly on message passing programming, and some (maybe many) university research codes are not yet adapted to message passing. I would [rate] the risk as substantial that rather than being a highly productive scientific machine, the whole project will get diverted into a giant R&D project for cluster computing.

The reviewer was right; it was a major inflection point in approach, architecture and programming model, and there were real risks. Of course, there were also risks in not embracing a major technology change.

Fortunately, this story has a happy ending. Just a few months later, we successfully deployed two 1-teraflop commodity clusters for national production use, and we and the community never looked back. Today, with the exception of GPU extensions, arguably almost all international high-performance computing systems are based on some variant of this commodity design. Our community moved through a series of perspectives, from "no, this cannot work," through doubt, to acceptance as the conventional wisdom.

The moral (theorem) is clear: it is better to be right too early than to be wrong too late, but one must also assist in the cultural transition from "no" through "probably not" and "well, maybe" to "yes" and "well, obviously." Failure to do so can have big consequences, technically and socially.

There is a corollary worth considering as well. What is today’s technology inflection point in trans-petascale and exascale computing? I posit that it centers on low-power designs derived from mobile devices.