I took my first parallel programming class in graduate school, approximately 20 years ago. Having attended a fairly well-funded and well-equipped graduate program, my fellow students and I had at our disposal a shared-memory multiprocessor system we could use for experiments. Moreover, we had access to a Cray supercomputer at a nearby supercomputing center. The tools available at the time for parallel programming amounted to auto-vectorizing compilers, some vendor-specific performance libraries, and non-standard threading application programming interfaces. Much of the hardware we used was relatively unstable and the tools buggy. It was a daunting task, though we were merely asked to develop parallel versions of simple algorithms, like sorts. The zeitgeist was one of experimentation with immature programming models on cold-war-fueled supercomputer innovation.
How things have changed. While some of the functional components we experimented with were quite large, they did not approach the vast complexity faced by today's developers. Outside the rarefied air of research labs (academic or otherwise), sequential microprocessors made up the vast majority of computers in use. In the two decades since, however, mainstream microprocessors have abruptly evolved into increasingly (and overtly) parallel devices.
Calling modern applications complex does not quite do justice to the state of affairs. It may surprise some readers to learn that hardware vendors often have insight into the architectures of a broad range of applications. For Intel, this spans a fairly large variety of market segments. The reason, of course, is that hardware platform companies have a vested interest in ensuring software runs well (running best is the goal) on their products.
Applications that span hundreds of thousands or millions of lines of code are the norm. The use of externally sourced libraries is commonplace. Many application frameworks are implemented at such a level of abstraction that the effective control paths and data dependencies are, for all intents and purposes, opaque to the compilers used to build the applications (an interesting analysis is presented in Mitchell et al. [3]). Applications often simultaneously deploy what are viewed as distinct functional idioms in application development. Manipulation of dynamic and persistent databases, functionality implemented via remote servers or peers, event-based programming, graphical processing, other compute-intensive processing (of many flavors), and more occur in many modern client applications (that is, applications you might be running on your laptop).
Productivity is one of the primary drivers of software development technology, usually even more of a factor than performance. For many market segments, time to market or deployment is the biggest influencer of tool use. Productivity is also another way to quantify development costs. It is because of this emphasis on productivity that we are not at a plateau; instead, the trend toward increasing software abstraction and heterogeneity to manage complexity is accelerating.
And that is the challenge that hardware and software developers face in the age of mainstream highly parallel processing. Before diving into that, I will take a diversion into the trends driving increasing (software-exposed) parallelism in hardware.
The oft-cited Moore's Law has been remarkably accurate at predicting the macro trends in transistor scaling on silicon manufacturing processes for the past four decades. In a nutshell, transistor densities double every two years, leading indirectly to a doubling of performance for the same power budget. For the most part, this performance improvement was manifested for single-threaded applications running on microprocessors. In this era, new features in microprocessors were evaluated by how effectively they traded area for performance.
However, in the last five years or so, semiconductor manufacturers ran into a power wall: essentially, a slowing of the power-scaling trend. In another nutshell, whereas power density per unit area of silicon had been roughly flat from generation to generation, it began increasing somewhat. There are several physical reasons for this (some discussed in Borkar [1]), but one manifestation was a slowing of voltage-scaling trends (factoring in frequency, power is effectively cubic in voltage). So, in this era, new performance-oriented features in microprocessors are largely evaluated by how they use the power budget for performance.
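The parenthetical claim that power is effectively cubic in voltage follows from the standard first-order model of CMOS switching power. A minimal sketch, assuming P = C·V²·f and that achievable frequency scales roughly linearly with voltage (an illustrative simplification, not a detailed circuit model):

```python
# First-order dynamic (switching) power model for CMOS: P = C * V^2 * f.
# Illustrative assumption: achievable frequency scales roughly linearly with
# voltage (f = k * V), which makes power effectively cubic in voltage.

def dynamic_power(capacitance, voltage, freq_per_volt):
    """Switching power under the first-order model, with f proportional to V."""
    frequency = freq_per_volt * voltage
    return capacitance * voltage ** 2 * frequency

# Halving voltage (and, with it, frequency) cuts power by 8x under this model,
# which is why voltage scaling was such a powerful lever while it lasted.
print(dynamic_power(1.0, 1.0, 1.0) / dynamic_power(1.0, 0.5, 1.0))  # -> 8.0
```

When voltage scaling slows, this cubic lever disappears, which is exactly why architects turned to the power-efficient features described next.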
Examples of power-efficient performance features include simpler cores that increase IPC (instructions per cycle); less power-intensive latency-hiding techniques, such as simultaneous multithreading (rather than deeper, more speculative execution); a wider execution payload per instruction (as in vector instructions, like Intel's Streaming SIMD Extensions, or SSE); and more cores per die. (To be sure, microprocessor vendors are also adding features that enable many other important capabilities, like improved manageability, virtualization, and so on.) The basic impact of these performance features is to use software-exposed parallelism to drive much of the performance improvement. For example, not utilizing SSE instructions and multiple cores in a quad-core microprocessor leaves over 90% of the peak floating-point performance on the floor. Simply stated: the trend toward software-exposed parallelism is also accelerating (at the exponential pace of Moore's Law).
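The "over 90%" figure can be checked with simple arithmetic. A minimal sketch, assuming a quad-core processor with 128-bit SSE registers, each holding four single-precision floats:

```python
# Software-exposed single-precision parallelism of a quad-core with 128-bit SSE.
cores = 4
sse_register_bits = 128
float_bits = 32                                    # single precision
lanes_per_core = sse_register_bits // float_bits   # 4 floats per SSE register

peak_parallel_ops = cores * lanes_per_core         # 16-way parallelism
scalar_single_core_ops = 1                         # no SSE, one core

unused = 1 - scalar_single_core_ops / peak_parallel_ops
print(peak_parallel_ops)   # -> 16
print(f"{unused:.0%}")     # -> 94%
```

Scalar code on one core exercises just 1 of 16 single-precision lanes, leaving roughly 94% of peak on the floor.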
For the last three decades, software development has largely evolved to improve productivity while hardware development has largely evolved to transparently deliver sustained performance improvements to this software. This has resulted in a divergence that must be reconciled. Many of the productivity-driven software development trends are either at odds with hardware performance trends or are outpacing the abilities of tools and various frameworks to adapt. If this seems like a very hardware-centric way to look at things, I will restate this from a software developer's perspective: microprocessor architecture is evolving in a direction that existing software components will be hard-pressed to leverage.
This may seem like an overly bleak outlook; it is intended to be a reality check. It is almost certain that software developers will adapt to parallelism incrementally. It is equally certain that the physics of semiconductor manufacturing will not change in the coming years.
I have been on both sides of the discussion between software and hardware vendors. Software vendors demand improved performance for their applications through hardware and tool enhancements (aka "the free lunch"). Hardware vendors ask software vendors to make somewhat risky (from a productivity and adoption point of view) efforts to tap new performance opportunities.
Most of the time, a middle path between these perspectives is taken, wherein software vendors incrementally adopt performance features while hardware vendors enhance tools and provide consultative engineering support to ensure this happens. The end result is, and will be, that a complete refactoring of applications takes a longer, more gradual road. This middle road may be all you can hope for with applications that do not evolve significantly in terms of either usage modes or data intensiveness.
However, for many applications there should be a long list of features that are enabled or enhanced by the parallelism in hardware. For those developers, there is a better, though still risky, option: Embrace parallelism now and architect your software (and the components used therein) to anticipate very high degrees of parallelism.
The software architect and engineering manager need only look at published hardware roadmaps and extrapolate forward a few years to justify this. Examining Intel's public roadmap alone, we get a very concrete sense of what is happening. The width of vector instructions is going from 128 bits (SSE) to 256 bits (AVX [2]) to 512 bits (Larrabee [4]) within a two-year period. The richness of the vector instructions is increasing significantly as well. Core counts have gone from one to two to four to six within a couple of years. And two- and four-way simultaneous multithreading is back. Considering a quad-core processor that has been shipping for over a year, the combined software-exposed parallelism (in the form of multiple cores and SSE instructions) is 16-way (in terms of single-precision floating-point operations) in a single-socket system (for example, desktop personal computers). Leaving this performance "on the floor" diminishes a key ingredient in the cycle of hardware and software improvements and upgrades that has so greatly benefited the computing ecosystem.
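The roadmap numbers above translate directly into single-precision lanes per instruction. A small sketch of the arithmetic (vector widths as cited in the text; lane counts assume 32-bit single-precision operands):

```python
# Single-precision lanes per vector instruction as register width grows.
for name, bits in [("SSE", 128), ("AVX", 256), ("Larrabee", 512)]:
    print(f"{name}: {bits // 32} lanes")

# Exposed parallelism is multiplicative across cores and vector lanes:
cores, vector_bits = 4, 128          # the quad-core-with-SSE case in the text
combined = cores * (vector_bits // 32)
print(combined)                      # -> 16-way single-precision parallelism
```

Extrapolating the same multiplication across wider vectors and higher core counts is what makes architecting for very high degrees of parallelism now, rather than later, the rational bet.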
There is another rationalization I have observed in certain market segments. In some application spaces, performance translates more directly to perceived end-user experience today. Gaming is a good example of this. Increasing model complexity, graphics resolution, better game-play (influenced by game AI), and improved physical simulation are often enabled directly by increasing performance headroom in the platform. In application domains like this, the first-movers often enjoy a significant advantage over their competitors (thus rationalizing the risk).
I do not mean to absolve hardware and tool vendors from responsibility. But, as I previously mentioned, hardware vendors tend to understand the requirements from the examples that software developers provide. Tool vendors are actively working to provide parallel programming tools. However, this work is somewhat hampered by history. Parallel programming in the mainstream is relatively new, while many of the tools and accumulated knowledge were informed by niche uses. In fact, the usage models (for example, scientific computing) that drive parallel computing are not all that different from the programs I was looking at 20 years ago in that parallel programming class. Re-architecting software now for scalability onto (what appears to be) a highly parallel processor roadmap for the foreseeable future will accelerate the assistance that hardware and tool vendors can provide.
2. Intel. Intel AVX (April 2, 2008); http://softwareprojects.intel.com/avx/.
3. Mitchell, N., Sevitsky, G., and Srinivasan, H. The Diary of a Datum: Modeling Runtime Complexity in Framework-Based Applications. (2007); http://domino.research.ibm.com/comm/research_people.nsf/pages/sevitsky.pubs.html/$FILE/diary%20talk.pdf.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2009 ACM, Inc.