
On the Model of Computation: Counterpoint: Parallel Programming Wall and Multicore Software Spiral: Denial Hence Crisis


Background. The software spiral (SWS) for single-core CPUs and the RAM algorithmic model. Andy Grove (Intel's business leader until 2004) coined the term "software spiral" for the exceptionally resilient business model behind general-purpose CPUs. Application software is the defining component of the SWS: code written once could still benefit from the performance scaling of later CPU generations. The SWS comprises several abstraction levels. The random access machine (RAM) model is the most relevant for the current Counterpoint Viewpoint (CPV): each serial step of an algorithm features a basic operation taking unit time (the "uniform cost" criterion). The RAM has long been the gold standard for algorithms and data structures. Its salient aspects include: its simplicity; importing from mathematics its own gold standard, mathematical induction, for describing algorithms (or their imperative programming code) and proving their correctness; and good-enough support by computer systems based on the von Neumann architecture. Other abstraction levels and means included a variety of benchmark suites guiding balanced performance over a range of tasks, software compatibility, object or binary code compatibility, operating systems, and a variety of standards; see, for example, Figure 1.8 on functional requirements in Hennessy and Patterson.5 The single-core CPU business became synonymous with making every effort to advance architectures and optimizing compilers to keep this SWS on track, making it, as well as the RAM, exceptionally resilient and a clear role model for the road ahead.
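To make the RAM's two pillars concrete, here is a minimal C sketch (my illustration, not from the original column) of a unit-cost loop whose correctness argument is an induction on the loop index:

```c
#include <stdio.h>

/* Under the RAM's uniform-cost criterion, each basic operation below
   takes unit time, so the loop runs in O(n) time. */
int array_sum(const int *a, int n) {
    int s = 0;
    /* Induction hypothesis: before iteration i, s == a[0] + ... + a[i-1].
       Base case (i == 0): s == 0, the empty sum.
       Inductive step: adding a[i] re-establishes the hypothesis for i+1. */
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;  /* On exit i == n, so s is the full sum: correct by induction. */
}

int main(void) {
    int a[] = {3, 1, 4, 1, 5};
    printf("%d\n", array_sum(a, 5));  /* prints 14 */
    return 0;
}
```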

There Is More to a Lead Model of Computation Than Specialized Efficiencies. The Point Viewpoint (PV) makes a strong case for optimizations based on quantifiable costs at the hardware level. This CPV concurs with applying the PECM model of computation the PV proposes to specialized routines whose use in workloads merits it, as well as to accelerators. However, the foremost problem of today's multicore parallelism is a dearth of programmers, since programming such systems is simply too difficult. Imposing the optimizations PECM implies on programmers is unlikely to bring programmers back, which qualifies its applicability. I am also not aware of demonstrated success of PECM for general-purpose programming. Using computer architecture lingo, the upshot of this paragraph is that architects of manycore platforms must recognize not only the so-called "memory wall" and "energy wall," but also a "parallel programming wall."

A Broader Perspective. This CPV demonstrates how to approach the debate on a model of computation from a different premise: learn from the traditional business model of general-purpose single-core CPUs, the aforementioned SWS. The SWS enabled the remarkable success of such CPUs, arguably the biggest success story of the founders' generation of the whole information technology sector. The SWS provided superb sustainability and resilience, allowing CPU applications to grow from a handful of scientific applications to today's ubiquity, as seen in desktop, laptop, server, and smartphone apps. However, current exploitation of parallelism for general-purpose application performance falls far behind the exploitation of single-core CPU performance. It is time to recognize the source of this crisis: after nearly two decades of multicore CPU domination, the SWS has stagnated, failing to extend to general-purpose multicore CPUs. CPU vendors have simply gone AWOL on this matter. The lack of cost-effective means to exploit parallelism is at the heart of the problem: most application programmers and their employers simply stay away from even trying to program today's multicores, or domain-specific accelerators such as GPUs, for parallelism. This CPV aspires to rectify this failure; namely, to seek to instate a multicore SWS (MSWS) for general-purpose CPUs.

Cost-Effective Programming Is Critical Even for Performance Programming. A product must accommodate its customers, and multicore CPUs are no different. An MSWS will need to appeal to a wide range of programmers, whose software will then make it appealing to application users at large. For sustainability and resilience, a new MSWS will need to provide cost-effective programming and, as the MSWS takes hold, backward compatibility for prior MSWS application software. These features are necessary to drive investment in new and current applications. The RAM model of computation played a leading role in enabling easy programming for performance in serial computing, so it is only natural to turn to the PRAM model, its parallel counterpart, for such programming. In comparing the PRAM to PECM, the primacy of cost-effectiveness and appeal to diverse programmers tilt the scales toward the PRAM. PECM's over-reliance on specialized efficiencies would make current systems even more brittle, exacerbating current predicaments rather than curing them. This CPV will explain how the PRAM, coupled with properly designed architecture and optimizing compilers, could sustain backward compatibility in a future MSWS. For instance, the type of optimization for locality used for matrix multiplication on GPUs, which is currently considered unavoidable performance algorithms and programming, would instead be characterized as "compiler algorithms." The point is that, with few exceptions, application programmers must not be responsible for such optimizations if a manycore is to reach broad use. Instead, once developed, such optimizations, as well as PECM-type optimizations, belong in a compiler; the sketch below illustrates one of them.
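As an illustration of the kind of locality optimization meant here, the following C sketch shows blocked (tiled) matrix multiplication; the dimensions and tile size are illustrative assumptions, and the CPV's point is that a compiler, not the application programmer, could derive this form from the naive triple loop:

```c
#include <stddef.h>

#define N 512
#define TILE 32  /* illustrative tile size; a real compiler would pick it per target */

/* Blocked matrix multiplication, C = A * B, with C zero-initialized by the
   caller. The tiling reorders the loops so each TILE x TILE block is reused
   while it is still in cache. The arithmetic is identical to the naive
   triple loop; only the traversal order changes. */
void matmul_tiled(const double A[N][N], const double B[N][N], double C[N][N]) {
    for (size_t ii = 0; ii < N; ii += TILE)
        for (size_t kk = 0; kk < N; kk += TILE)
            for (size_t jj = 0; jj < N; jj += TILE)
                /* Multiply one pair of tiles into the corresponding C tile. */
                for (size_t i = ii; i < ii + TILE; i++)
                    for (size_t k = kk; k < kk + TILE; k++)
                        for (size_t j = jj; j < jj + TILE; j++)
                            C[i][j] += A[i][k] * B[k][j];
}
```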

Multicore Software Spiral: Industry Aspiration and Current Crisis. Around 2005, once CPU clock-rate acceleration started slowing down, parallelism became the dominant source of performance improvement for application developers and innovators. But what happened to the SWS? In a 2008 interview,3 Patrick Gelsinger (then CTO of Intel Corporation and its current CEO) was asked whether the SWS concept still holds, with multicore chips already available but without a lot of software that can run on more than one processor at a time. He answered that he sees the move to multicore as a new turn in the spiral. Unfortunately, the premise of the question remains as valid today: not much software exploits parallel processing on multicores. Also, software written for yesterday's multicore may need to be performance-tuned for today's multicore. Thus, the turn in the spiral is yet to happen. This widening MSWS crisis is at the very core of computer science and engineering; it must be widely recognized and addressed. Multicores are on every desk and laptop, yet the programming and business barriers to multicore software utilization preempt bringing enough programmers on board.


The foremost problem of today's multicore parallelism is a dearth of programmers, since programming such systems is simply too difficult.


The Original Rationale for PRAM. Since the transition to multicore CPUs, the question remains: which computational model should replace the RAM as the lead model? The PV proposes the parallel explicit communication model (PECM), though without reference to an MSWS. In contrast, this CPV suggests that the PRAM model should lead the way out of the MSWS crisis. In the PRAM abstraction, each serial step of an algorithm comprises an unlimited number of basic operations, all assumed to execute concurrently in unit time. The original motivation for the PRAM was the realization that serial algorithm knowledge coupled with compilers alone would not suffice for effective use of parallel processing; the PRAM addressed that while retaining as much of the simplicity of the RAM as possible. The idea was that programmers express all the parallelism they see, but nothing else is added to the RAM. The synchronous lockstep abstraction also retained the tight reliance on mathematical induction. In contrast, many threaded models of parallelism abandoned the latter, leading Lee7 to his stern warning concerning their utility.
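For readers unfamiliar with the abstraction, the following serially simulated C sketch (an illustration, not XMT code) conveys the PRAM outlook on a textbook balanced-tree summation: each inner loop stands for one lockstep parallel step, so n numbers are summed in O(log n) such steps:

```c
#include <stdio.h>

#define LOGN 4
#define N (1 << LOGN)  /* n = 16 */

/* Serial simulation of a classic PRAM algorithm: balanced-tree summation.
   In the PRAM abstraction, ALL iterations of each inner loop below form a
   single unit-time parallel step, one conceptual processor per iteration;
   the serial for-loop merely simulates their lockstep execution. */
int main(void) {
    int a[N];
    for (int i = 0; i < N; i++) a[i] = 1;

    for (int d = 0; d < LOGN; d++) {
        int stride = 1 << (d + 1);
        /* PRAM step d: pairwise-combine partial sums that are stride/2 apart. */
        for (int i = 0; i < N; i += stride)
            a[i] += a[i + stride / 2];
    }
    printf("%d\n", a[0]);  /* prints 16 after log2(16) = 4 parallel steps */
    return 0;
}
```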

PRAM Has What It Takes to Overcome the MSWS Crisis. Envisioning a manycore CPU on chip within a decade from the late 1990s, a vertically integrated computer framework called XMT, or PRAM-On-Chip,11 was developed and thoroughly prototyped at the University of Maryland (UMD). Prior well-cited papers, such as Culler et al.,2 claimed impossibility of ever supporting the PRAM due to technology limitations. XMT aimed to refute them directly by providing effective support for the PRAM model. XMT started by enhancing serial and parallel hardware designs with several low-overhead hardware primitives and system software. The main effort was to bridge the considerable tension between: the need for multithreaded operation of a fixed number of hardware contexts, limiting synchrony among them to limit power and take advantage of locality; and the PRAM ideal (or illusion) comprising lockstep unit-time access to shared memory, including highly irregular pointer-based access, and (possibly drastically) changing amounts of parallelism from step to step. The XMT solution included hardware-supported prefix-sum and dynamic load balancing, high-bandwidth, low-latency on-chip interconnection networks, integrated parallel and serial memories and serial/parallel mode transitions, as well as refined notions of locality- and predictability-of-reference,9,12 dodging the need for system-enforced cache coherence, a scalable architecture (toward future backward compatibility of the new MSWS), and gradual development of an optimizing compiler. XMT incorporated a variety of prefetching techniques as well as loop unrolling, relying on both hardware and software methods to hide latencies and improve locality. Each compiler "generation" required manual tuning of PRAM algorithms, followed by having that tuning automated in the next generation of the compiler. Optimizing a measure called LSRTM, the length of the sequence of round trips to memory, has been a prime target of the design. Hardware prototyping (commitment to silicon using FPGA) and software prototyping demonstrated competitive speedups over contemporary off-the-shelf multicores and GPUs. In view of the PV, the power-consumption demonstration of Keceli et al.6 is specifically noted. The main target for competitive speedups was irregular problems, where memory access is data dependent and cannot be fully predicted ahead of runtime. (This is in contrast to when memory access is data independent, as in standard matrix multiplication, where optimizing a parallel algorithm per the PECM model is more likely to make a difference, in line with the PV.) The multidecade XMT effort culminated in Ghanim et al.,4 which demonstrated supporting PRAM algorithms (and the PRAM illusion, or virtual model) as-is, with no additional programming effort. It is also worth noting that recent applications of locality-sensitive hashing for performance improvements of some salient deep learning tasks on multicore CPUs over GPUs1 suggest a greater arsenal of potential techniques than pursued by XMT for general-purpose multicore CPUs. The sketch below illustrates the prefix-sum primitive at the heart of XMT's dynamic load balancing.
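As a rough illustration of that prefix-sum primitive (a serial C simulation; actual XMTC syntax, with its spawn/join and ps constructs, differs), consider array compaction, an irregular task in which each virtual thread claims a unique output slot via an atomic fetch-and-increment:

```c
#include <stdio.h>

#define N 8

/* Array compaction: copy the nonzero elements of in[] into a dense out[].
   On XMT, each index i would be a virtual thread, and the prefix-sum (ps)
   instruction would atomically return the old counter value and increment
   it in a few machine cycles, balancing load without locks. The serial
   loop below only simulates those semantics. */
int main(void) {
    int in[N]  = {0, 7, 0, 3, 0, 0, 9, 2};
    int out[N];
    int counter = 0;  /* the shared prefix-sum base */

    for (int i = 0; i < N; i++) {      /* stands in for N virtual threads */
        if (in[i] != 0) {
            int slot = counter;        /* simulated ps: fetch-and-increment */
            counter += 1;
            out[slot] = in[i];         /* each thread gets a unique slot */
        }
    }

    for (int i = 0; i < counter; i++) printf("%d ", out[i]);  /* 7 3 9 2 */
    printf("\n");
    return 0;
}
```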

Ease of Programming Superlatives. 1. The PRAM's parallel algorithmic gravitas. The PRAM model has been challenged for around four decades by numerous competing parallel algorithmic models. Still, none of its competitors ended up reaching the magnitude and breadth of its parallel algorithmic knowledge base and techniques, on which the PRAM is second only to the serial RAM model. 2. The XMT/PRAM broadening of access to parallel programming education. First, an example directly benefiting from PRAM support: nearly a thousand students at a single high school, the Thomas Jefferson High School, Alexandria, VA, programmed XMT over the last 12 years, freeing them to pursue problems beyond "embarrassingly parallel" ones, in contrast to the programming of off-the-shelf systems.8 Cost-effective broadening of parallel programming to more people, including students from middle school to graduate school, and to more challenging problems has also been demonstrated. Such broadening of participation can help overcome the MSWS crisis once industry-grade PRAM-supporting platforms become available.

A Lesson from the Commercial Success of GPUs. Prior to the rise of deep learning, the performance of GPUs on multiplication of full matrices was already ahead of much of the competition. The serendipitous application of matrix multiplication to deep learning (via a stochastic gradient descent primitive for backpropagation in neural nets) changed the fortune of GPUs, leading to market caps of GPU vendors exceeding those of CPU vendors. If our generation of IT professionals finally delivers the needed MSWS, such a parallel platform will immediately become an open invitation to an unlimited range of applications, many yet to be invented. The outcome could far exceed the impact of deep learning on GPUs, continuing the exponential SWS trajectory that flattened once the MSWS crisis commenced around 2005.


A computation model may be less negotiable than a physical technology one. It is time for vendors to come to terms with that and adjust.


Exceptionalism of Computation Models. It is intuitive to regard physical technology modeling as non-negotiable, as the PV does, but to treat computational models as fully negotiable manmade artifacts. Indeed, over the four decades of my involvement in parallel computing research, claims that the RAM and PRAM models are "unrealistic," coupled with "more accurate" alternative models, flourished in papers, producing, by my estimate, many more publications than strictly PRAM ones. These alternative models offered publication opportunities to publication-hungry academics precisely because their modeling forced complex solutions even for simple problems, supporting the type of "depth" and "originality" claims needed to merit publication. However, these models ran into applicability challenges once significant applications and more complex problems had to be solved. Still, 40 years later, the PRAM remains the primary target for similar attacks. I believe withstanding so many attacks over so many years is powerful attestation to unique robustness. This may support a perhaps surprising conclusion: namely, that a computation model may be less negotiable than a physical technology one. After 40 years, the PRAM, with (the centuries-old) mathematical induction by its side, remains unchallenged. It is time for architects and vendors to come to terms with that, as the term "parallel programming wall" introduced earlier in this CPV suggests, and adjust.

MSWS Merits Government Funding. The U.S. government is in the process of awarding $52 billion to support U.S.-based semiconductor research and production. A small fraction of that funding level should suffice to kickstart an MSWS. Even targeting industry-grade hardware and software would be much less expensive; it would also be a better deal. The resulting gold mine of application opportunities could prompt vendors and investors to then fund semiconductors on their own. Overall, a better outcome for the U.S. economy at lower cost to the government.

    1. Chen, B. et al. SLIDE: In defense of smart algorithms over hardware acceleration for large scale deep learning systems. In Proceedings of the Conference on Machine Learning and Systems (MLSys) (2020).

    2. Culler, D. et al. LogP: Towards a realistic model of parallel computation. In Proceedings of the Fourth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (1993), 1–12.

    3. Gelsinger, P. Interview. Network World (2008); https://bit.ly/3AOsy8N

    4. Ghanim, F., Barua, R., and Vishkin, U. Easy PRAM-based high-performance parallel programming with ICE. IEEE Transactions on Parallel and Distributed Systems 29, 2 (2018).

    5. Hennessy, J.L. and Patterson, D.A. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 2019.

    6. Keceli, F., Moreshet, T. and Vishkin, U. Power-performance comparison of single-task driven many-cores. In Proceedings of the 17th IEEE International Conference on Parallel and Distributed Systems (IEEE ICPADS), (Tainan, Taiwan, Dec. 7–9, 2011).

    7. Lee, E.A. The problem with threads. IEEE Computer 39, 5 (May 2006), 33–42.

    8. Torbert, S. Is teaching parallel algorithmic thinking to high-school students possible? One teacher's experience. In Proceedings of the 41st ACM Technical Symposium on Computer Science Education (SIGCSE), (Milwaukee, WI, Mar. 10–13, 2010).

    9. Vishkin, U. Can parallel algorithms enhance serial implementation? Commun. ACM 39, 9 (Sept. 1996), 88–91.

    10. Vishkin, U. Algorithmic approach to designing an easy-to-program system: Can it lead to a HW-enhanced programmer's workflow add-on? In Proceedings of the International Conference on Computer Design (ICCD), Lake Tahoe, CA, October 4–7, 2009.

    11. Vishkin, U. Using simple abstraction to reinvent computing for parallelism. Commun. ACM 54, 1 (Jan. 2011), 75–85.

    12. Vishkin, U. Is multicore hardware for general-purpose parallel processing broken? Commun. ACM 57, 4 (Apr. 2014), 35–39.
