Standardized interfaces play an important role in many industries; for example, a processor’s instruction set architecture (ISA) defines the interface between hardware and software. A stable, well-defined interface facilitates independent innovation on both sides of the boundary. Unchanging (or slowly changing) ISAs have meant that new programming languages, compilers, and applications can run on old computers. At the same time, new computers, with improved microarchitectures, can run existing programs faster.
Recently, however, processors have been extended with a wide range of architecturally visible features that change their ISA, for example, support for vector processing, virtual memory, cryptography, and secure execution.
When does it make sense to extend an interface and introduce a new feature that breaks backward compatibility? Clearly, with something as pervasive as an ISA, change is not to be taken lightly. Software that uses a new feature benefits only on new computers, which initially are rare, and it must emulate the feature on older machines. Changing the ISA therefore increases the complexity of both hardware and software and raises the cost of testing and error diagnosis.
The following paper proposes a modest hardware extension to support a new parallel execution model for small, non-numeric loops. Under realistic assumptions, HELIX-RC increases the performance of a handful of well-known, non-numeric benchmarks by over a factor of 6. Many programs contain similar loops and might benefit as well. Although the hardware is simple, it requires pervasive changes to the software stack because the feature is tied to a specific execution model. Existing compiled programs will not benefit from HELIX.
The authors, however, present a powerful argument for paying this price. Short, non-numeric loops are very common, but have long been considered poor candidates for parallel execution. Each of their iterations typically runs for a short time (in the benchmarks, 25 clock cycles on average, with 50% requiring 10 or fewer). Moreover, non-numeric loops commonly execute conditional branches, so their execution behavior is data dependent. Neither consideration is fatal by itself, but these loops also pass values among iterations. It is difficult to analyze these loop-carried dependencies and to generate low-overhead code that respects them.
So, in general, these loops run sequentially. In the era when computer performance doubled every two years, sequential execution was not a pressing problem. But CPU clock speeds stopped increasing over a decade ago, and multicore parallelism became the preferred way to use the additional transistors provided by Moore’s Law. Today, however, this parallelism is not widely exploited outside of datacenters; mobile phones and laptops contain only 2–4 cores.
Much of the software that runs on these platforms is non-numeric and contains loops with short, data-dependent bodies. The authors argue that our inability to parallelize these loops is a fundamental impediment to exploiting parallel hardware. Amdahl’s Law relates the parallel speedup of a program to the fraction of its execution that is sequential. In a program that spends 50% of its time in sequential loops, even with perfect parallelization of the other code, the largest possible speedup is 2. Clearly something must be done about these non-numeric loops.
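The arithmetic behind this ceiling is easy to check. A small sketch of Amdahl's Law (the function name here is my own, not from the article):

```python
def amdahl_speedup(sequential_fraction, cores):
    """Upper bound on whole-program speedup (Amdahl's Law) when
    `sequential_fraction` of the execution cannot be parallelized."""
    return 1.0 / (sequential_fraction + (1.0 - sequential_fraction) / cores)

print(amdahl_speedup(0.5, 4))      # 1.6 -- even 4 cores help little
print(amdahl_speedup(0.5, 10**9))  # just under 2, the hard ceiling
```

No matter how many cores are added, the 50% sequential portion caps the speedup at 2.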
The paper proposes a simple and elegant approach to non-numeric parallelization. The HELIX compiler identifies the sequential portions of a loop, which are the code tied to the loop-carried dependencies. These portions must execute in sequential order, and values produced by one iteration must be sent to subsequent iterations, even if they are running on another processor. Normal shared-memory communication can cost hundreds of cycles and so is too expensive for fine-grain iterations that may only run for a fraction of that time. Parallel speedup comes from executing the other portions of a loop—those not involved in the loop-carried dependencies—concurrently.
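To make the split concrete, here is a minimal sketch of such a loop (the helpers `step` and `work` are hypothetical, not from the paper). The update of `offset` is a sequential segment, since each iteration consumes the value the previous one produced, while the `work` calls have no cross-iteration dependency and could run concurrently:

```python
def step(s):   # hypothetical helper: produces the next loop-carried value
    return len(s)

def work(s):   # hypothetical helper: independent per-iteration work
    return s.upper()

def scan(items):
    offset = 0
    out = []
    for s in items:
        offset += step(s)              # sequential segment: needs prior iteration's offset
        out.append((offset, work(s)))  # parallel portion: independent across iterations
    return out

print(scan(["ab", "c"]))  # [(2, 'AB'), (3, 'C')]
```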
HELIX-RC is a simple hardware extension that reduces or eliminates communication overhead. It introduces a small amount of buffered memory for loop-carried dependencies and synchronization variables. These values are proactively sent to the processors that will execute subsequent iterations, so they will be in local memory when an iteration starts, thereby ensuring it is not delayed.
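As a rough illustration of proactive forwarding (a toy sequential simulation under my own assumptions, not the actual hardware design), imagine iterations assigned round-robin to cores, each core holding a small local buffer into which the previous core deposits the loop-carried value before the next iteration needs it:

```python
from collections import deque

class Core:
    def __init__(self):
        self.buffer = deque()  # stands in for a core's slice of buffered memory

def run_iterations(values, num_cores=4):
    cores = [Core() for _ in range(num_cores)]
    cores[0].buffer.append(0)  # seed the loop-carried value for iteration 0
    results = []
    for i, v in enumerate(values):
        me = cores[i % num_cores]
        acc = me.buffer.popleft()  # value is already local: no shared-memory stall
        acc += v                   # sequential segment runs in iteration order
        cores[(i + 1) % num_cores].buffer.append(acc)  # proactive forward
        results.append(acc)        # the iteration's independent work would overlap here
    return results

print(run_iterations([1, 2, 3, 4, 5]))  # [1, 3, 6, 10, 15]
```

The point of the sketch is only the timing argument: because the forward happens as soon as the sequential segment finishes, the consuming iteration never waits on a hundreds-of-cycles shared-memory round trip.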
HELIX-RC is not a general-purpose hardware feature like a cache or register file. Its design is tightly tied to the HELIX execution model. Clever compiler writers will no doubt find other uses for it, but until then, it is a specialized solution to one problem. Still, small improvements are sometimes the key to unlocking the value of much larger architectural features such as multicore parallelism.