General-purpose processors (GPPs), which traditionally rely on a Von Neumann-based execution model, incur burdensome power overheads, largely due to the need to dynamically extract parallelism and maintain precise state. Further, it is extremely difficult to improve their performance without increasing energy usage. Decades-old explicit-dataflow architectures eliminate many Von Neumann overheads, but have not been successful as stand-alone alternatives because of poor performance on certain workloads, due to insufficient control speculation and communication overheads.
We observe a synergy between out-of-order (OOO) and explicit-dataflow processors, whereby dynamically switching between them according to the behavior of program phases can greatly improve performance and energy efficiency. This work studies the potential of such a paradigm of heterogeneous execution models by developing a specialization engine for explicit-dataflow (SEED) and integrating it with a standard OOO core. When integrated with a dual-issue OOO, the combined design is both faster (1.33×) and dramatically more energy efficient (1.70×). Integrated with an in-order core, it becomes faster than even a dual-issue OOO, with twice the energy efficiency.
As transistor scaling trends continue to worsen, power limitations make improving the performance and energy efficiency of general-purpose processors (GPPs) ever more intractable. The status quo approach of scaling processor structures consumes too much power to be worth the marginal improvements in performance. On top of these challenges, a series of recent microarchitecture-level vulnerabilities (Meltdown and Spectre9) target the very techniques that modern processors rely on to exploit instruction-level parallelism (ILP).
Fundamental to these issues is the Von Neumann execution model adopted by modern GPPs. To make the contract between the program and the hardware simple, a Von Neumann machine logically executes instructions in the order specified by the program, and dependences are implicit through the names of storage locations (registers and memory addresses). However, this has the consequence that exploiting ILP effectively requires sophisticated techniques. Specifically, it requires (1) dynamic discovery of register/memory dependences, (2) speculative execution past unresolved control-flow instructions, and (3) maintenance of the precise program state at each dynamic instruction, should it need to be recovered (e.g., on an exception or context switch).
The above techniques are the heart of modern Von Neumann out-of-order (OOO) processors, and each requires significant hardware overhead (register renaming, instruction wakeup, reorder-buffer maintenance, speculation recovery, etc.). In addition, instruction-by-instruction execution incurs considerable energy overheads in pipeline processing (fetch, decode, commit, etc.). As for security, the Meltdown and Spectre class of vulnerabilities all exploit speculative execution in one form or another, adding another reason to seek an alternative.
Interestingly, a well-known class of architectures, explicit-dataflow (e.g., Tagged Token Dataflow,1 TRIPS,3 WaveScalar20), mitigates much of the above. Figure 1 shows the defining characteristic of this execution model: it encodes both control and data dependences explicitly, and dynamic instructions are ordered by these dependences rather than by a total order. Thus, a precise program state is not maintained at every instruction. The benefit is extremely cheap exploitation of instruction-level parallelism in hardware, because no dynamic dependence construction is required.
However, explicit-dataflow architectures show no signs of replacing conventional GPPs, for at least three reasons. First, control speculation is limited by the difficulty of implementing efficient dataflow-based squashing. Second, the latency cost of explicit data communication can be prohibitive.2 Third, compilation challenges for general workloads have proven hard to surmount.5 Although a dataflow-based execution model may help many workloads, it can also significantly hamper others.
Unexplored opportunity: What is unexplored so far is the fine-grained interleaving of explicit-dataflow with Von Neumann execution—that is, the theoretical and practical limits of being able to switch with low cost between an explicit-dataflow hardware/ISA and a Von Neumann ISA. Figure 2(a) shows a logical view of such a heterogeneous architecture, and Figure 2(b) shows the capability of this architecture to exploit fine-grain (thousands to millions of instructions) application phases. This is timely because technology trends make on-chip power more limited than area, creating “dark silicon”: portions of the chip that cannot be kept active due to power constraints. The two major implications are that energy efficiency is the key to scalable performance, and that it becomes rational to add specialized hardware that is only in use when profitable.
With such a hardware organization, many open questions arise: Are the benefits of fine-grained interleaving of execution models significant enough? How might one build a practical and small footprint dataflow engine capable of serving as an offload engine? Which types of GPP cores can get substantial benefits? Why are certain program region-types suitable for explicit-dataflow execution?
To answer these questions we make the following contributions. Most importantly, we identify (and quantify) the potential of switching between OOO and explicit-dataflow at a fine grain. Next, we develop a specialization engine for explicit-dataflow (SEED) by combining known dataflow-architecture techniques and specializing the design for the program characteristics where explicit-dataflow excels, as well as for simple and common program structures (loops and nested loops). We evaluate the benefits through a design-space exploration, integrating SEED into little (in-order), medium (OOO2), and big (OOO4) cores. Our results demonstrate large energy benefits (over 1.5×) and speedups of 1.67×, 1.33×, and 1.14× across the little, medium, and big cores, respectively. Finally, our analysis illuminates the relationship between workload properties and dataflow profitability: code with high memory parallelism, high instruction parallelism, and noncritical control is highly profitable for dataflow execution. These are common properties of many emerging workloads in machine learning and data processing.
2. Understanding Von Neumann/Dataflow Synergy
Understanding the trade-offs between a Von Neumann machine, which reorders instructions implicitly, and a dataflow machine, which executes instructions in dependence order, can be subtle. Yet, the trade-offs have profound implications. We attempt to distill the intuition and quantitative potential of a heterogeneous core as follows.
The intuitive trade-off between the two execution models is that explicit-dataflow is more easily specializable for high issue width and instruction window size (due to lack of need to discover dependences dynamically), whereas an implicit-dataflow architecture is more easily specializable for speculation (due to its maintenance of precise state of all dynamic instructions in total program order).
The performance implications can be seen in the example in Figure 3(a), which has a single control decision. In (b), we show the program instruction order for one iteration of this code, assuming the left branch is taken. Figure 3(c) shows the ideal schedule of these instructions on an idealized machine (each instruction takes one cycle). The key to the ideal execution is the reordering of control-dependent instructions before the control decision resolves, as well as the ability to execute many instructions in parallel.
A Von Neumann OOO machine has the advantage of speculative execution, but the disadvantage of complex hardware for issuing multiple instructions per cycle (issue width) when dependences are determined dynamically. Therefore, (d) shows how a dual-issue OOO takes five cycles: there is not enough issue bandwidth to issue both of the critical instructions before the third cycle.
A dataflow processor can easily be designed for high issue width because dependences are explicitly encoded in the program representation. However, we assume here that the dataflow processor does not speculate, because of the difficulty of recovery when a precise order is not maintained. Therefore, in the dataflow processor’s schedule in Figure 3(e), the control-dependent instructions must execute after the control decision resolves.
Although the example suggests the benefits of control speculation and wide issue width are similar, in practice the differences can be stark, as slight modifications to the example demonstrate. If we add several instructions to the critical path of the control decision, the OOO core can hide them through control speculation. If instead we add more parallel instructions, the explicit-dataflow processor can execute them in parallel, whereas they may be serialized on the OOO Von Neumann machine. Explicit-dataflow can also be beneficial if the branch is unpredictable, since the OOO is then serialized anyway.
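This issue-width trade-off can be made concrete with a toy list scheduler. The sketch below is an illustrative model only (the DAG, the function name, and the unit-latency/no-speculation assumptions are ours, not from the evaluated designs): with abundant independent work, schedule length shrinks as issue width grows.

```python
def schedule_length(deps, n, width):
    """Greedily list-schedule a dependence DAG of n unit-latency
    instructions on a machine that issues up to `width` instructions
    per cycle. Toy model: no speculation, no communication latency."""
    preds = {i: set() for i in range(n)}
    for producer, consumer in deps:
        preds[consumer].add(producer)
    done, cycles = set(), 0
    while len(done) < n:
        # An instruction is ready once all of its producers have executed.
        ready = [i for i in range(n) if i not in done and preds[i] <= done]
        if not ready:
            raise ValueError("cyclic dependence graph")
        for i in ready[:width]:   # issue at most `width` per cycle
            done.add(i)
        cycles += 1
    return cycles

# Four independent ops feeding two sums: a wide machine finishes in 2
# cycles, a dual-issue machine needs 3, and a single-issue machine 6.
DAG = [(0, 4), (1, 4), (2, 5), (3, 5)]
```

Running `schedule_length(DAG, 6, width)` for widths 4, 2, and 1 returns 2, 3, and 6 cycles, mirroring how the dual-issue OOO in Figure 3(d) falls behind the ideal schedule.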
A natural next question is how much benefit a heterogeneous Von Neumann/dataflow core could provide. The potential benefits of an ideal hybrid architecture (ideal dataflow + four-wide OOO) relative to a standard OOO core are shown in Figure 4(a), where each speedup bar is labeled with the percentage of execution time for which dataflow execution is profitable. Figure 4(b) shows the overall energy and performance trends for three different GPPs.
These results indicate that dataflow specialization has significant potential, up to 1.5× performance for an OOO4 GPP (2× for OOO2), as well as over 2× average energy-efficiency improvement. Furthermore, the preference for explicit-dataflow is frequent, covering around 65% of execution time, but also intermittent and application-phase dependent. The percentage of execution time in dataflow mode varies greatly, often between 20% and 80%, suggesting that phase types can exist at a fine grain inside an application.
Overall, this suggests that a heterogeneous Von Neumann/explicit-dataflow architecture with fine-granularity switching can provide significant performance improvements along with power reduction, and thus lower energy.
Remaining challenge: Although many high-performance explicit-dataflow architectures have been proposed over the last several decades, the remaining challenge is how to achieve the same benefits while avoiding a more heavyweight general-purpose explicit-dataflow engine (for example, WaveScalar20 or TRIPS,3 see Figure 5). The approach we will take is to combine known dataflow mechanisms, while simplifying and specializing for the common workload characteristics where dataflow excels.
3. SEED: An Architecture For Fine-Grain Dataflow Specialization
Based on our previous analysis and insights, there are three primary requirements for a dataflow specialization engine: (1) low area and power, so integration with the GPP is feasible; (2) enough generality to target a wide variety of workloads; and (3) achieving the benefits of dataflow execution with few overheads.
For reference, the ideal dataflow model assumed in the previous section is constrained only by the program’s control and data dependences, but retains the same memory system. Note that no nonlocal program modifications are performed (no loop reordering, tiling, or layout transforms). It is also nonspeculative and incurs latency when transferring values between control regions. For energy, only functional units and caches are considered.
First, we propose that requirement 1, low area and power, can be addressed by focusing on a common, yet simplifying case: fully-inlined loops and nested loops with a limited total static instruction count. This helps limit the size of the dataflow tags and eliminates the need for an instruction cache; both of which reduce hardware complexity. In addition, ignoring recursive regions and only allowing in-flight instructions from a single context eliminates the need for tag matching hardware. Targeting nested-loops also satisfies requirement 2: these regions can cover a majority of real applications’ dynamic instructions.
For low-overhead dataflow execution (requirement 3), communication costs must be lowered while maintaining parallelism. For this, we first use a distributed-issue architecture, which enables high instruction throughput with low-ported RAM structures. Second, we use a multibus network to sustain instruction-communication throughput at low latency. Third, we use compound instructions to reduce communication overhead. The proposed design is SEED: the specialization engine for explicit-dataflow, shown at a high level in Figure 6 and explained next.
Adaptive execution: To adaptively apply explicit-dataflow specialization, we use a technique similar to big.LITTLE, except that we restrict the entry points of specializable regions to fully-inlined loops or nested loops. This simplifies integration with a different ISA. Targeting longer nested-loop regions reduces the cost of configuration and GPP core synchronization.
GPP integration: SEED uses the same cache hierarchy as the GPP, which facilitates fast switching (no data-copying through memory) and preserves cache coherence. SEED adds architectural state, which must be maintained at context switches. Lastly, functional units (FUs) could be shared with the GPP to save area (by adding bypass paths); this work considers stand-alone FUs.
SEED’s execution model closely resembles prior dataflow architectures, but is restricted for loops/nested-loops, and adds the use of compound instructions.
We use a running example to aid explanation: a simple linked-list traversal where a conditional computation is performed at each node. Figure 7(a) shows the original program, (b) the Von Neumann control flow graph (CFG) representation, and (c) SEED’s explicit-dataflow representation.
Data-dependence: Similar to other dataflow representations, SEED programs follow the dataflow firing rule: instructions execute when their operands are ready. To initiate computation, live-in values are sent from the host. During dataflow execution, each instruction forwards its outputs to dependent instructions, either in the same iteration (solid lines in Figure 7(c)) or in a subsequent iteration (dotted lines). For example, the a_next value loaded from memory is passed on to the next iteration for address computation.
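The firing rule can be sketched as a tiny interpreter. This is a hypothetical model, not the hardware implementation: the instruction names are invented, and we assume unit latency and unbounded issue, unlike the real SEED units.

```python
def dataflow_eval(insts, live_ins):
    """Minimal interpreter for the dataflow firing rule: an instruction
    fires as soon as all of its operands are available.
    insts: list of (dest, function, [source names]).
    Returns (final values, cycle in which each instruction fired)."""
    values = dict(live_ins)          # live-ins arrive from the host
    fired, cycle = {}, 0
    pending = list(insts)
    while pending:
        ready = [i for i in pending if all(s in values for s in i[2])]
        if not ready:
            raise RuntimeError("deadlock: unsatisfiable dependence")
        for dest, fn, srcs in ready:
            values[dest] = fn(*(values[s] for s in srcs))
            fired[dest] = cycle
        pending = [i for i in pending if i[0] not in fired]
        cycle += 1
    return values, fired

# t1 and t2 have no mutual dependence, so both fire in cycle 0;
# t3 waits for both operands and fires in cycle 1.
prog = [("t1", lambda a, b: a + b, ["a", "b"]),
        ("t2", lambda c, d: c * d, ["c", "d"]),
        ("t3", lambda x, y: x - y, ["t1", "t2"])]
```

Note that no program counter appears anywhere: execution order emerges entirely from operand availability, which is the source of the cheap parallelism described above.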
Control-flow strategy: Control dependences between instructions are converted into data dependences. SEED uses a switch instruction, which forwards a value to one of two possible destinations depending on the input control signal. In the example, depending on the branch outcome, v2 is forwarded to either the if or the else branch. This strategy enables control-equivalent regions to execute in parallel.
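The routing behavior of a switch instruction can be modeled as follows. This is a toy sketch: `None` stands in for “no token sent,” and the two branch-side computations are made up for illustration, not taken from the example program.

```python
def switch(pred, value):
    """Dataflow 'switch' primitive: route `value` to exactly one of two
    destinations based on the control signal `pred`. The untaken side
    receives no token (modeled here as None)."""
    return (value, None) if pred else (None, value)

def run_region(pred, v2):
    """Both branch bodies exist in the dataflow graph, but only the
    side that receives a token fires."""
    if_val, else_val = switch(pred, v2)
    if if_val is not None:
        return if_val * 2        # hypothetical 'if'-side computation
    return else_val + 1          # hypothetical 'else'-side computation
```

Because the control dependence is now just another data dependence, instructions that do not consume the switched value need not wait for the branch at all.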
Enforcing memory-ordering: SEED uses a primarily software approach to enforce memory ordering. When the compiler identifies dependent (or aliasing) memory instructions, it serializes them through explicit tokens. In this example, the stores of n_val can conflict with the load in the next iteration (e.g., when the linked list contains a loop), and therefore memory-dependence edges are added.
Executing compound instructions: To mitigate communication overheads, the compiler groups primitive instructions (e.g., adds, shifts, switches, etc.) into subgraphs and executes them on compound functional units (CFUs). These are logically executed atomically. The example program contains four subgraphs, mapped to two CFUs.
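A greedy sketch of this subgraph formation is below, supplied for intuition only: the actual compiler solves this with integer linear programming (Section 4), and the fusion rule here (fuse a consumer into its producer’s group only when the producer has no other consumer) is our simplification that guarantees fusion never serializes independent work.

```python
def group_compounds(deps, n, max_size=5):
    """Greedy stand-in for compound-instruction formation. Instructions
    are assumed numbered in topological order; `max_size` models CFU
    capacity (2-5 primitive ops in the SEED design)."""
    consumers = {i: [] for i in range(n)}
    producers = {i: [] for i in range(n)}
    for p, c in deps:
        consumers[p].append(c)
        producers[c].append(p)
    groups, group_of = [], {}
    for i in range(n):
        target = None
        # Join a producer's group only if i is its sole consumer.
        for p in producers[i]:
            if consumers[p] == [i] and len(groups[group_of[p]]) < max_size:
                target = group_of[p]
                break
        if target is None:
            groups.append([])
            target = len(groups) - 1
        groups[target].append(i)
        group_of[i] = target
    return groups
```

A straight producer-consumer chain fuses into a single compound instruction, while a fan-out node stays unfused, matching the intuition that fusing past a fan-out could lengthen the critical path.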
SEED achieves high instruction parallelism and simplicity by using eight distributed computation units. Each of these SEED units is organized around one CFU, and units communicate together over a network, as shown in Figure 6.
Compound functional unit (CFU): CFUs are composed of a fixed network of primitive FUs (adders, multipliers, logical units, switch units, etc.); unused portions of the CFU are simply bypassed. Long-latency instructions (e.g., loads) can be buffered and passed by subsequent instructions. Our design uses the CFU mix from existing work,7 where CFUs contain 2–5 operations. CFUs that have memory units issue load and store requests to the host’s memory management unit. Load requests access a store buffer for store-to-load forwarding.
Instruction management unit (IMU): The IMU has three responsibilities. First, it stores up to 32 compound instructions, each with a maximum of four operands, for up to four dynamic loop iterations (equivalent to a 1024-entry instruction window). Second, it selects instructions with ready operands for execution on the CFU, giving priority to the oldest instruction. Third, it routes incoming values from the network to the appropriate storage locations based on the incoming instruction tag.
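The stated window size follows directly from the unit counts given above (eight distributed units, 32 compound instructions per unit, four in-flight iterations); a quick back-of-envelope check:

```python
# Effective instruction window of the SEED organization described above.
units = 8                      # distributed computation units
compound_insts_per_unit = 32   # compound instructions stored per IMU
iterations_in_flight = 4       # dynamic loop iterations buffered
window_entries = units * compound_insts_per_unit * iterations_in_flight
print(window_entries)  # 1024, matching the stated window size
```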
Communication: The output distribution unit (ODU) is responsible for distributing output values and destination packets (SEED unit + instruction location + iteration offset) onto the bus network, and for buffering them during bus conflicts. A bus interconnect forwards output packets from the ODU to the IMUs of the SEED units that use the corresponding operands. A consequence is that dependent instructions communicating over the bus cannot execute in back-to-back cycles, a limitation of distributed dataflow.
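The back-to-back limitation can be quantified with a small latency model. This is our simplification (unit execution latency, one extra cycle per bus hop, unlimited per-unit bandwidth), not a cycle-accurate description of the hardware.

```python
def completion_time(deps, unit_of):
    """Earliest-completion model for distributed dataflow: an operand
    crossing the bus to a different SEED unit adds one cycle, so
    cross-unit dependents cannot execute back-to-back. Instructions
    are numbered in topological order; unit_of[i] is i's SEED unit."""
    finish = {}
    for i in range(len(unit_of)):
        start = 0
        for p, c in deps:
            if c == i:
                hop = 1 if unit_of[p] != unit_of[i] else 0
                start = max(start, finish[p] + hop)
        finish[i] = start + 1
    return max(finish.values())
```

For a three-instruction chain, keeping all instructions on one unit finishes in 3 cycles, while spreading them across three units takes 5, which is why the instruction scheduler (Section 4) weighs communication cost against parallelism.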
4. SEED Compiler Design
The two main responsibilities of the compiler are determining which regions to specialize and scheduling instructions into CFUs inside SEED regions.
Region selection: The compiler must find or create fully-inlined nested-loop regions that are small enough to fit SEED’s operand and instruction storage. Inner loops should also be unrolled for instruction parallelism. An Amdahl-tree-based approach can be used to select regions.16 We should also avoid regions where the OOO core (through control speculation) or the SIMD units would have performed better. One approach is to use simple heuristics, for example, avoiding control-critical regions. A dynamic approach can be more flexible, for example, training online predictors to give a runtime performance estimate based on per-region statistics. Related work explores this in detail;16, 18 this work simply uses a static oracle scheduler (see Section 5).
Instruction scheduling: The instruction scheduler forms compound instructions and assigns them to units. Its job is to reduce communication cost by creating large compound instructions, while ensuring that combining instructions does not artificially lengthen the critical path. To solve this, we use integer linear programming, specifically extending a general scheduling framework for spatial architectures17 with the ability to model instruction bundling.
5. Evaluation Methodology
For evaluating SEED, OOO-core specialization techniques, and the other designs we compare against, we employ a TDG-based modeling methodology.15 We use McPAT11 at a 22nm technology node to estimate power and area. Von Neumann core configurations are given in Table 1.
The benchmarks were chosen from SPECint and Mediabench,10 representing a variety of control and memory irregularity, as well as some regular benchmarks. To eliminate the effect of compiler/runtime heuristics on when to use which architecture, we use an oracle scheduler, which uses previous runs to decide when to use the OOO core, SEED, or SIMD.
6. Evaluating Dataflow Specialization Potential
To understand the potential and trade-offs of dataflow specialization, we explore the prevalence of the required program structure, per-region performance, and overall heterogeneous-core benefits.
Nested-loop prevalence: Figure 8 shows cumulative distributions of dynamic instruction coverage with varying dynamic region granularity, assuming a maximum of 1024 static instructions per region. Considering regions with a duration of 8K dynamic instructions or longer (x-axis), nested loops can cover 60% of total instructions, whereas inner loops alone cover only 20%. Nested loops also greatly increase the region duration for a given percentage coverage (from 1K to 64K at 40% coverage).
Compound-instruction prevalence: Figure 9 is a per-benchmark histogram of the compound-instruction sizes the compiler created, showing 2–3 primitive instructions on average. This is relatively high, considering that compound instructions cannot cross control regions. Some singleton instructions are unavoidable, however, either because their control regions lack dependent computation or because combining certain instructions would create additional critical-path dependences.
Per-region performance: First, we compare the speedups of SEED to our most aggressive design (OOO4) on a per-region basis. Figure 10 shows SEED’s speedup for the recurring nested-loop program regions (each >1% of total instructions), where the region with the highest contribution to the execution time of the original program is shown in red. Overall, speedup varies dramatically due to significant differences in program characteristics. Speedups of 3×–5× are possible, and many regions show significant speedup. We next examine the reasons for the performance differences of the highest-contribution regions, in several categories.
Performance and energy benefit regions: Compared to the four-wide OOO core, SEED can provide high speedups by exploiting ILP in compute-intensive regions (e.g., djpeg) and by using an effectively larger instruction window to achieve higher memory parallelism (e.g., 429.mcf). The latter regions have high cache-miss rates and clog the instruction window of the OOO processor. Across these regions, indirect memory access is common, which precludes SIMD vectorization.
Energy benefit-only regions: These regions have similar performance to the OOO4, but are 2×–3× more energy efficient. Here, ILP tends to be lower, but control is mostly off the critical path, allowing dataflow to compete. Some regions (e.g., in 164.gzip) actually have high potential ILP, but they are burdened by communication between SEED units. Contrastingly, others (e.g., in jpg2000enc) have significant control, but still perform close to the OOO core; these make up for the lack of speculation by avoiding branch mispredictions and relying on dataflow-based control.
Performance-loss regions: The most common reasons for performance loss are communication latency on the critical path (e.g., 403.gcc, mpeg2dec, and mpeg2enc) and predictable data-dependent control (e.g., 401.bzip2). These are fundamental dataflow limitations. In two cases (including 197.parser), configuration overhead was burdensome. Finally, some of these regions are vectorized on the GPP, and SEED is not optimized to exploit data parallelism. In practice, the above regions would simply not be executed on SEED.
In summary, speedups come from exploiting higher memory parallelism and instruction parallelism, and avoiding mispeculation on unpredictable branches. Slowdowns come from the extra latency cost on more serialized computations.
Finally, we consider the overall performance when integrated with a little, medium, and big core, and compare explicit-dataflow specialization with existing techniques for targeting irregular codes. In-place loop execution, similar to Revolver,8 locks looping instructions into the instruction window to prevent redundant pipeline overheads such as fetch/decode/execute, but does not otherwise change the OOO execution model. Conservation cores21 use software-defined region-specific accelerators, but they do not exploit dataflow-based control (only in-order control and memory). Figure 11 shows the relative performance and energy benefits, normalized to the in-order core alone.
SEED improves performance and energy efficiency across GPP core types, significantly more than existing accelerator and microarchitectural approaches. For the little, medium, and big cores, SEED provides 1.65×, 1.33×, and 1.14× speedup, and 1.64×, 1.70×, and 1.53× energy efficiency, respectively. The energy benefits come primarily from the prevalence of regions where dataflow execution can match the host core’s performance; this occurs 71%, 64%, and 42% of the time for the little, medium, and big Von Neumann cores, respectively.
Understanding disruptive trade-offs: Perhaps more interesting are the disruptive changes that explicit-dataflow specialization introduces for computer architects. First, OOO2+SEED is actually reasonably close in performance to an OOO4 processor on average (within 15%), while reducing energy by 2.3×. Additionally, our estimates suggest that OOO2+SEED occupies less area than an OOO4 GPP core. Therefore, a hybrid dataflow system introduces an interesting path toward a high-performance, low-energy microprocessor: start with an easier-to-engineer modest OOO core, and add a simple, nongeneral-purpose dataflow engine.
An equally interesting trade-off comes from adding a dataflow unit to a larger OOO core: OOO4+SEED has much higher energy efficiency (1.54×) along with a performance improvement of 1.14×. This is a significant leap for energy efficiency, especially considering the difficulty of improving efficiency on complex, irregular workloads such as SPECint.
Overall, all cores can achieve significant energy benefits, little and medium cores can achieve significant speedup, and big cores receive modest performance improvement.
Dataflow specialization is a broadly applicable principle for both general-purpose processors and accelerators. We outline our view on the potentially disruptive implications in these areas as well as potential future directions.
In this work, we showed how a dataflow processor can more efficiently take over and execute certain phases of application workloads, based on their properties. This can be viewed visually, as shown in Figure 12, where we show architecture affinity for programs along dimensions of control and memory regularity. Figure 12(a) shows how prior programmable specialization techniques focus only on a narrow range of workloads—for example, SIMD can speed up only highly regular program phases.
Figure 12. Program phase affinity by application characteristics. Memory ranges from regular and data-independent, to irregular and data-dependent but with parallelism, to latency bound with no parallelism. Control can range from noncritical or not present, critical but repeating, not repeating but predictable, to unpredictable and data-dependent.
Figure 12(b) shows how dataflow specialization further cuts into the space of programs at which traditional architectures are best. Specifically, when the OOO processor’s issue width and instruction-window size limit the achievable ILP, explicit-dataflow processors can exploit that parallelism through distributed dataflow; they also execute more efficiently under control unpredictability. Beyond these region types, dataflow specialization can be applied to create engines that target other behaviors, such as repeatable control, or to further improve highly regular regions by combining dataflow with vector communication.
Future directions: The disruptive potential of exploiting common program phase behavior using a heterogeneous dataflow execution model can have significant implications leading to several important directions:
- Reduced importance of aggressive out-of-order: Dataflow engines that can exploit high-ILP phases reduce the need for aggressive, power-inefficient out-of-order cores. As a corollary, designing a modest-complexity core loosely coupled with a dataflow engine should in principle require less effort than designing a complex OOO core. This could lower the cost of entry into the general-purpose core market, increasing competition and spurring innovation.
- Radical departure from status quo: The simple and modular integration of engines targeting different behaviors, combined with microarchitecture-level dynamic compilation for dataflow ISAs22 can enable such designs to be practical. This opens the potential of exploring designs with radically different microarchitectures and software interfaces, ultimately opening a larger and more exciting design space.
- An alternative secure processor: An open question is how to build future secure processors that are immune to attacks such as Meltdown and Spectre.9 One approach is to simply avoid speculation; this work shows that an in-order core plus SEED may only lose on average around 20% performance with respect to an OOO core alone, at much lower energy.
In contrast to general-purpose processors, accelerators are purpose-built chips integrated at a coarse grain with computing systems, for workloads important enough to the market to justify their design and manufacturing cost. A persistent challenge facing accelerator design is that, to achieve the desired performance and energy efficiency, accelerators often sacrifice generality and programmability, using application- or domain-specific software interfaces. Their architecture and microarchitecture are narrowly tailored to the particular domain and problem being solved.
The principle of heterogeneous Von Neumann/dataflow architectures can help to create a highly efficient accelerator without having to give up on domain-generality. Inspired by the insights here, we demonstrated that domain-specific accelerators rely on fundamentally common specialization principles: specialization of computation, communication, concurrency, data-reuse, and coordination.14 A dataflow model of computation is especially suitable for exploiting the first three principles for massive parallel computation, whereas a Von Neumann model excels at the coordination of control decisions and ordering. We further addressed programmable specialization by proposing a Von Neumann/dataflow architecture called stream-dataflow,13 which specifies memory access and communication as streams, enabling effective specialization of data-reuse in caches and scratchpad memories.
Future directions: The promise of dataflow specialization in the accelerator context is to enable freedom from application-specific hardware development, leading to two important future directions.
- An accelerator architecture: The high energy and area-efficiency of a Von Neumann/dataflow accelerator, coupled with a well-defined hardware/software interface, enables the almost paradoxical concept of an accelerator architecture. We envision that a dataflow-specialized ISA such as stream-dataflow, along with essential hardware specialization principles, can serve as the basis for future innovation for specialization architectures. Its high efficiency makes it an excellent baseline comparison design for new accelerators, and the ease of modifying its hardware/software interface can enable integration of novel forms of computation and memory specialization for challenging workload domains.
- Compilation: How a given program leverages Von Neumann and dataflow mechanisms can have tremendous influence on attainable efficiency, and some methodology is required to navigate this design space. The fundamental compiler problems remain extracting and expressing parallelism and locality. The execution model and application domains make these problems easier to address: applications amenable to acceleration are generally well-behaved (minimizing or avoiding pointers, for example), and the execution model and architecture provide interfaces that cleanly expose the application’s parallelism and locality to the hardware. This opens up exciting opportunities in compiler and programming-languages research targeting accelerators.
This article observed a synergy between Von Neumann and dataflow processors, arising from fine-grain variance in program behavior, and used this insight to build a practical processor, SEED. It enables potentially disruptive performance and energy-efficiency trade-offs for general-purpose processors, pushing the boundary of what is possible given only a modestly complex core. This approach of specializing for program behaviors using heterogeneous dataflow architectures could open a new design space, ultimately reducing the importance of aggressive OOO designs and leading to greater opportunity for radical architecture innovation.