The supercomputer community is now facing an interesting situation: Systems exist that, for some sophisticated applications and some relevant performance measures, demonstrate an order of magnitude higher performance11,14,24 than the top systems on the TOP500 supercomputers list,2 yet they are not on that list. Most TOP500 machines achieve more than 80% efficiency when running LINPACK; when the same machines run real engineering applications, however, they achieve significantly less (~5%), due to suboptimal handling of matrices or the need to execute non-numerical operations.
A creator of the TOP500 supercomputers list has rightfully claimed the list sheds light on only one dimension of modern supercomputing,6 and a relatively narrow one at that. This Viewpoint is intended to stimulate thinking about alternative performance measures for ranking, possibly ones with a much wider scope.20 It does not offer a solution; it offers a theme for brainstorming.
To demonstrate the need for such thinking, we use the example of a particular type of system based on a dataflow approach. Specifically, we focus on the solutions developed by Maxeler Technologies.12 Typical applications of such systems include geomechanical simulations,11 financial stochastic PDEs,24 and seismic modeling in the oil and gas industry.14 There are several other efficient solutions with even more application-specific approaches, for example, the Anton machine for calculating interparticle forces in molecular dynamics simulations.19
Our perspective is that the performance metric should become multidimensional—measuring more than just FLOPS, for example, performance per watt, performance per cubic foot, or performance per monetary unit (dollar, yen, yuan, euro, and so forth).
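As a rough illustration only (all names and figures in the sketch below are hypothetical, not drawn from any published ranking), such a multidimensional record might be reported alongside the raw FLOPS figure as follows:

```python
# Hypothetical sketch of a multidimensional performance record.
# All field names and numbers are illustrative, not from any published ranking.
from dataclasses import dataclass

@dataclass
class SystemReport:
    name: str
    sustained_flops: float    # sustained floating-point operations per second
    power_watts: float        # measured power draw
    volume_cubic_feet: float  # physical footprint
    cost_dollars: float       # acquisition cost, if disclosed

    def metrics(self) -> dict:
        """Return a vector of normalized measures instead of a single FLOPS number."""
        return {
            "flops": self.sustained_flops,
            "flops_per_watt": self.sustained_flops / self.power_watts,
            "flops_per_cubic_foot": self.sustained_flops / self.volume_cubic_feet,
            "flops_per_dollar": self.sustained_flops / self.cost_dollars,
        }

# Example with made-up numbers: a 2 PFLOPS system drawing 4 MW in 5,000 cubic feet.
report = SystemReport("hypothetical-system", 2e15, 4e6, 5000, 1e8)
print(report.metrics())
```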
Here, we concentrate on the following issues: rationales (the evolutionary achievements that may justify a possible paradigm shift in the ranking domain); justification (the numerical measurements that call for rethinking); and suggestions (possible avenues toward improving the ranking paradigm). We conclude by specifying who might benefit most from all this and by opening possible directions for future research.
Rationales
The current era of supercomputing is referred to as the petascale era. The next big HPC challenge is to break the exascale barrier. However, due to technological limitations,16,23 there is growing agreement that reaching this goal will require a substantial shift toward hardware/software co-design.3,7,10,18 The driving idea behind custom dataflow supercomputers (like the Maxeler solution) falls into this category: implement the computational dataflow in a custom hardware accelerator. To achieve maximum performance, the kernel of the application is compiled into a dataflow engine. The resulting array structure can be hundreds to thousands of pipeline stages deep. Ideally, in the static dataflow form, data can enter and exit each stage of the pipeline in every cycle. It is difficult to itemize precisely what portion of the improved performance is due to the dataflow concept and what portion is due to customization, because the dataflow concept is the vehicle that provides customization in hardware.
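To make the pipelining idea concrete, here is a minimal conceptual sketch, not Maxeler's actual tool chain or API: it models a static dataflow pipeline in which each stage holds one value per cycle, so that once the pipeline is full, one result leaves the array every cycle regardless of its depth.

```python
# Conceptual model of a static dataflow pipeline (not a real dataflow compiler).
# Each stage holds one value per cycle; once the pipeline is full,
# one result leaves the array every cycle, however deep the pipeline is.

def simulate_pipeline(stages, stream):
    """stages: list of single-argument functions, one per pipeline stage.
    stream: iterable of input values. Yields results in arrival order."""
    depth = len(stages)
    registers = [None] * depth  # the value held in each stage this cycle

    for value in list(stream) + [None] * depth:  # extra cycles to drain the pipeline
        # The value leaving the last stage is a finished result.
        if registers[-1] is not None:
            yield registers[-1]
        # Shift every held value one stage forward, applying that stage's operation.
        for i in range(depth - 1, 0, -1):
            registers[i] = stages[i](registers[i - 1]) if registers[i - 1] is not None else None
        registers[0] = stages[0](value) if value is not None else None

# Toy kernel: three chained arithmetic stages applied to a data stream.
results = list(simulate_pipeline([lambda x: x * 2, lambda x: x + 1, lambda x: x ** 2], range(5)))
print(results)  # [(0*2+1)**2, (1*2+1)**2, ...] -> [1, 9, 25, 49, 81]
```

Pipeline depth only affects the latency to the first result; steady-state throughput remains one result per cycle per pipeline, which is what makes very deep custom arrays attractive.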
These dataflow systems typically use a relatively slow clock, yet they complete the overall dataflow more efficiently. The slow clock is not a problem for big data computations, since the speed of computation depends on pin throughput and the size and bandwidth of local memory inside the computational chip. Even when the dataflow is implemented on FPGA chips, whose general-purpose interconnect slows the clock further, performance is not affected: pin throughput and local memory size/bandwidth remain the bottleneck. The sheer magnitude of dataflow parallelism can be used to overcome the initial speed disadvantage. Therefore, if ranking is based on performance measures correlated with clock speed, these systems perform poorly. However, if ranking is based on performance measures sensitive to the amount of data processed, these systems may perform very well. This is the first issue of importance.
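As a back-of-the-envelope illustration with purely hypothetical numbers, the following sketch shows why, once the pins and local memory are saturated, raising the clock would not raise delivered throughput:

```python
# Back-of-the-envelope throughput estimate with purely hypothetical numbers.
clock_hz = 200e6            # dataflow array clock (slow compared with a ~3 GHz CPU)
parallel_pipes = 1000       # independent pipelines placed on the chip
bytes_per_result = 8        # one double-precision result per pipe per cycle
pin_bandwidth = 100e9       # bytes/s the chip's pins and local memory can sustain

compute_rate = clock_hz * parallel_pipes * bytes_per_result  # demand from the array
effective_rate = min(compute_rate, pin_bandwidth)            # capped by the pins

print(f"demanded by the array: {compute_rate / 1e9:.0f} GB/s")
print(f"delivered (pin-limited): {effective_rate / 1e9:.0f} GB/s")
# The array already over-saturates the pins at 200 MHz; a faster clock
# would not raise delivered throughput, which is the point made above.
```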
The second important issue is that, due to their lower clock speed, systems based on this kind of dataflow approach consume less power, occupy less space, and cost less than systems driven by a fast clock. Weston24 shows that the measured speedups (31x and 37x) were achieved while also reducing the power consumption of a 1U compute node. Combining power and performance measures is a challenge already being addressed by the Green 500 list. However, evaluating radically different models of computation such as dataflow has yet to be addressed, especially in the context of total cost of ownership.
In addition to the aforementioned issues, the third issue of importance is that systems based on this kind of dataflow approach perform poorly on relatively simple benchmarks, which are typically not rich in the amount and variety of data structures, yet they perform fairly well on relatively sophisticated benchmarks that are rich in both.
Justification
Performance of an HPC system depends on the adaptation of a computational algorithm to the problem, the discretization of the problem, the mapping onto data structures and representable numbers, the dataset size, and the suitability of the underlying architecture relative to all other choices in the spectrum of design options. In light of all these choices, how does one evaluate a computer system’s suitability for a particular task such as climate modeling or genetic sequencing?
If we examine the TOP500 list (based on LINPACK, a relatively simple benchmark dealing with LU decomposition), the top is dominated by traditional, control-flow-based systems. One would expect these systems to offer the highest performance. However, if we turn to a relatively data-intensive workload (on the order of gigabytes) used in banking environments, we see a system that shows a speedup of over 30 times compared to a traditional control-flow-driven system.24 On a highly data-intensive workload (on the order of terabytes) used by geophysicists, a speedup of 70 times has been demonstrated.11 On an extremely data-intensive workload (on the order of petabytes) used by petroleum companies, the same dataflow system shows an even greater speedup, close to 200 times.14
Of course, these results are obtained by creating a custom dataflow architecture for the specified problems. The question may arise: Could a LINPACK implementation not reveal the same potential? Indeed, custom hardware implementations of LINPACK have been shown to yield speedups, but on a much smaller scale, around 26 times.17,22 Furthermore, we believe that all-out efforts to build custom machines targeting LINPACK (as a benchmark) would not be highly useful, and neither would the results be very informative, especially since even the implications of LINPACK results produced by more general-purpose systems are already being questioned.13 Additional confirmation of our opinion can be found in the fact that the TOP500 does not accept or publish results from systems with custom LINPACK hardware.
Suggestions
Taking all of these issues into account leads to the following two statements and suggestions.
1. The FLOPS count does not, on its own, sufficiently cover all aspects of HPC systems. To an extent it does provide estimates of HPC performance; however, it does not do so equally effectively for different types of systems.
This statement is not novel to the HPC community, as indicated by Faulk,4 Pancake,15 and Wolter.25 In fact, assessing the productivity of HPC systems has been one of the emphases of the Defense Advanced Research Projects Agency (DARPA) High Productivity Computing Systems (HPCS) program, with the International Journal of High Performance Computing Applications devoting a special issue to this very topic.9 Yet the FLOPS count persists as the dominant measure of HPC system performance.
We are not suggesting that the FLOPS count be eliminated, but rather that a data-centric measure could shed some more light on other aspects of HPC systems.
One idea is to look at the rate of result-data generation (in petabytes per second), per cubic foot, and per watt, for a particular algorithm and dataset size. The initial justification for expanding the FLOPS count can be found in the following fact: floating-point arithmetic no longer dominates execution time, even in computationally intensive workloads and even on conventional architectures.4 Furthermore, some unconventional systems (like the Maxeler systems) have relatively large amounts of on-chip memory and avoid some types of instructions altogether, which further blurs the image obtained by looking at FLOPS. Consequently, for data processing applications, the rate of producing results is the logical measure, regardless of the type and number of operations required to generate those results.
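A minimal sketch of how such a data-centric measure might be computed, using made-up figures and a hypothetical helper function, could look as follows:

```python
# Sketch of the proposed data-centric measure, with hypothetical inputs.
# A "run" is one execution of a specific algorithm on a specific dataset size.

def result_rate_metrics(result_bytes, elapsed_s, power_watts, volume_cubic_feet):
    """Rate of producing result data, normalized by power and physical volume."""
    bytes_per_second = result_bytes / elapsed_s
    return {
        "bytes_per_second": bytes_per_second,
        "bytes_per_second_per_watt": bytes_per_second / power_watts,
        "bytes_per_second_per_cubic_foot": bytes_per_second / volume_cubic_feet,
    }

# Example with made-up numbers: 3 PB of results produced in 6 hours
# by a machine drawing 2 MW in 4,000 cubic feet.
print(result_rate_metrics(3e15, 6 * 3600, 2e6, 4000))
```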
Of course, financial considerations play a major role in computing. However, it is unreasonable to include non-transparent and ever-negotiated pricing information in an engineering measure. We know the cost of computer systems is dictated by the cost of the chips, and the cost of the chips is a function of the regularity of the design, the VLSI process, the chip area, and, most importantly, the volume. Encapsulating all these components in a measure remains a challenge.
2. LINPACK, the workload used to create the TOP500 Supercomputer list, is, on its own, not a sufficient predictor of performance.
Again, this point is not novel to the HPC community, as indicated in Anderson,1 Gahvari,5 Geller,6 and, in Singh,20 by a creator of the TOP500 list. Alternatives and expansions have been suggested, some of the most notable being the Graph 500 and the HPC Challenge. Both aim to include a wider set of measures that substantially contribute to the performance of HPC systems running real-world HPC applications. Taken together, these benchmarks provide a more holistic picture of an HPC system. However, they are still focused only on control-flow computing, rather than on a more data-centric view that could scale the relevance of the included measures to large production applications.
This Viewpoint offers an alternative road to consider. Again, we do not suggest that LINPACK, Graph 500, or the HPC Challenge be abandoned altogether, but rather that they be supplemented with another type of benchmark: the performance of systems when used to solve real-life problems, rather than generic benchmarks. Of course, the question is how to choose these problems. One option may be to analyze the TOP500 and/or Graph 500 and/or a list of the most expensive HPC systems, for example, and to extract a number of problems that top-ranking systems have most commonly been used for. Such a ranking would also be of use to HPC customers, as they could consult the list for whichever problem is most similar to their own.
Finally, by periodically updating the list, this type of ranking would naturally evolve with both HPC technology and the demands placed on HPC systems. A generic benchmark, on the other hand, must be designed by looking at current HPC systems and their bottlenecks, at the typical demands of current HPC problems, or both. As these change over time, so must the benchmarks.
Conclusion
The findings in this Viewpoint are pertinent to those supercomputing users who wish to minimize not only purchase costs but also maintenance costs for a given performance requirement. They are also pertinent to those manufacturers of supercomputing-oriented systems who are able to deliver more for less, but do so using unconventional architectures.21
Topics for future research include ways to incorporate price/complexity issues as well as satisfaction/profile issues. The ability issues (availability, reliability, extensibility, partitionability, programmability, portability, and so forth) are also important for any future ranking effort.
Whenever a paradigm shift happens in computer technology, computer architecture, or computer applications, a new ranking approach has to be introduced. The same type of rethinking occurred when GaAs technology was introduced for high-radiation environments and had to be compared with silicon technology on a new set of relevant architectural issues. Solutions that had ranked high until that moment suddenly found themselves in new and relatively low-ranking positions.8
As we direct efforts to break the exascale barrier, we must ensure the scale itself is appropriate. A scale is needed that offers as much meaning as possible and translates, to the highest possible degree, into real, usable performance. Such a scale should retain these two properties even when applied to unconventional computational approaches.