Viewpoint

Moving from Petaflops to Petadata

The race to build ever-faster supercomputers is on, with more contenders than ever before. However, the current goals set for this race may not lead to the fastest computation for particular applications.

The supercomputer community is now facing an interesting situation: Systems exist that, for some sophisticated applications and some relevant performance measures, demonstrate an order of magnitude higher performance11,14,24 than the top systems on the TOP500 supercomputers list,2 yet are not on that list. Most TOP500 machines reach more than 80% efficiency when running LINPACK; when running real engineering applications, however, they reach significantly less (~5%), due to suboptimal manipulation of matrices or the need to execute non-numerical operations.

A creator of the TOP500 supercomputers list rightfully claimed the list sheds light on only one dimension of modern supercomputing,6 and a relatively narrow one at that. This Viewpoint is intended to induce thinking about alternative performance measures for ranking, possibly ones with a much wider scope.20 It does not offer a solution; it offers a theme for brainstorming.

To demonstrate the need for such thinking, we will use the example of a particular type of system based on a kind of dataflow approach. Namely, we will focus on the solutions developed by Maxeler Technologies.12 Typical applications of such systems include geomechanical simulations,11 financial stochastic PDEs,24 and seismic modeling in the oil and gas industry.14 There are several other efficient solutions with even more application-specific approaches, for example, the Anton machine for the calculation of interparticle forces in molecular dynamics simulations.19

Our perspective is that the performance metric should become multidimensional—measuring more than just FLOPS, for example, performance per watt, performance per cubic foot, or performance per monetary unit (dollar, yen, yuan, euro, and so forth).

Here, we concentrate on the following issues: rationales (what evolutionary achievements may justify a possible paradigm shift in the ranking domain); justification (what numerical measurements require rethinking); and suggestions (what possible avenues could lead to improvements of the ranking paradigm). We conclude by specifying to whom all this might be most beneficial and by opening possible directions for future research.

Back to Top

Rationales

The current era of supercomputing is referred to as the petascale era. The next big HPC challenge is to break the exascale barrier. However, due to technological limitations,16,23 there is growing agreement that reaching this goal will require a substantial shift toward hardware/software co-design.3,7,10,18 The driving idea behind custom dataflow supercomputers (like the Maxeler solution) falls into this category: to implement the computational dataflow in a custom hardware accelerator. To achieve maximum performance, the kernel of the application is compiled into a dataflow engine. The resulting array structure can be hundreds to thousands of pipeline stages deep. Ideally, in the static dataflow form, data can enter and exit each stage of the pipeline in every cycle. It cannot be precisely itemized what portion of the improved performance is due to the dataflow concept and what portion is due to customization, because the dataflow concept is used as the vehicle that provides customization in hardware.
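
To make the pipeline picture concrete, here is a minimal, purely illustrative Python sketch of a static dataflow pipeline (it does not use Maxeler's tool chain; the three stage functions are invented placeholders). Once the pipeline is full, one input enters and one result leaves on every cycle, so throughput is set by how fast data can be streamed through the chip rather than by the latency of any single result.

    # Illustrative model of a static dataflow pipeline: the kernel is unrolled
    # into a chain of stages with one register between consecutive stages.
    # Hypothetical stage functions stand in for a compiled application kernel.
    STAGES = [lambda x: 0.5 * x,      # scale
              lambda x: x + 1.0,      # offset
              lambda x: x * x]        # square

    def run_static_dataflow(inputs):
        """Clock one input per cycle through a pipeline with one register per stage."""
        depth = len(STAGES)
        regs = [None] * depth                 # pipeline registers (output of stage i)
        results = []
        # Feed the inputs, then enough empty cycles to drain the pipeline.
        for value in list(inputs) + [None] * depth:
            if regs[-1] is not None:          # a finished result leaves the last stage
                results.append(regs[-1])
            # Every in-flight value advances exactly one stage per cycle.
            regs = ([STAGES[0](value) if value is not None else None]
                    + [STAGES[i + 1](regs[i]) if regs[i] is not None else None
                       for i in range(depth - 1)])
        return results

    print(run_static_dataflow([1.0, 2.0, 3.0]))   # -> [2.25, 4.0, 6.25]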

These dataflow systems typically use a relatively slow clock, yet complete the overall dataflow more efficiently. The slow clock is not a problem for big data computations, since the speed of computation depends on pin throughput and on the size and bandwidth of the local memory inside the computational chip. Even when the dataflow is implemented using FPGA chips, whose general-purpose interconnect imposes a clock slowdown, performance is not affected: pin throughput and local memory size/bandwidth are the bottleneck. The sheer magnitude of the dataflow parallelism can be used to overcome the initial speed disadvantage. Therefore, if counting is oriented to performance measures correlated with clock speed, these systems perform poorly. However, if counting is oriented to performance measures sensitive to the amount of data processed, these systems may perform remarkably well. This is the first issue of importance.
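
A back-of-the-envelope calculation, using purely hypothetical figures, illustrates why the clock ceases to matter once a deep, wide pipeline is fed from the chip's pins:

    # Hypothetical figures chosen only to illustrate the bandwidth-bound regime.
    PIN_BANDWIDTH = 40e9        # bytes/s moved on and off the chip (assumed)
    BYTES_PER_ITEM = 8          # one double-precision input per result (assumed)
    PIPELINE_WIDTH = 512        # parallel pipelines instantiated on the chip (assumed)

    def results_per_second(clock_hz):
        """One result per pipeline per cycle, but never faster than the pins allow."""
        compute_bound = clock_hz * PIPELINE_WIDTH
        bandwidth_bound = PIN_BANDWIDTH / BYTES_PER_ITEM
        return min(compute_bound, bandwidth_bound)

    # A 150-MHz dataflow engine and a 3-GHz design hit the same bandwidth ceiling.
    for clock in (150e6, 3e9):
        print(f"{clock / 1e6:>6.0f} MHz -> {results_per_second(clock):.2e} results/s")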


The second important issue is that, due to their lower clock speed, systems based on this kind of dataflow approach consume less power, less space, and less money than systems driven by a fast clock. Weston24 shows that the measured speedups (31x and 37x) were achieved while reducing the power consumption of a 1U compute node. Combining power and performance measures is a challenge that is already starting to be addressed by the Green 500 list. However, evaluating radically different models of computation such as dataflow has yet to be addressed, especially in the context of total cost of ownership.

In addition to the aforementioned issues, the third issue of importance is that systems based on this kind of dataflow approach perform poorly on relatively simple benchmarks, which are typically not rich in the amount and variety of data structures. However, they perform quite well on relatively sophisticated benchmarks that are rich in the amount and variety of data structures.

Back to Top

Justification

Performance of an HPC system depends on the adaptation of a computational algorithm to the problem, the discretization of the problem, the mapping onto data structures and representable numbers, the dataset size, and the suitability of the underlying architecture compared to all other choices in the spectrum of design options. In light of all these choices, how does one evaluate a computer system's suitability for a particular task such as climate modeling or genetic sequencing?

If we examine the TOP500 list (based on LINPACK, a relatively simple benchmark dealing with LU decomposition), the top is dominated by traditional, control-flow-based systems. One would expect these systems to offer the highest performance. However, if we turn to a relatively data-intensive workload (on the order of gigabytes) used in banking environments, we see a system that shows a speedup of over 30 times compared to a traditional control-flow-driven system.24 On a highly data-intensive workload (on the order of terabytes) used by geophysicists, a speedup of 70 times has been demonstrated.11 On an extremely data-intensive workload (on the order of petabytes) used by petroleum companies, the same dataflow system shows an even greater speedup, close to 200 times.14

Of course, these results are obtained by creating a custom dataflow architecture for the specified problems. The question may arise: Could a custom LINPACK implementation not reveal the same potential? Indeed, custom hardware implementations of LINPACK have been shown to yield speedups, but on a much smaller scale, 2–6 times.17,22 Furthermore, we believe all-out efforts to create custom machines targeting LINPACK as a benchmark would not be highly useful, nor would the results be very informative, especially since even the implications of LINPACK results produced by more general-purpose systems are already questioned.13 Additional confirmation of our opinion can be found in the fact that the TOP500 does not accept or publish results from systems with custom LINPACK hardware.

Back to Top

Suggestions

Taking all of these issues into account leads to the following two statements and suggestions.

1. The FLOPS count does not, on its own, sufficiently cover all aspects of HPC systems. To an extent it does provide estimates of HPC performance; however, it does not do so equally effectively for different types of systems.

This statement is not novel to the HPC community, as indicated by Faulk,4 Pancake,15 and Wolter.25 In fact, assessing the productivity of HPC systems has been one of the emphases of the Defense Advanced Research Projects Agency (DARPA) High Productivity Computing Systems (HPCS) program, with the International Journal of High Performance Computing Applications devoting a special issue to just this topic.9 Yet the FLOPS count seems to persist as the dominant measure of performance of HPC systems.

We are not suggesting that the FLOPS count be eliminated, but rather that a data-centric measure could shed some more light on other aspects of HPC systems.

One idea is to look at the rate of result data generation per second (petabytes per second), per cubic foot, and per watt, for a particular algorithm and dataset size. The initial justification for expanding the FLOPS count can be found in the following fact: floating-point arithmetic no longer dominates execution time, even in computationally intensive workloads, and even on conventional architectures.4 Furthermore, some unconventional systems (like the Maxeler systems) have relatively large amounts of on-chip memory and avoid some types of instructions altogether, which further blurs the picture obtained from looking at FLOPS. Consequently, for data processing applications, the rate of producing results is the logical measure, regardless of the type and number of operations required to generate that result.
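
As a sketch of how such a data-centric figure of merit might be computed, consider the following Python fragment; every name and number in it is an illustrative assumption, not a measurement from any published system:

    from dataclasses import dataclass

    @dataclass
    class Run:
        name: str
        result_bytes: float     # bytes of results produced for a fixed problem and dataset
        seconds: float          # wall-clock time for the run
        cubic_feet: float       # machine volume
        watts: float            # average power draw during the run

        def petadata_score(self):
            """Result petabytes per second, per cubic foot, per watt."""
            pb_per_s = self.result_bytes / 1e15 / self.seconds
            return pb_per_s / self.cubic_feet / self.watts

    # Hypothetical comparison: a fast-clock cluster versus a 1U dataflow node.
    cluster  = Run("control-flow cluster", 2e15, 3600.0, 400.0, 90_000.0)
    dataflow = Run("1U dataflow node",     2e15, 1800.0,   1.5,    800.0)

    for r in (cluster, dataflow):
        print(f"{r.name:>22}: {r.petadata_score():.3e} PB/s per ft^3 per W")

The point of the normalization is that two machines producing the same results for the same problem are compared on how much time, space, and energy they need to do so.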

Of course, financial considerations play a major role in computing. However, it is unreasonable to include non-transparent and ever-negotiated pricing information in an engineering measure. We know the cost of computer systems is dictated by the cost of the chips, and chip cost is a function of the regularity of the design, the VLSI process, the chip area, and, most importantly, the volume. Encapsulating all these components in a measure remains a challenge.

2. LINPACK, the workload used to create the TOP500 Supercomputer list, is, on its own, not a sufficient predictor of performance.

Again, this point is not novel to the HPC community, as indicated in Anderson,1 Gahvari,5 Geller,6 and, in Singh,20 by a creator of the TOP500 list. Alternatives and expansions have been suggested, some of the most notable being the Graph 500 and the HPC Challenge. Both aim to include a wider set of measures that substantially contribute to the performance of HPC systems running real-world HPC applications. Taken together, these benchmarks provide a more holistic picture of an HPC system. However, they are still focused only on control-flow computing, rather than on a more data-centric view that could scale the relevance of the included measures to large production applications.


This Viewpoint offers an alternate road to consider. Again, we do not suggest that LINPACK, Graph 500, or the HPC Challenge be abandoned altogether, but that they be supplemented with another type of benchmark: the performance of systems when used to solve real-life problems rather than generic benchmarks. Of course, the question is how to choose these problems. One option may be to analyze the TOP500 and/or Graph 500 and/or a list of the most expensive HPC systems, for example, and to extract a number of problems that top-ranking systems have most commonly been used for. Such a ranking would also be of use to HPC customers, as they could consult the list for the problem most similar to their own.

Finally, by periodically updating the list, this type of ranking would naturally evolve with both HPC technology and the demands placed on HPC systems. A generic benchmark, on the other hand, must be designed by looking at a current HPC system and its bottlenecks, at the typical demands of current HPC problems, or at both. As these two change over time, so must the benchmarks.

Back to Top

Conclusion

The findings in this Viewpoint are pertinent to those supercomputing users who wish to minimize not only the purchase costs but also the maintenance costs for a given performance requirement. They are also pertinent to those manufacturers of supercomputing-oriented systems who are able to deliver more for less, but are using unconventional architectures.21

Topics for future research include ways to incorporate price/complexity issues as well as satisfaction/profile issues. The -ability issues (availability, reliability, extensibility, partitionability, programmability, portability, and so forth) are also of importance for any future ranking efforts.

Whenever a paradigm shift happens in computer technology, computer architecture, or computer applications, a new approach to ranking has to be introduced. The same type of thinking was needed when GaAs technology was introduced for high-radiation environments and had to be compared with silicon technology across a new set of relevant architectural issues; solutions that had ranked high until that moment suddenly found themselves in relatively low-ranking positions.8

As we direct efforts to break the exascale barrier, we must ensure the scale itself is appropriate. A scale is needed that offers as much meaning as possible and translates, to the highest possible degree, into real, usable performance. Such a scale should retain both of these properties even when applied to unconventional computational approaches.

Back to Top

Figures

UF1 Figure. Oak Ridge National Laboratory’s Titan supercomputer.

Back to Top

References

    1. Anderson, M. Better benchmarking for supercomputers. IEEE Spectrum 48, 1 (Jan. 2011), 12–14.

    2. Dongarra, J., Meuer, H., and Strohmaier, E. TOP500 supercomputer sites; http://www.netlib.org/benchmark/top500.html.

    3. Dosanjh, S. et al. Achieving exascale computing through hardware/software co-design. In Proceedings of the 18th European MPI Users' Group Conference on Recent Advances in the Message Passing Interface (EuroMPI'11), Springer-Verlag, Berlin, Heidelberg, (2011), 5–7.

    4. Faulk, S. et al. Measuring high performance computing productivity. International Journal of High Performance Computing Applications 18 (Winter 2004), 459–473; DOI: 10.1177/1094342004048539.

    5. Gahvari, H. et al. Benchmarking sparse matrix-vector multiply in five minutes. In Proceedings of the SPEC Benchmark Workshop (Jan. 2007).

    6. Geller, T. Supercomputing's exaflop target. Commun. ACM 54, 8 (Aug. 2011), 16–18; DOI: 10.1145/1978542.1978549.

    7. Gioiosa, R. Towards sustainable exascale computing. In Proceedings of the VLSI System on Chip Conference (VLSI-SoC), 18th IEEE/IFIP (2010), 270–275.

    8. Helbig, W. and Milutinovic, V. The RCA's DCFL E/D MESFET GaAs 32-bit experimental RISC machine. IEEE Transactions on Computers 36, 2 (Feb. 1989), 263–274.

    9. Kepner, J. HPC productivity: An overarching view. International Journal of High Performance Computing Applications 18 (Winter 2004), 393–397; DOI: 10.1177/1094342004048533.

    10. Kramer, W. and Skinner, D. An exascale approach to software and hardware design. Int. J. High Perform. Comput. Appl. 23, 4 (Nov. 2009), 389–391.

    11. Lindtjorn, O. et al. Beyond traditional microprocessors for geoscience high-performance computing applications. IEEE Micro 31, 2 (Mar/Apr. 2011).

    12. Maxeler Technologies (Oct. 20, 2011); http://www.maxeler.com/content/frontpage/

    13. Mims, C. Why China's new supercomputer is only technically the world's fastest. Technology Review (Nov. 2010).

    14. Oriato, D. et al. Finite difference modeling beyond 70Hz with FPGA acceleration. In Proceedings of the SEG 2010, HPC Workshop, Denver, (Oct. 2010).

    15. Pancake, C. Those who live by the flop may die by the flop. Keynote Address, 41st International Cray User Group Conference (Minneapolis, MN, May 24–28, 1999).

    16. Patt, Y. Future microprocessors: What must we do differently if we are to effectively utilize multi-core and many-core chips? Transactions on Internet Research 5, 1 (Jan. 2009), 5–10.

    17. Ramalho, E. The LINPACK benchmark on a multi-core multi-FPGA system. University of Toronto, 2008.

    18. Shalf, J. et al. Exascale computing technology challenges. VECPAR (2010), 1–25; https://www.nersc.gov/assets/NERSC-Staff-Publications/2010/ShalfVecpar2010.pdf.

    19. Shaw, D.E. et al. Anton, a special-purpose machine for molecular dynamics simulation. Commun. ACM 51, 7 (July 2008), 91–97; DOI: 10.1145/1364782.1364802.

    20. Singh, S. Computing without processors. Commun. ACM 54, 8 (Aug. 2011), 46–54; DOI: 10.1145/1978542.1978558.

    21. Stojanovic, S. et al. A comparative study of selected hybrid and reconfigurable architectures. In Proceedings of the IEEE ICIT Conference, (Kos, Greece, Mar. 2012).

    22. Turkington, K. et al. FPGA-based acceleration of the LINPACK benchmark: A high level code transformation approach. In Proceedings of the IEEE International Conference on Field Programmable Logic and Applications (Madrid, Spain, Aug. 2006), 375–380.

    23. Vardi, M.Y. Is Moore's party over? Commun. ACM 54, 11 (Nov. 2011); DOI: 10.1145/2018396.2018397.

    24. Weston, S. et al. Rapid computation of value and risk for derivatives portfolio. Concurrency and Computation: Practice and Experience, Special Issue (July 2011); DOI: 10.1002/cpe.1778.

    25. Wolter, N. et al. What's working in HPC: Investigating HPC user behavior and productivity. CT Watch Quarterly (Nov. 2006).

    This research was supported by discussions at the Barcelona Supercomputing Centre, during the FP7 EESI Final Project Meeting. The strategic framework for this work was inspired by Robert Madelin and Mario Campolargo of the EC, and was presented in the keynote of the EESI Final Project Meeting. The work of V. Milutinovic and G. Rakocevic was partially supported by the iii44006 grant of the Serbian Ministry of Science.
