Technical Perspective: For Better or Worse, Benchmarks Shape a Field

Like those in other IT fields, computer architects initially reported incomparable results. We quickly saw the folly of this approach. We then went through a sequence of performance metrics, each an improvement on its predecessor: average instruction time, millions of instructions per second (MIPS), millions of floating-point operations per second (MFLOPS), synthetic program performance (Dhrystone), and ultimately average performance improvement relative to a reference computer based on a suite of real programs (SPEC CPU).
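
As a rough sketch of the SPEC CPU scoring scheme (the notation is illustrative; the suite's actual reference times and program counts are not reproduced here): each program $i$ is timed on the machine under test ($t_i$) and on a fixed reference computer ($t_{\mathrm{ref},i}$), and the summary score is the geometric mean of the per-program speedups:

$$\mathrm{SPECratio}_i = \frac{t_{\mathrm{ref},i}}{t_i}, \qquad \mathrm{score} = \left( \prod_{i=1}^{n} \mathrm{SPECratio}_i \right)^{1/n}$$

A useful property of the geometric mean is that the ratio of two machines' scores does not depend on which computer was chosen as the reference.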

When a field has good benchmarks, we settle debates and the field makes rapid progress. Indeed, the acceleration in computer performance from 25% to 50% per year starting in the mid-1980s is due in part to our ability to fairly compare competing designs as well as to Moore’s Law. Similarly, computer vision made dramatic advances in the last decade after it embraced benchmarks to evaluate innovations in vision algorithms.a

Sadly, when a field has bad benchmarks, progress can be problematic. For example, although Dhrystone has been discredited in textbooks since 1990,b embedded computing still reports Dhrystone results when making performance claims. How do we know whether a new embedded processor is a genuine breakthrough or simply the result of cynical benchmarketing, in that it runs the benchmark quickly but real programs slowly? The answer is that we cannot know from Dhrystone reports.

In the following paper, the authors point out that while computer architecture has a glorious past, it has become a victim of its own success. The SPEC organization has been selecting old programs written in old languages that reflect the state of programming in the 1980s. Given the 1,000,000X improvement in cost-performance since work on C++ began in 1979, most programmers have moved on to more productive languages. Indeed, a recent survey supports that claim: only 25% of programs are being written in languages like C and C++.c Hence, the authors supplement SPEC's C and C++ programs, which manage storage manually, with Java programs that manage storage automatically. They call the former native and the latter managed.

The paper reflects a second important trend. The limit on the power a chip can dissipate forced microprocessor manufacturers to switch from a single high-clock-rate processor per chip to multiple processors, or cores, per chip. Thus, the authors include both sequential and parallel benchmarks; they call the former non-scalable and the latter scalable. Moreover, the authors report power and energy in addition to performance. In this post-PC era, battery life can trump performance in the client, and the architects of warehouse-scale computers try to optimize the cost of powering and cooling 100,000 servers as well as improving cost-performance. Just as we learned that measuring time in seconds is a safer measure of program performance than a rate like MIPS, we are learning that joules is a better measure than a rate like watts, which is just joules per second. The authors report both watts and joules in addition to relative performance.
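
A quick worked example of why joules can matter more than watts (the numbers are illustrative, not measurements from the paper): at constant power, energy is power times time, $E = P \cdot t$. A 3 W processor that finishes a task in 5 s consumes $3 \times 5 = 15$ J, while a 2 W processor that needs 10 s consumes $2 \times 10 = 20$ J, so the lower-power design actually drains more battery for the same work.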

Given this measurement framework, the authors measured eight very different Intel microprocessors built over a seven-year period. They evaluate these eight microprocessors using 61 programs, each of which fits into one quadrant of the matrix formed by crossing native versus managed with non-scalable versus scalable (in the discussion below, the red quadrant is native non-scalable, yellow is native scalable, and green is managed scalable).

This treasure chest of data—recorded in large tables in the ACM Digital Library in addition to this paper—allows the authors (and the rest of us) to ask and answer many questions based on real hardware. This opportunity is a refreshing change from research results based on simulation, which has dominated the literature for the last decade.

Here are four examples of questions we can now address:

  • Do performance, power, or energy of the red quadrant programs predict the results of any of the programs from the other quadrants? (If not, the architects must extend their conventional evaluation methods.)
  • Do the native parallel (yellow) programs predict the managed parallel (green) programs? (If not, given the shift in popularity of programming languages, then many evaluations of multiprocessors should be redone.)
  • Is multicore more energy efficient than multithreading? (If not, then multicore designs that do not offer multithreading may be suspect.)
  • Surely recent in-order execution processors are more energy efficient than modern four-instruction-issue, out-of-order execution processors? (If not, conventional wisdom on energy efficient processor design is incorrect.)

Hint: The answers are: No, No, No, and No.

Readers interested in finding the answers to these and computing's other persistent questions need only read the following paper.
