Research and Advances
Architecture and Hardware Research highlights

Technical Perspective: A Methodology For Evaluating Computer System Performance

  1. Article
  2. Author
  3. Footnotes
Read the related Research Paper

Computer science has long had a solid foundation for evaluating the performance of algorithms. The asymptotic complexity of the time required by an algorithm is well defined and usually tractable, allowing for a clear evaluation of whether one algorithm provides a fundamental improvement over another. More nuanced and alternative evaluations, such as amortized and randomized analysis, provide additional insights into the fundamental advantages of different algorithms.

Unfortunately, the situation is even grimmer when evaluating the performance of a computer system, whether that system is a computer architecture, a compiler, a graphics processor, or a runtime system. Given a specific application, it is often fairly straightforward to execute the application on various systems and evaluate which system offers faster execution of that application on the provided input. Of course, once an application has been run on a particular input, one generally does not need to rerun it on that same input.

What programmers really want is some way to evaluate which system is likely to provide better performance on applications and data sets run in the future, thus making it the "better" system. Benchmarks also provide a way to examine how various system components behave and interact under load. Benchmarks should give repeatable results, even when rerun by an independent researcher or testing organization. A benchmark can be either a real or a synthetic application.

A synthetic application doesn’t compute anything useful but is designed to have performance characteristics that are representative of a range of real applications.

Benchmarks have an established history in computer science. The first widely used synthetic benchmark was the Whetstone benchmark written in the Algol60 programming language in 1972 and later translated into many other languages. Several benchmarks became well known and widely used in research or commercial settings, or both. Examples include the Livermore loops, the Dhrystone benchmark, the Linpack benchmark, and the Perfect Club benchmarks.

The design and selection of benchmarks, however, has traditionally been a matter of art and taste, as much science as engineering. The paper here by the DaCapo team is the best effort I’ve seen in providing a sound basis for selecting benchmarks. Historically, there has not been any standard methodology for deciding whether or not a benchmark did indeed provide a representative measure of a system’s performance within a particular domain. A more serious problem with benchmarks is that they age poorly. Benchmarks often do a reasonable job of evaluating the performance of applications at the time they are proposed. However, three things tend to make benchmarks grow less useful over time:

  • As machines and memories grow faster and larger, the sizes of application data sets grow as well. What was considered a reasonable problem size when a benchmark was proposed soon becomes a trivial example that fits in the on-processor cache.
  • The actual applications that people use systems for evolve over time, and benchmarks that were once representative become less so.
  • The weight attached to benchmark performance encourages developers of computer systems to optimize, tune, and tweak their systems in ways that improve their performance on the benchmarks but not more generally, making the benchmarks—again—less representative.

Almost every systems researcher and commercial software developer has a personal horror story about a poorly designed benchmark that was difficult to use, produced misleading results, or focused attention on the wrong problem for too long. One such story in my own experience involves the SPEC JVM98 db benchmark intended to represent a database benchmark. Several early papers on removing redundant or useless synchronization from Java programs focused on this benchmark, since removing such synchronization could produce a 20% to 30% speed improvement in the benchmark. However, closer examination revealed that more than 70% of the CPU time for this benchmark was spent in a badly written 20-line Shell sort; replacing the handwritten sort with a call to the built-in sort function doubled the execution speed, even without removing the useless synchronization.

The DaCapo research group offers what seems to be an exceptionally well engineered set of benchmarks for evaluating Java computer systems. This includes not only selecting the benchmark applications, but designing a substantial infrastructure to support the execution and evaluation of benchmark executions.

Far more important than the actual selection of the benchmarks and the engineering infrastructure, the DaCapo team has put together an excellent description of best practices for using benchmarks to evaluate Java system performance, as well as a principled approach for evaluating whether a suite of benchmark applications is, in fact, sufficiently diverse. This approach involves measuring a number of characteristics of each application, and then applying principal component analysis (PCA) to determine whether the applications do have fundamental differences, or if they basically measure the same aspects of a system. I hope the methodology described in the paper will allow the DaCapo benchmark suite, and others, to be evaluated so they can evolve in ways that make them useful as well as meaningful for more than just a moment in time.

Back to Top

Back to Top

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

The Latest from CACM

Shape the Future of Computing

ACM encourages its members to take a direct hand in shaping the future of the association. There are more ways than ever to get involved.

Get Involved

Communications of the ACM (CACM) is now a fully Open Access publication.

By opening CACM to the world, we hope to increase engagement among the broader computer science community and encourage non-members to discover the rich resources ACM has to offer.

Learn More