Sign In

Communications of the ACM


Experiments as Research Validation: Have We Gone Too Far?

View as: Print Mobile App ACM Digital Library Full Text (PDF) In the Digital Edition Share: Send by email Share on reddit Share on StumbleUpon Share on Hacker News Share on Tweeter Share on Facebook
Experiments as Research Validation, illustration

Credit: Alicia Kubista / Andrij Borys Associates

I recently submitted a paper to a conference, and when I got the reviews back, I noticed the review form had a question referees are required to answer, about whether the experiments were well carried out, with choices like "believable" and "not believable." The reviewers had a bit of trouble with that question, because my paper had no experiments; it was a paper about computational complexity of MapReduce algorithms. Two of the reviewers said the nonexistent experiments were not believable, which is wrongyou have to see something to disbelieve it. I did not write this Viewpoint to complain about being mistreated; in fact the paper was accepted. However, the existence of this question on the review form convinced me there is something seriously wrong with how computer-science research is being evaluated today.a

There was a time when experimental evidence was not assumed necessary for publication. In fact, just the opposite was true: you needed to provide mathematical analysis of your conclusions, often in the form of proofs that your algorithms did what you claimed and "big-oh" analysis of their performance. For example, long ago we discovered algorithms to sort in O(n log n) time. These algorithms were analyzed formally, using an appropriate model for the main memory of a computer. We did not have to run the algorithms and plot their running time to know they were better than the obvious O(n2) algorithms. Experiments were only necessary to address questions such as how many elements must there be before Quicksort really beats Bubblesort. And of course when you ran out of main memory, the model no longer applied and you had an unpredicted increase in running time as you suddenly needed to move data to and from disk. Yet the basic O(n log n) idea still applies even when you use secondary storage.

Even in a far less fundamental setting than sorting algorithms, we expected to see a mathematical analysis rather than experiments. For example, if you had an idea for a better disk-scheduling algorithm, you would define a dozen or more parameters that represented the physical setup: block size, number of tracks, speed of head movement, and so on. You would develop the formula for performance of your algorithm in terms of these parameters and compare that formula with similar formulas for alternative algorithms.

So how did we get from a world where ideas were evaluated by mathematical analysis to a world where reviewers expect to see experimental results in absolutely every paper? In the time when sorting algorithms were a central issue for computer scientistsin the early to mid-1960sif you were at a major corporation or school, you would have access to one computer for the whole institution, probably an IBM 7090. If you wanted to do the experiments to show an O(n log n) algorithm ran faster than an O(n2) algorithm, you would have to wait two hours for each of the many runs needed to develop the evidence. And your institution probably had much better things to do with the one computer they owned than to let you fool around demonstrating again what you had already demonstrated mathematically.

By the 1980s, however, it was possible for each department, or even each research group to have its own time-shared computer, and by the end of the 1980s, each researcher had one. Experiments could be performed almost instantaneously. This change in what was feasible led many subfields of computer science to start writing papers that included experiments. I am going to address principally research papers in the database community, because it is that area with which I am most familiar. However, there are many fields of CS in which a similar evolution in style has occurred, even if the details differ from field to field. To get an idea of the scope of the transformation, I decided to look at three SIGMOD conference proceedings, from 1970, 1990, and 2010. I examined the published papers to find the fraction that had some sort of experimental evidence for their work. In 1970, the conference was actually called SIGFIDET (File Description and Translation). It had 24 papers, only about half of which seemed to be research papers per se. The others looked more like committee reports. However, only one of the 24 had anything that looked like an experiment. Interestingly, the exception was the famous paper by Bayer and McCreight on B-trees, so draw from that whatever conclusion you like. In 1990, of the 37 research papers, 14 had experiments. By 2010, the SIGMOD conference had grown, so I got lazy and sampled every third research paper. Of the 25 papers I looked at, every one had some form of experiment. I also sampled the industrial papers. Only two out of six had experiments, possibly because companies are often reluctant to release performance measures for their own systems.

It (reliance on experiments) encourages the writing of papers that really should not be written at all, because they are so incremental and specialized that their use in practice is unlikely.

There are many reasons to view the addition of experimental evidence as an improved research methodology. The Viewpoint by Michael Mitzenmacher in this issue of Communications addresses arguments in favor of experimental evidence in the context of theoretical papers. In addition, experimental evidence makes sense in many situations, such as the following:

  • Sometimes, the problem being solved is much less straightforward to state than "sort n numbers." If so, mathematical expression of an algorithm's performance may involve many different parameters or may hardly be parameterizable at all. Experiments can help quantify performance better than a complex formula, even if the experiments describe performance on only a tiny subset of the cases covered by the formula.
  • Other times, the worst-case analysis of an algorithm does not represent the performance on average cases or typical cases. Experiments showing results on selected representative cases can be more informative than worst-case analysis. It may even be the case that Algorithm A beats Algorithm B in the worst case, but loses on typical cases.
  • For some problems, there are well thought out benchmarks the community believes represent typical instances of the problem. Experiments that report results on these benchmarks are accepted as realistic representations of performance.
  • The value of an algorithm may depend not only on the big-oh running time, but on the constant factor; the latter usually can be determined only by experiment.
  • There are appropriate concerns that an analysis may omit critical parameters of a problem. By running an experiment, we are at least assured the ideas are useful in at least one situation.

However, now it appears the pendulum has swung way too far, to the point where experiments are considered the only way to validate ideas. It is time to restore the balance. Experiments should be used when appropriate, such as in the cases just mentioned. However, ideas that require analysis rather than experiments should be handled appropriately, rather than "justified" by inappropriate and meaningless experiments.

Many good ideas stand on their own, without the need for experiments. Consider the two database ideas that have won the Turing award:b the relational model (Ted Codd) and two-phase locking (I know Jim Gray won for many contributions, but this one is, I think, the centerpiece). Neither paper was about experiments. Should we have rejected a paper that said "let's organize data into relations" just because there was no experiment to prove its queries were executable more efficiently? Would we want to reject a paper on two-phase locking because it did not measure experimentally the space required by locking tables?

The argument against reliance on experiments as the only way to justify ideas is not just that some papers do not need experiments. There is real harm being done by the excessive focus on experiments as validation. First, experiments are conducted on particular data or under particular assumptions. They rarely tell you what happens in other situations. In contrast, a proper formal analysis of the resources required by an algorithm applies generally. An interesting case in point was suggested by one of the referees, who points out that while cache-aware algorithms (for example, for join) made sense when they were published, today's hardware is smart enough that these algorithms are not needed, and experiments would demonstrate that fact. While true, we cannot be sure there will not be another evolution of hardware that makes cache-aware algorithms again important, perhaps at another level of the memory hierarchy. By writing a paper with the formal analysis, the ideas are relevant in many different environments, including those not contemplated when the ideas were first formulated.

We should not accept experiments as a substitute for a more careful and general analysis.

But there is a more damaging effect of taking experimental results too seriously. It encourages the writing of papers that really should not be written at all, because they are so incremental and specialized that their use in practice is unlikely. There are many areas of research where the nature of the data can vary greatly, and performance of different algorithms will vary with the data. Think of multidimensional indexes, or clustering, for example. In research areas of this kind, it is very easy to find a special form of data and invent an algorithm that works well in this narrow special case. You then run your experiments on data for which your algorithm is best suited and compare it with others, whichsurprise, surprisedo not work as well. But were you to run your algorithm on the common cases, or random cases, you would do less well or not well at all. It does not matter; you can still publish yet another paper about yet another algorithm for doing this or that.

Suppose we expected that a paper describing algorithms should measure their performance in some manner. One possible manner is experimentation, which should be a fair comparison with the best-known algorithms, since simply measuring the running time of your own algorithm says nothing about how good it is. A second way to justify the importance of your ideas is with a big-oh analysis of the worst- or average-case running time. A third way is to offer a proof of optimality of your algorithm, at least according to one relevant measure such as time, space, communication, or whatever measure you can justify as the "cost" of executing the algorithm.

It is time to recenter the pendulum. So I propose that, as reviewers of submissions to conferences or journals, we should start by asking whether the value of the proposed ideas have been analyzed for the general case or at least some well-defined realistic subset of all possible cases. We should not accept experiments as a substitute for a more careful and general analysis, unless there really is no way to parameterize the input space suitably. And we should not accept experiments on contrived, specialized data under almost any circumstances. As authors, we should stop thinking of experiments as a substitute for analysis and deep understanding of why our algorithms work better than others that have been proposed. A little self-censorship might be a benefit to the community as well. Not every algorithm that works in some narrow window has to be published. And of course questions for reviewers about experiments should be replaced by a broader question about whether the value of the work has been justified in some appropriate manner.

Back to Top


Jeffrey D. Ullman ( is the Stanford W. Ascherman Professor of Computer Science (Emeritus) at Stanford University, Stanford, CA.

Back to Top


a. Prior to publication of this Viewpoint, I learned this item on the review form had been changed, and now includes other validation as an alternative to experiments. That change is exactly what I had hoped for when writing this Viewpoint.

b. This Viewpoint was written before the 2014 Turing Award to Mike Stonebraker was announced. However, the same arguments could be applied to most of the achievements in the database arena upon which that award was based.

Copyright held by author.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2015 ACM, Inc.


Bernaridho Hutabarat

I somehow agree, with slightly different reason. I live in Indonesia where science is largely neglected by government, educational institution (including private ones) and companies. Such environments prohibit the making of code-translators (compiler, linker, interpreter) for a programming-language I wrote. If experiments are absolutely necessary, it is not my language that is judged, but my ability to get fund to write the code-translators. With my age, I'm not strong enough to solely create compiler and linker here in Indonesia. Yes, I agree with Ullman. For any of you who wants to look about my programming-language, see Thanks.

Scott Cotton

Thanks for the article!

Experimental evidence is an interesting and difficult question in computer science. On the one hand, there are problem domains, such as SAT and SMT, where competition and benchmarking brought about huge advances in efficiency and hence applicability. On the other hand, the community is not so advanced at evaluating such data compared to other communities like machine learning groups, and often times the underlying ideas become comparitively vague because we don't understand rigorously why some things work and others do not experimentally.

So I appreciate your central point: "We should not accept experiments as a substitute for a more careful and general analysis." But there are some caveats IMHO. Algorithms are enumerable and meta-algorithms can explore algortithms automatically. What if we stumble upon some algorithm for some problem which experimentally works in a wide band much better than alternatives but there is no analytic reason for it except to say, more or less, it works as far as we can tell? There are lots of publications about, for example, understanding clause learning for SAT solvers because we didn't, and to some extent, still don't understand clearly why it works. But that particular example was an invaluable result, even with limited understanding.

Another issue with experimental results is diminishing practical relevance due to increasingly dominating (and hard to repeat or predict) factors from industrial application/calling context. Academic datasets tend to focus on a small band of interest of industrial problems, where that band is by construction an outlier -- the problems that can't be solved well in industry. Often good published experimental results end up being difficult at best in practice because of the issues surrounding identification, targeting, and integration for the band in question. It would help here if data/problem sets from industry were more carefully constructed for easy adaptation or general use.


Displaying all 2 comments