
Communications of the ACM


Experiments as Research Validation: Have We Gone Too Far?

[Illustration credit: Alicia Kubista / Andrij Borys Associates]

I recently submitted a paper to a conference, and when I got the reviews back, I noticed the review form had a question referees are required to answer about whether the experiments were well carried out, with choices like "believable" and "not believable." The reviewers had a bit of trouble with that question, because my paper had no experiments; it was a paper about the computational complexity of MapReduce algorithms. Two of the reviewers said the nonexistent experiments were not believable, which is wrong—you have to see something to disbelieve it. I did not write this Viewpoint to complain about being mistreated; in fact, the paper was accepted. However, the existence of this question on the review form convinced me there is something seriously wrong with how computer-science research is being evaluated today.

There was a time when experimental evidence was not assumed necessary for publication. In fact, just the opposite was true: you needed to provide mathematical analysis of your conclusions, often in the form of proofs that your algorithms did what you claimed and "big-oh" analysis of their performance. For example, long ago we discovered algorithms to sort in O(n log n) time. These algorithms were analyzed formally, using an appropriate model for the main memory of a computer. We did not have to run the algorithms and plot their running time to know they were better than the obvious O(n²) algorithms. Experiments were only necessary to address questions such as how many elements must there be before Quicksort really beats Bubblesort. And of course when you ran out of main memory, the model no longer applied and you had an unpredicted increase in running time as you suddenly needed to move data to and from disk. Yet the basic O(n log n) idea still applies even when you use secondary storage.
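The crossover question Ullman concedes to experiment (how large n must be before Quicksort actually beats Bubblesort on a real machine) is easy to probe empirically, even though the asymptotic verdict needs no measurement at all. A minimal illustrative sketch, not from the article, using textbook implementations of both algorithms:

```python
import random
import timeit

def bubble_sort(a):
    """Classic O(n^2) bubble sort, operating on a copy of the input."""
    a = list(a)
    n = len(a)
    for i in range(n):
        swapped = False
        for j in range(n - 1 - i):
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
                swapped = True
        if not swapped:  # already sorted; stop early
            break
    return a

def quicksort(a):
    """Simple O(n log n) average-case quicksort (not in place)."""
    if len(a) <= 1:
        return list(a)
    pivot = a[len(a) // 2]
    less = [x for x in a if x < pivot]
    equal = [x for x in a if x == pivot]
    greater = [x for x in a if x > pivot]
    return quicksort(less) + equal + quicksort(greater)

if __name__ == "__main__":
    # Time both sorts on growing random inputs; the O(n^2) curve
    # pulls away from the O(n log n) one as n grows.
    for n in (200, 1000, 2000):
        data = [random.random() for _ in range(n)]
        t_bubble = timeit.timeit(lambda: bubble_sort(data), number=3)
        t_quick = timeit.timeit(lambda: quicksort(data), number=3)
        print(f"n={n:5d}  bubble={t_bubble:.4f}s  quick={t_quick:.4f}s")
```

This is exactly the limited role Ullman grants experiments here: the measurement locates the constant-factor crossover on one machine, while the asymptotic comparison was settled by analysis before any code ran.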


Bernaridho Hutabarat

I somewhat agree, for a slightly different reason. I live in Indonesia, where science is largely neglected by the government, educational institutions (including private ones), and companies. Such an environment prevents the making of code translators (compiler, linker, interpreter) for a programming language I wrote. If experiments are absolutely necessary, it is not my language that is judged, but my ability to get funding to write the code translators. At my age, I am not strong enough to create a compiler and linker on my own here in Indonesia. Yes, I agree with Ullman. For anyone who wants to look at my programming language, see Thanks.

Scott Cotton

Thanks for the article!

Experimental evidence is an interesting and difficult question in computer science. On the one hand, there are problem domains, such as SAT and SMT, where competition and benchmarking brought about huge advances in efficiency and hence applicability. On the other hand, the community is not as advanced at evaluating such data as other communities, such as machine learning groups, and oftentimes the underlying ideas become comparatively vague because we do not rigorously understand why some things work experimentally and others do not.

So I appreciate your central point: "We should not accept experiments as a substitute for a more careful and general analysis." But there are some caveats, IMHO. Algorithms are enumerable, and meta-algorithms can explore the space of algorithms automatically. What if we stumble upon an algorithm for some problem that experimentally works much better than the alternatives across a wide band of inputs, but for which there is no analytic explanation beyond saying, more or less, that it works as far as we can tell? There are many publications about, for example, understanding clause learning in SAT solvers, precisely because we did not, and to some extent still do not, clearly understand why it works. Yet that particular example was an invaluable result, even with limited understanding.

Another issue with experimental results is diminishing practical relevance, due to increasingly dominant (and hard-to-repeat or hard-to-predict) factors from the industrial application or calling context. Academic data sets tend to focus on a small band of industrial problems, and that band is by construction an outlier: the problems that cannot be solved well in industry. Good published experimental results often turn out to be difficult at best to reproduce in practice, because of the issues surrounding identification, targeting, and integration for the band in question. It would help here if data and problem sets from industry were more carefully constructed for easy adaptation or general use.


