I recently submitted a paper to a conference, and when I got the reviews back, I noticed the review form had a question referees are required to answer, about whether the experiments were well carried out, with choices like "believable" and "not believable." The reviewers had a bit of trouble with that question, because my paper had no experiments; it was a paper about computational complexity of MapReduce algorithms. Two of the reviewers said the nonexistent experiments were not believable, which is wrong—you have to see something to disbelieve it. I did not write this Viewpoint to complain about being mistreated; in fact the paper was accepted. However, the existence of this question on the review form convinced me there is something seriously wrong with how computer-science research is being evaluated today.
There was a time when experimental evidence was not assumed necessary for publication. In fact, just the opposite was true: you needed to provide mathematical analysis of your conclusions, often in the form of proofs that your algorithms did what you claimed and "big-oh" analysis of their performance. For example, long ago we discovered algorithms to sort in O(n log n) time. These algorithms were analyzed formally, using an appropriate model for the main memory of a computer. We did not have to run the algorithms and plot their running time to know they were better than the obvious O(n²) algorithms. Experiments were only necessary to address questions such as how many elements must there be before Quicksort really beats Bubblesort. And of course when you ran out of main memory, the model no longer applied and you had an unpredicted increase in running time as you suddenly needed to move data to and from disk. Yet the basic O(n log n) idea still applies even when you use secondary storage.
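The crossover question raised above, how large n must be before an O(n log n) sort actually beats an O(n²) one on real hardware, is the kind of thing only an experiment can answer. A minimal sketch of such a measurement (the function names, input sizes, and repetition count are my own illustration, not from the article):

```python
import random
import timeit

def bubble_sort(a):
    # O(n^2): repeatedly swap adjacent out-of-order pairs;
    # stop early if a full pass makes no swaps.
    a = list(a)  # work on a copy
    n = len(a)
    for i in range(n):
        swapped = False
        for j in range(n - 1 - i):
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
                swapped = True
        if not swapped:
            break
    return a

def quicksort(a):
    # O(n log n) expected: partition around a random pivot.
    if len(a) <= 1:
        return list(a)
    pivot = random.choice(a)
    less = [x for x in a if x < pivot]
    equal = [x for x in a if x == pivot]
    greater = [x for x in a if x > pivot]
    return quicksort(less) + equal + quicksort(greater)

if __name__ == "__main__":
    # Time both sorts at a few sizes to see where the curves cross.
    for n in (10, 100, 1000):
        data = [random.randrange(n) for _ in range(n)]
        t_bubble = timeit.timeit(lambda: bubble_sort(data), number=5)
        t_quick = timeit.timeit(lambda: quicksort(data), number=5)
        print(f"n={n:5d}  bubble={t_bubble:.4f}s  quick={t_quick:.4f}s")
```

The crossover point depends on the machine, the constants hidden by the big-oh, and the input distribution, which is precisely why the asymptotic analysis, not the plot, is the transferable result.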
I somewhat agree, though for a slightly different reason. I live in Indonesia, where science is largely neglected by the government, educational institutions (including private ones), and companies. Such an environment makes it impossible for me to build the code translators (compiler, linker, interpreter) for a programming language I designed. If experiments are absolutely necessary, it is not my language that is judged, but my ability to obtain funding to write the code translators. At my age, I am not strong enough to build a compiler and linker on my own here in Indonesia. Yes, I agree with Ullman. Anyone who wants to learn about my programming language can see www.excellent-programming.com. Thanks.
Thanks for the article!
Experimental evidence is an interesting and difficult question in computer science. On the one hand, there are problem domains, such as SAT and SMT, where competition and benchmarking brought about huge advances in efficiency and hence applicability. On the other hand, the community is not as advanced at evaluating such data as other communities, like machine-learning groups, and often the underlying ideas become comparatively vague because we do not understand rigorously why some things work and others do not experimentally.
So I appreciate your central point: "We should not accept experiments as a substitute for a more careful and general analysis." But there are some caveats IMHO. Algorithms are enumerable, and meta-algorithms can explore algorithms automatically. What if we stumble upon some algorithm for some problem that experimentally works much better than the alternatives across a wide band, but there is no analytic reason for it except to say, more or less, it works as far as we can tell? There are lots of publications about, for example, understanding clause learning for SAT solvers, because we didn't, and to some extent still don't, clearly understand why it works. But that particular example was an invaluable result, even with limited understanding.
Another issue with experimental results is their diminishing practical relevance, due to increasingly dominant (and hard to repeat or predict) factors from the industrial application or calling context. Academic datasets tend to focus on a small band of industrial problems, a band that is by construction an outlier: the problems that cannot be solved well in industry. Good published experimental results often prove difficult at best to apply in practice because of the issues surrounding identifying, targeting, and integrating that band. It would help if data and problem sets from industry were more carefully constructed for easy adaptation or general use.