In 2014, the organizers of the Conference on Neural Information Processing Systems (NeurIPS, then still called NIPS) conducted an interesting experiment.1 They split their program committee (PC) in two and let each half independently review a bit more than half of the submissions. That way, 10% of all submissions (166 papers) were reviewed by two independent PCs. The target acceptance rate per PC was 23%. The result of the experiment was that among these 166 papers, the sets of accepted papers from the two PCs overlapped by only 43%. That is, more than half of the papers accepted by one PC were rejected by the other. This led to a passionate flare-up of the old debate about how effective or random peer review really is and what we should do about it.
The experiment left open a number of interesting questions, which are taken up one by one below.
To answer these questions, in 2018 I conducted an experiment similar to the NIPS experiment, but with richer data and a deeper analysis. The target was the 26th edition of the "European Symposium on Algorithms" (ESA), a venerable algorithms conference. ESA receives around 300 submissions every year and has two tracks: the more theoretical Track A and the more practical Track B. For the experiment, I picked Track B, which received 51 submissions that year. Two independent PCs were set up, each with 12 members and tasked with an acceptance rate of 24%. A total of 313 reviews were produced. These numbers are smaller than for the NIPS experiment, but still large enough to yield meaningful results. Importantly, they were small enough to allow for a time-intensive deeper analysis.
Both PCs followed the same standard reviewing process, which was agreed on and laid out in advance as clearly as possible: independent reviews with scores (Phase 1), followed by per-paper discussions among the reviewers (Phase 2) and a final overall discussion (Phase 3).
PC members were explicitly asked, and repeatedly reminded, to also update the score of a review whenever they changed anything in it. This allowed a quantitative analysis of the various phases of the reviewing process. For more details on the setup, the results, the data, and a script to evaluate and visualize the data in various ways, see the website of the experiment.2
Let us first get a quick overview of the results and then, in Part 3, discuss their implications.
What is the overlap in the set of accepted papers? In the NIPS experiment, the overlap was 43%. In the ESA experiment, the overlap was 58%. The acceptance rates in the two experiments were almost the same. To put these figures into perspective: if the reviewing algorithm were deterministic, the overlap would be 100%. If a random subset of papers were accepted by each PC, the expected overlap would be 24%. If 10% / 20% / 20% / 50% of the papers were accepted with probabilities 0.8 / 0.6 / 0.1 / 0.0, the expected overlap would be around 60%. The overlap is not the best number to look at, since it depends rather heavily on the number of accepted papers; see below.
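These expected-overlap figures can be checked with a small Monte Carlo simulation. The following is a minimal sketch, not code from the experiment; it assumes each PC accepts paper i independently with probability p_i and measures the expected overlap as E[|A ∩ B|] / E[|A|], using the acceptance rate and the four-bucket probabilities quoted above.

```python
import random

def expected_overlap(accept_probs, trials=20000, seed=42):
    """Monte Carlo estimate of E[|A ∩ B|] / E[|A|] when two PCs
    independently accept paper i with probability accept_probs[i]."""
    rng = random.Random(seed)
    inter, size_a = 0, 0
    for _ in range(trials):
        a = [rng.random() < p for p in accept_probs]
        b = [rng.random() < p for p in accept_probs]
        inter += sum(x and y for x, y in zip(a, b))
        size_a += sum(a)
    return inter / size_a

# Uniformly random acceptance of each paper at rate 24%.
uniform = [0.24] * 100
# The four-bucket model from the text: 10% / 20% / 20% / 50% of
# the papers accepted with probability 0.8 / 0.6 / 0.1 / 0.0.
buckets = [0.8] * 10 + [0.6] * 20 + [0.1] * 20 + [0.0] * 50

print(round(expected_overlap(uniform), 2))  # ≈ 0.24
print(round(expected_overlap(buckets), 2))  # ≈ 0.63, i.e., around 60%
```

The bucket model's exact value is 0.138 / 0.22 ≈ 63%, which is the "around 60%" figure in the text.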
How many clear accepts were there? The score range for each review was +2, +1, 0, -1, -2. The use of 0 was discouraged, and it was communicated beforehand that only papers with a +2 from at least one reviewer would be considered for acceptance. Papers that received only +2 scores offered no incentive for discussion and were accepted right away. There was little agreement between the two PCs concerning such "clear accepts": out of nine papers that were clear accepts in one PC, four were rejected by the other PC, and only two were also clear accepts in the other PC (that is, 4% of all submissions). If papers that are "clear accepts" exist at all, they are very few.
How many clear rejects were there? A paper was counted as a clear reject if one reviewer gave a -2 and no reviewer gave a +1 or +2. There were 20 such clear rejects in PC1 and 17 in PC2. None of these papers were even considered for acceptance in the other PC. At least one-third of the submissions were thus clear rejects in the sense that it is unlikely that any other PC would have accepted any of them. There was only a single paper with a score difference of 3 or more between the two PCs; it was a clear accept in one PC (all reviewers gave it a +2, praising the strong results), while the other PC was very critical of its meaningfulness.
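The two definitions above are simple predicates on a paper's score list. A minimal sketch (the example score lists are hypothetical, not data from the experiment):

```python
def is_clear_accept(scores):
    """Clear accept: every reviewer gave a +2 (accepted right away)."""
    return all(s == 2 for s in scores)

def is_clear_reject(scores):
    """Clear reject: at least one reviewer gave a -2
    and no reviewer gave a +1 or +2."""
    return any(s == -2 for s in scores) and not any(s >= 1 for s in scores)

print(is_clear_accept([2, 2, 2]))    # True
print(is_clear_reject([-2, 0, -1]))  # True
print(is_clear_reject([-2, 0, 1]))   # False: one reviewer spoke up
```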
Is there a natural cutoff to determine the set of accepted papers? If both PCs accepted only their best 10%, the overlap in the set of accepted papers would have been 40% (corresponding to the 4% "clear accepts"). For acceptance rates between 14% and 40%, the overlap varied rather erratically between 54% and 70%. Increasing the rate of accepted papers beyond that showed a steady increase in the overlap (due to the "clear rejects" at the bottom). There is no natural cutoff short of the "clear rejects."
How effective were the various reviewing phases? We have seen that the overlap for a fixed acceptance rate is a rather unreliable measure. I therefore also compared the rankings of the two PCs among those papers that were at least considered for acceptance. Ranking similarity was computed via the Kendall tau correlation (1 for identical rankings, 0 for uncorrelated rankings, -1 if one ranking is the reverse of the other). Again, see the website for details.2 This similarity was 46% after Phase 1, 63% after Phase 2, and 58% after Phase 3, where the increase after Phase 1 is statistically significant (p = 0.02). This suggests that the per-paper discussions play an important role in making paper scores more objective, while any further discussions add little or nothing in that respect. This accords well with the experience that PC members are willing to adapt their initial scores once, after reading the reviews from the other PC members. After that, their opinion is more or less fixed.
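For rankings without ties, Kendall tau can be computed directly from pair agreements: count the paper pairs that the two rankings order the same way versus opposite ways. A minimal sketch, with hypothetical toy rankings rather than the experiment's data:

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall tau correlation between two rankings of the same papers.
    rank_a[p] and rank_b[p] give the position of paper p in each ranking."""
    concordant = discordant = 0
    for p, q in combinations(list(rank_a), 2):
        # Same sign: both rankings order the pair the same way.
        agree = (rank_a[p] - rank_a[q]) * (rank_b[p] - rank_b[q])
        if agree > 0:
            concordant += 1
        elif agree < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Toy rankings of five papers by two hypothetical PCs.
pc1 = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}
pc2 = {"A": 2, "B": 1, "C": 3, "D": 5, "E": 4}
print(kendall_tau(pc1, pc1))  # 1.0 (identical rankings)
print(kendall_tau(pc1, pc2))  # 0.6 (two adjacent swaps: 8 - 2 of 10 pairs)
```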
In summary, the PCs did a good job of separating the wheat from the chaff. There appeared to be at least a partial order in the wheat, but there is no natural cutoff. The fewer papers are accepted, the more random the selection. The initial per-paper discussions helped to make the review scores more objective; any further discussions had no measurable effect.
The above results are probably an upper bound for the objectivity of the reviewing process at a computer science conference, for the following reasons: the conference was comparatively small, the PC had a single tier, PC members were responsive, the guidelines were specified in detail, and the community is relatively homogeneous. Larger conferences, two-tier PCs, unresponsive PC members, underspecified guidelines, and more diverse communities most likely all further increase the randomness in the reviewing process.
I see four main conclusions from this experiment:
First, we need more experiments of this kind. We have the NIPS experiment and now the ESA experiment.3 They give a first impression, but important questions are still open. For example, it would be very valuable to redo the experiment above for a larger and more heterogeneous conference. One argument I often hear is that it is too much effort, in particular with respect to the additional number of reviewers needed. I don't buy this argument. There are so many conferences in computer science, many of them very large. If we pick one of these conferences from time to time to run an experiment, the additional load is negligible in the big picture. Another argument I often hear is that improving peer review is an unsolvable problem. This always leaves me baffled. In their respective fields, researchers love hard problems and sometimes work their whole lives trying to make some progress. But when it comes to the reviewing process, the status quo is as good as it gets?
Second, we need to fully accept the results of these experiments. The experiments so far provide strong hints that there is a significant signal in reviews, but also a significant amount of noise and randomness. Yet, to this day, the myth of a natural cutoff for determining the set of accepted papers prevails. It is usually acknowledged that there is a gray zone, but not that this "gray zone" might encompass almost all of the papers that are not clear rejects. PCs can spend a lot of time debating papers, blissfully unaware that another PC in a parallel universe gave these papers little attention because it accepted or, more likely, rejected them early on in the process. From my own PC experience, I conjecture that there are at least two biases at work here. One is that humans tend to be unaware of their biases and feel much more objective than they actually are. Another is the feeling that if a group makes a strong effort, then the result must be meaningful and fair. The other extreme is fatalism: the feeling that the whole process is random anyway, so why bother providing a proper review? Both of these extremes are wrong, and this is still not widely understood or acted upon.
Third, how do we incorporate these results to improve the reviewing process? Let us assume that the results from the NIPS and the ESA experiments are not anomalies; then there are some pretty straightforward ways to incorporate them into the current reviewing process. For example, discussion of papers in the alleged "gray zone" could be dropped. Instead, this energy could be used to communicate and implement the semantics of the available scores as clearly as possible in advance. Average scores could then be converted to a probability distribution for at least a subset of the papers, namely those for which at least one, but not all, reviewers spoke up. Papers from this "extended gray zone" could then be accepted with a probability proportional to their score. This would not make the process any more random, but definitely less biased. To reduce not only bias but also randomness, a simple and effective measure would be to accept more papers. Digital publication no longer imposes a limit on the number of accepted papers, and many conferences have already moved away from the "one full talk per paper" principle.
Fourth, all of this knowledge has to be preserved from one PC to the next. We already have a treasure trove of knowledge about the peer review process, but only a fraction of it is considered or implemented at any particular conference. The main reason I see is the typical way in which administrative jobs are handled in the academic world. Jobs rotate (often rather quickly), there is little incentive to excel, there is almost no quality control (who reviews the reviewers?), and participation in the peer review process is another obligation on top of an already more-than-full-time job. You do get status points for some administrative jobs, but not for doing them particularly well or for investing an outstanding amount of time or energy. Most of us are inherently self-motivated and incredibly perseverant when it comes to our science; indeed, that is why most of us became scientists in the first place. Administrative tasks are not what we signed up for, not what we were trained for, and not what we were selected for. We understand intellectually how important they are, but we do not really treat them that way.
My bottom line: The reputation of the peer review process is tarnished. Let us work on this with the same love and attention we give to our favorite research problems. Let us do more experiments to gain insights that help us make the process more fair and regain some trust. And let us create powerful incentives, so that whatever we already know is good is actually implemented and carried over from one PC to the next.
1 https://cacm.acm.org/blogs/blog-cacm/181996-the-nips-experiment provides a short description of the NIPS experiment and various links to further analyses and discussions.
2 https://github.com/ad-freiburg/esa2018-experiment is the website of the ESA experiment, with the data and a Python script to evaluate and visualize it in various ways.
3 There are other experiments, like the single-blind vs. double-blind experiment at WSDM'17, which investigated a particular aspect of the reviewing process: https://arxiv.org/abs/1702.00502
Hannah Bast is a professor of computer science at the University of Freiburg, Germany. Before that, she worked at Google, developing the public transit routing algorithm for Google Maps. Right after the ESA experiment, she became Dean of the Faculty of Engineering in Freiburg and a member of the Enquete Commission for Artificial Intelligence of the German parliament (Bundestag). That is why it took her two years to write this blog post.
There are lots of great insights here. Thanks for taking the time to make this more widely available.
That the threshold for clear rejects is so well defined suggests you could perhaps relax the definition of clear reject to reduce the number of papers in the gray area. For example, if you removed the requirement for at least one score of -2 and simply defined it as no positive scores, how many more papers would be clear rejects, and what would the agreement be?
Dear Allan, thank you for that question. There were 13 papers of the kind you describe (no strong reject, but also no reviewer spoke up for them): 8 in one PC and 5 in the other. The agreement on these papers was not good: one of them was a "clear accept" in the other PC, and three were considered for acceptance in the other PC. And none of these 13 papers were of the same kind in the other PC. This is interesting, because it suggests that a -2 score (strong reject) is not given lightly and, in combination with no other reviewer speaking up, is a very strong signal for rejection. Without a -2 score, the situation is much less clear.
Feel free to play around with the data yourself on https://github.com/ad-freiburg/esa2018-experiment . There is also a Python script that can do all sorts of analyses and visualizations and the README lists a few example invocations.
Very interesting and insightful! And I immediately wonder: now that you have the two-PC data, would it be worthwhile to look at the accepted papers in, say, five years' time and see whether there is any correlation between the number of citations received and how the papers were judged by the independent PCs?
These are fascinating results. On first reading they look horrifying, but on reflection, perhaps they are quite reassuring: reviewers do reject bad papers fairly consistently; conferences accept a random selection from the acceptable ones.
I wonder also if this analysis misses a major purpose of reviews: to help the authors to improve their papers. My experience is that phases 2 and 3 are often discussing how best to represent the reviewers' conclusions to the authors; omitting those phases would perhaps lose some of the value of the review process.
The result of the current review system, according to this research, is that the authors of an acceptable submission get, somewhat randomly, either the kudos of acceptance or the bonus of several hours' worth of expert feedback to improve the paper. On the figures here, papers will average several rejects, so once papers are finally accepted they will typically have anonymous contributions from many other industry experts.
So if we want reviews to be a process for objective quality measurement (if such a thing were possible), we shall be disappointed; if we want them as a process for quality improvement, perhaps the system is effective as it is?
Dear Jack, that would be very worthwhile, and as it happens, Corinna Cortes and Neil Lawrence have now done just that for the NeurIPS 2014 experiment: https://arxiv.org/pdf/2109.09774v1.pdf . Their main result: "for accepted papers, there is no correlation between quality scores and impact of the paper as measured as a function of citation count."
Dear Charles, I agree, and I would agree even more if the review scores were free from bias. But I don't think they are. I think there are systematic biases for and against certain kinds of papers, certain kinds of topics, etc. That is why I personally would be in favor of injecting a deliberate element of randomness into the selection process. As I wrote above, this "would not make the process any more random, but definitely less biased." Interestingly, the first reaction of most colleagues to this suggestion is one of strong refusal, which I interpret as coming from a combination of overconfidence in the objectivity of the review scores and a feeling that the effort one personally invests in reviewing is being devalued. Yes, there is a significant and valuable signal in the review scores, but it is not that strong.