Rethinking the Systems Review Process

There has been much discussion on Twitter, Facebook, and in blogs about problems with the reviewing system for HCI systems papers (see James Landay’s blog post, "I give up on CHI/UIST" and the comment thread at http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html). Unlike papers on interaction methods or new input devices, systems are messy. You can’t evaluate a system using a clean little lab study, or show that it performs 2% better than the last approach. Systems often try to solve a novel problem for which there was no previous approach. The value of these systems might not be quantified until they are deployed in the field and evaluated with large numbers of users. Yet doing such an evaluation requires a significant amount of time and engineering work, particularly compared to non-systems papers. The result, observed in conferences like CHI and UIST, is that systems researchers find it very difficult to get papers accepted. Reviewers reject messy systems papers that don’t have a thorough evaluation of the system, or that don’t compare the system against previous systems (which were often designed to solve a different problem).

At CHI 2010 there was an ongoing discussion about how to fix this problem. Can we create a conference/publishing process that is fair to systems work? Plans are afoot to incorporate iterative reviewing into the systems paper review process for UIST, giving authors a chance to have a dialogue with reviewers and address their concerns before publication.

However, I think the first step is to define a set of reviewing criteria for HCI systems papers. If reviewers don’t agree on what makes a good systems paper, how can we encourage authors to meet a standard for publication?

Here’s my list:

A clear and convincing description of the problem being solved. Why isn’t current technology sufficient? How many users are affected? How much does this problem affect their lives?
How the system works, in enough detail for an independent researcher to build a similar system. Due to the complexities of system building, it is often impossible to specify all the parameters and heuristics being used within a 10-page paper limit. But the paper ought to present enough detail to enable another researcher to build a comparable, if not identical, system.
Alternative approaches. Why did you choose this particular approach? What other approaches could you have taken instead? What is the design space in which your system represents one point?
Evidence that the system solves the problem as presented. This does not have to be a user study. Describe situations where the system would be useful and how the system as implemented performs in those scenarios. If users have used the system, what did they think? Were they successful?
Barriers to use. What would prevent users from adopting the system, and how have these barriers been overcome?
Limitations of the system. Under what situations does it fail? How can users recover from these failures?

What do you think? Let’s discuss.

Readers’ comments

I’d like to second your first recommendation. I’ve reviewed a number of systems papers that do not provide a sufficiently compelling motivation or use case—why should I or anyone care about this system? Without this, the paper often represents technology in search of a problem.

Now, having read Don Norman’s provocative article "Technology First, Needs Last: The Research-Product Gulf" in the recent issue of interactions magazine, I have a keener appreciation for the possible contribution of some technologies in search of problems, but I still believe these are more the exception than the norm … and that without adequately making the case for the human-centered value(s) the systems will help realize, such papers are probably more suitable for other venues.
—Joseph McCarthy

One problem is that our field is moving so fast that we have to allow new ideas to cross-evolve with other ideas rapidly. If we require evaluations of every paper, then we don’t have the rapid turnaround required for innovations to cross paths with each other.

On the other hand, it seems wrong not to have some filter. Without filters, we might end up publishing ideas that seem interesting, but are actually quite useless.

I think you have a great start on a list of discussion points. One thing to keep in mind is that we should evaluate papers as a whole rather than in parts. I will often recommend accepting papers that are deficient in one area but very good in another.
—Ed Chi

I think it would be useful to some of us discussing your post if you could say more about the kinds of evidence you are referring to when you say "evidence that the system solves the problem" that are not user studies.

So, what are some examples of specific system problems ("clearly and convincingly presented"), and what would you consider appropriate evidence to show that your system solved the problem? Is it a set of usage scenarios that have been hard to address through previous designs and you show how a single interface design can address them completely? Is it a new, significantly more efficient algorithm or mechanism, for example, to handle complex preferences around group permissions, which would be useful to the builders of group systems to know about? (In the latter case, would evidence be performance testing, using logs of previous queries as data?) Is it a new approach for using skin-tapping as input?
—Dan Gruen

I am a strong proponent of rigorous gatekeeping at conferences simply because I need some help figuring out which things are worth following in my limited time. At the same time, I think it is important to keep in mind all the different ways a systems paper can be really valuable and worth seeing at a conference like CHI. A systems paper could be interesting thanks to a thorough analysis of its deployment and usage (evaluation). Or it could be interesting thanks to a well-argued discussion of why it was built a particular way (design). Or it might just demonstrate that a given interesting capability could be created at all. Or it could be a careful argument about why a certain system would be really useful, even if it hasn’t been built or evaluated yet (motivation/position paper). In the end, what I want are papers that stimulate thought and action. I’m not going to demand any particular levels of motivation, design, or evaluation; rather, I’m going to ask whether the whole is innovative enough. This is a highly subjective decision, which is why I greatly value wise program committees who can make such a judgment on my behalf.
—David Karger

Tessa Lau: "If reviewers don’t agree on what makes a good systems paper, how can we encourage authors to meet a standard for publication?"

I like your list, and think that the bullet points are representative of good evaluation criteria for systems papers across computer science.

The main sticking point is, as I see it, "Evidence that the system solves the problem as presented." In some other areas of empirical computer science, we have repositories of test problems, suites of agreed-upon performance metrics, testing harnesses for software, and so forth. Usability testing is seen as the gold standard in HCI, though, and it’s much harder to leverage such tools to make evaluation by user testing efficient. The effort devoted to user testing of a new system can sometimes rival the effort of building the system in the first place. Okay, I might be exaggerating a bit, but still….

If we could agree on some reasonable substitutes, that would be good. Some of the papers I’ve worked on have included GOMS models of performance, for example, but not everyone buys into a purely analytical approach. Sometimes, even worse, what I’d like to convey in a paper is a conceptual shift, a different way of thinking about some kind of problem, and that’s even harder to evaluate than pure performance.
—Robert St. Amant

I’ve suggested it before and will suggest it again. An easy start could be to make accepted interactivity demos "worth" as much as a full paper at CHI—same presentation length, and the associated paper (maybe six pages long) needs to be of the same "archival" status in the ACM Digital Library.

This could show a true commitment to systems research.
—Florian Mueller

Tessa Lau responds

Thank you all for the interesting discussion. My goal was to initiate a discussion within our community, not to have all the answers.

Along those lines, Dan, the question you raise about what constitutes "appropriate evidence" is one that I’ll turn back to the community for a collective answer.

For what it’s worth, though, I don’t think of your examples as "systems." The first is usage scenarios or design. The second is an algorithm. The third is an interaction method. Each of those is fairly self-contained and possible to evaluate using a fairly closed study.

What I meant by "systems" is an implemented prototype that gives people access to new functionality that did not exist before. Examples of "systems" include CoScripter, Many Eyes, Landay’s DENIM and SILK, Gajos’s SUPPLE, and Andrew Ko’s Whyline. How can we show that each of these systems is "innovative enough" (in David’s words) to merit publication?

Footnotes

DOI: http://doi.acm.org/10.1145/1839676.1839680

November 2010 Issue
