In 2012, when reading a paper from a recent premier computer security conference, we came to believe there was a clever way to defeat the analyses asserted in the paper, and, to show this, we wrote to the authors (faculty and graduate students in a highly ranked U.S. computer science department) asking for access to their prototype system. We received no response. We thus decided to reimplement the algorithms in the paper but soon encountered obstacles, including a variable used but not defined; a function defined but never used; and a mathematical formula that did not typecheck. We asked the authors for clarification and received a single response: "I unfortunately have few recollections of the work ... "
We next made a formal request to the university for the source code under the broad Open Records Act (ORA) of the authors' home state. The university's legal department responded with: "We have been unable to locate a confirmed instance of [system's] source code on any [university] system."
Expecting a research project of this magnitude to be developed under source code control and properly backed up, we made a second ORA request, this time for the email messages among the authors, hoping to trace the whereabouts of the source code. The legal department first responded with: "... the records will not be produced pursuant to [ORA sub-clause]." When we pointed out reasons why this clause does not apply, the university relented but demanded $2,263.66 " ... to search for, retrieve, redact and produce such records." We declined the offer.
We instead made a Freedom of Information Act request to the National Science Foundation for the funded grant proposals that supported the research. In one, the principal investigator wrote, "We will also make our data and software available to the research community when appropriate." In the end, we concluded, without assistance from the authors to interpret the paper and with the university obstructing our quest for the source code of the prototype system, we would not be able to show the analyses put forth could be defeated.
Reproducibility, repeatability, benefaction. There are two main reasons to share research artifacts: repeatability and benefaction.2,10,16,20 We say research is repeatable if we can re-run the researchers' experiment using the same method in the same environment and obtain the same results.19 Sharing for repeatability is essential to ensure colleagues and reviewers can evaluate our results based on accurate and complete evidence. Sharing for benefaction allows colleagues to build on our results, better advancing scientific progress by avoiding needless replication of work.
Unlike repeatability, reproducibility does not necessarily require access to the original research artifacts. Rather, it is the independent confirmation of a scientific hypothesis,19 done post-publication, by collecting different properties from different experiments run on different benchmarks, and using these properties to verify the claims made in the paper. Repeatability and reproducibility are cornerstones of the scientific process, necessary for avoiding dissemination of flawed results.
In light of our discouraging experiences with sharing research artifacts, we embarked on a study to examine the extent to which computer systems researchers share their code and data, reporting the results here. We also make recommendations as to how to improve such sharing, for the good of both repeatability and benefaction.
The study. Several hurdles must be cleared to replicate computer systems research. Correct versions of source code, input data, operating systems, compilers, and libraries must be available, and the code itself must build and run to completion. Moreover, if the research requires accurate measurements of resource consumption, the hardware platform must be replicated. Here, we use the most liberal definition of repeatability: Do the authors make the source code used to create the results in their article available, and will it build? We will call this "weak repeatability."
Our study examined 601 papers from ACM conferences and journals, attempting to locate any source code that backed up published results. We examined the paper itself, performed Web searches, examined popular source-code repositories, and, when all else failed, emailed the authors. We also attempted to build the code but did not go so far as trying to verify the correctness of the published results.
Recommendations. Previous work on repeatability describes the steps that must be taken in order to produce research that is truly repeatable11,12 or describes tools or websites that support publication of repeatable research.4,6 Our recommendations are more modest. We recognize that, as a discipline, computer science is a long way away from producing research that is always, and completely, repeatable. But, in the interim, we can require authors to conscientiously inform their peers of their intent with respect to sharing their research artifacts. This information should be provided by the authors when submitting their work for publication; this would allow reviewers to take the expected level of repeatability into consideration in their recommendation to accept or reject. To this end, we make a recommendation for adding sharing contracts to publications: a statement by authors as to the level of repeatability readers can expect.
Three previous empirical studies explored computer science researchers' willingness to share code and data. Kovačević5 rated 15 papers published in the IEEE Transactions on Image Processing and found that while all algorithms had proofs, none had code available, and 33% had data available. Vandewalle et al.18 examined the 134 papers published in IEEE Transactions on Image Processing in 2004, finding "... code (9%) and data (33%) are available online only in a minority of the cases ..." Stodden15 reported that while 74% of the registrants at the Neural Information Processing Systems (machine-learning) conference said they were willing to share post-publication code and 67% post-publication data, only " ... 30% of respondents shared some code and 20% shared some data on their own websites." The most common reasons for not sharing code were "The time it takes to clean up and document for release," "Dealing with questions from users about the code," "The possibility that your code may be used without citation," "The possibility of patents, or other IP constraints," and "Competitors may get an advantage." Stodden14 has since proposed "The Open Research License," which, if universally adopted, would incentivize researchers to share by ensuring " ... each scientist is attributed for only the work he or she has created."13
Public repositories can help authors make their research artifacts available in perpetuity. Unfortunately, the "if you build it they will come" paradigm does not always work; for example, on the RunMyCode17 and ResearchCompendia Web portals, only 143 and 236 artifacts, respectively, had been registered as of January 2016.
One attractive proposition for researchers to ensure repeatability is to bundle code, data, operating system, and libraries into a virtual machine image.4,9 However, this comes with its own problems, including how to perform accurate performance measurements; how to ensure the future existence of VM monitors that will run a given VM image; and how to safely run an image that contains obsolete operating systems and applications to which security patches may not have been applied.
From 2011 until January 2016, 19 computer science conferences participated in an "artifact evaluation process." Submitting an artifact is voluntary, and the outcome of the evaluation does not influence whether or not a paper is accepted for publication; for example, of the 52 papers accepted by the 2014 Programming Language Design and Implementation (PLDI) conference, 20 authors submitted artifacts for evaluation, with 12 classified as "above threshold." For PLDI 2015, this improved to 27 accepted artifacts out of 58 accepted papers, reflecting an encouraging trend.
Our study employed a team of undergraduate and graduate research assistants in computer science and engineering to locate and build source code corresponding to the papers from the latest incarnations of eight ACM conferences (ASPLOS'12, CCS'12, OOPSLA'12, OSDI'12, PLDI'12, SIGMOD'12, SOSP'11, and VLDB'12) and five journals (TACO'12, TISSEC'12/13, TOCS'12, TODS'12, and TOPLAS'12).e
We inspected each paper and removed from further consideration any that reported on non-commodity hardware or whose results were not backed by code. For the remaining papers we searched for links to source code by looking over the paper itself, examining the authors' personal websites, and searching the Web and code repositories (such as GitHub, Google Code, and SourceForge). If still unsuccessful, we sent an email request to the authors, excluding some papers to avoid sending each author more than one request. We sent each request to all authors for whom we could determine an address and reminder email messages to those who did not respond.
We marked a paper as "code not available" in the following cases: when we found only partial code or binary releases; when the authors promised they would "send code soon" but we heard nothing further; when we were asked to sign a license or non-disclosure agreement; when the authors requested credit for any follow-up work; or when we received code more than two months after the original email request.
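The conditions above amount to a small decision rule. The following Python sketch is illustrative only: the `Outcome` record, its field names, and the 60-day approximation of "two months" are assumptions introduced here and are not part of the study's actual tooling.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed approximation of "two months"; the article gives no exact day count.
MAX_DAYS_TO_RECEIVE = 60


@dataclass
class Outcome:
    """Hypothetical record of what a code search/request produced for one paper."""
    partial_only: bool = False         # only part of the code was found
    binary_only: bool = False          # only a binary release was found
    promised_but_silent: bool = False  # "send code soon", then nothing further
    license_or_nda: bool = False       # asked to sign a license or NDA
    credit_requested: bool = False     # authors requested credit for follow-up work
    # Days between the email request and receiving code; 0 if the code was
    # found without asking, None if no code was ever obtained.
    days_until_received: Optional[int] = None


def code_available(o: Outcome) -> bool:
    """Return False if any of the 'code not available' conditions applies."""
    if o.partial_only or o.binary_only:
        return False
    if o.promised_but_silent or o.license_or_nda or o.credit_requested:
        return False
    if o.days_until_received is None:
        return False
    return o.days_until_received <= MAX_DAYS_TO_RECEIVE
```

Under these assumptions, `code_available(Outcome(days_until_received=0))` is true, while a paper whose code arrived after 90 days, or whose authors asked for a signed license, is counted as "code not available."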
We next made two attempts to build each system. This often required editing makefiles and finding and installing specific operating system and compiler versions, and external libraries. We first gave a research assistant a 30-minute time limit, and, if that failed, we gave another assistant "unlimited time" to attempt the build.f
Upon completing the build process we conducted an online survey of all authors to help verify the data we had gathered. We resolved cases where we had misclassified a paper, where our Web searches had turned up the wrong code, or where there had been a misunderstanding between us and the authors. We also asked the authors if the version of the code corresponding to the results in their papers was available and (in cases where we had failed to build their code) if they thought the code ought to build. The survey also let the authors comment on our study.
We define three measures of weak repeatability (weak repeatability A, B, and C) with notation we outline in Table 1:
Weak repeatability A models scenarios where limited time is available to examine a research artifact, and when communicating with the author is not an option (such as when reviewing an artifact submitted alongside a conference paper). Weak repeatability B models situations where ample time is available to resolve issues, but the lead developer is not available for consultation. The latter turns out to be quite common. We saw situations where the student responsible for development had graduated, the main developer had passed away, the authors' email addresses no longer worked, and the authors were too busy to provide assistance. Weak repeatability C measures the extent to which we were able to build the code or the authors believed their code builds with reasonable effort. This model approximates a situation where ample time is available to examine the code and the authors are responsive to requests for assistance.
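To make the relationship among the three measures concrete, here is a minimal Python sketch. It assumes a simplified per-paper record (`Paper`, with hypothetical field names) and uses the number of papers passed in as the denominator; the exact formulas and exclusions from Table 1 are not reproduced, and the sample data is invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    """Hypothetical per-paper build outcome."""
    built_in_30_min: bool     # first assistant succeeded within the time limit
    built_eventually: bool    # second assistant succeeded with "unlimited time"
    author_says_builds: bool  # survey answer: code builds with reasonable effort


def weak_repeatability(papers):
    """Return the (A, B, C) rates as fractions of the papers passed in."""
    n = len(papers)
    if n == 0:
        raise ValueError("need at least one paper")
    a = sum(p.built_in_30_min for p in papers)
    b = sum(p.built_in_30_min or p.built_eventually for p in papers)
    c = sum(p.built_in_30_min or p.built_eventually or p.author_says_builds
            for p in papers)
    return a / n, b / n, c / n


# Toy data, invented for illustration; not the study's results.
sample = [
    Paper(True, False, False),
    Paper(False, True, False),
    Paper(False, False, True),
    Paper(False, False, False),
]
print(weak_repeatability(sample))  # (0.25, 0.5, 0.75)
```

Because each measure relaxes the previous one, the sketch always yields A ≤ B ≤ C, which matches the ordering of the rates reported for the study.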
The results of our study are listed in Table 2 and outlined in the figure here, showing repeatability rates of A=32.3%, B=48.3%, and C=54.0%. Here, C is limited by the response rate to our author survey, 59.5%.
Does public funding affect sharing? The National Science Foundation Grant Proposal Guide7 says, "Investigators and grantees are encouraged to share software and inventions created under the grant or otherwise make them or their products widely available and usable." However, we did not find significant differences in the weak repeatability rates of NSF-funded vs. non-NSF-funded research.
Does industry involvement affect sharing? Not surprisingly, papers with authors only from industry have a low rate of repeatability, and papers with authors only from academic institutions have a higher-than-average rate. The reason joint papers also have a lower-than-average rate of code sharing is not immediately obvious; for instance, the industrial partner might have imposed intellectual-property restrictions on the collaboration, or the research could be the result of a student's summer internship.
Does the right version exist? From the responses we received from authors, we noticed published code does not always correspond to the version used to produce the results in the corresponding paper. To see how common this is, in our author survey we asked, "Is your published code identical to the version you ran to get the results in the paper (ignoring inconsequential bug fixes)?" It was encouraging to see that out of the 177 responses to this question, 83.1% answered "yes," 12.4% answered "No, but it is possible to make that version available," and only 4.5% answered, "No, and it is not possible to make that version available."
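As a quick consistency check, the reported percentages can be converted back into approximate response counts. The counts below are inferred by rounding and are not stated in the article; the function name `implied_counts` is introduced here for illustration.

```python
def implied_counts(total, percentages):
    """Round each percentage of `total` to the nearest whole response."""
    return [round(total * p / 100) for p in percentages]


# 177 responses; shares for "yes", "no, but possible", "no, not possible".
counts = implied_counts(177, [83.1, 12.4, 4.5])
print(counts, sum(counts))  # [147, 22, 8] 177
```

The inferred counts add up to the 177 responses, so the three percentages are mutually consistent.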
Why is code not shared? The email responses we received were generally pleasant, accommodating, and apologetic if code could not be provided. In the following paragraphs, we explore several representative examples of email responses from authors who turned down our request.
In order for research to be truly repeatable, the correct version of all artifacts must be available, which is not always the case; for example, one respondent said, "I'm not very sure whether it is the final version of the code used in our paper, but it should be at least 99% close."
Authors often told us once their code was cleaned up we could have access to their system, in one case saying, "Unfortunately the current system is not mature enough at the moment, so it's not yet publicly available. We are actively working on a number of extensions and things are somewhat volatile." Eventually making (reworked) code available may be helpful for benefaction, but for repeatability, such delayed releases are ineffectual; it will never be possible for a reviewer or reader to verify the results presented in the paper.
Several authors acknowledged they never had the intention to make the code available, in one case saying, "I am afraid that the source code was never released. The code was never intended to be released so is not in any shape for general use."
In some cases, the one person who understood the system had left, with one respondent saying, "For the paper we used a prototype that included many moving pieces that only [student] knew how to operate and we did not have the time to integrate them in a ready-to-share implementation before he left."
Lack of proper backup procedures was also a problem, with one respondent saying, "Unfortunately, the server in which my implementation was stored had a disk crash in April and three disks crashed simultaneously ... my entire implementation for this paper was not found ... Sorry for that."
Researchers employed by commercial entities were often not able to release their code, with one respondent saying, "The code owned by [company], and AFAIK the code is not open-source." This author added this helpful suggestion: "Your best bet is to reimplement :( Sorry."
Even academic researchers had licensing issues, with one respondent saying, "Unfortunately, the [system] sources are not meant to be opensource [sic] (the code is partially property of [three universities])." Some universities put restrictions on the release of the code, with one respondent saying, " ... we are making a collaboration release available to academic partners. If you're interested in obtaining the code, we only ask for a description of the research project that the code will be used in (which may lead to some joint research), and we also have a software license agreement that the University would need to sign."
Some systems were built on top of other systems that were not publicly available, with one respondent saying, "We implemented and tested our ... technique on top of a commercialized static analysis tool. So, the current implementation is not open to public. Sorry for this." And, some systems were built on top of obsolete systems, with one respondent saying, "Currently, we have no plans to make the scheduler's source code publicly available. This is mainly because [ancient OS] as such does not exist anymore ... few people would manage to get it to work on new hardware."
Some authors were worried about how their code might be used, with one respondent saying, "We would like to be notified in case the provided implementation will be utilized to perform (and possibly publish) comparisons with other developed techniques ... based on earlier (bad) experience, we would like to make sure that our implementation is not used in situations that it was not meant for."
Producing artifacts solid enough to be shared is clearly labor intensive, with one researcher explaining how he had to make a draconian choice, saying, "[Our system] continues to become more complex as more Ph.D. students add more pieces to it ... In the past when we attempted to share it, we found ourselves spending more time getting outsiders up to speed than on our own research. So I finally had to establish the policy that we will not provide the source code outside the group."
Unlike researchers in other fields, computer security researchers must contend with the possible negative consequences of making their code public, with one respondent saying, "... we have an agreement with the [business-entity] company, and we cannot release the code because of the potential privacy risks to the general public."
Some authors used unusual languages and tools that make it difficult for others to benefit from their code, with one respondent saying, "The code ... is complete, but hardly usable by anyone other than the authors ... due to our decision to use [obscure language variant] for the input language."
To improve the state of repeatability in computer science research we could simply require, along with every paper submitted for publication, the authors attach the corresponding code, perhaps in the form of a virtual machine image. Unfortunately, based on our study, it is unrealistic to expect computer science researchers to always make their code available to others. There are several reasons for this: the code may not be clean enough for public distribution; they may be considering commercialization; (part of) the code may have licensing restrictions; they may be too busy to answer questions about their system; or they may worry about not receiving proper attribution for any follow-up work.
We thus make a much more modest proposal that would require only minor changes to how public funding agencies and academic publishers operate:
Fund repeatability engineering. Funding agencies should encourage researchers to request additional funds for "repeatability engineering," including hiring programming staff to document and maintain code, do release management, and assist other research groups wanting to repeat published experiments. In the same way funding agencies conduct financial audits to ensure costs claimed by grantees are allowed, they should also conduct random audits to ensure research artifacts are shared in accordance with what was promised in the grant application; and
Require sharing contract. Publishers of conference proceedings and journals should require every article include a sharing contract specifying the level of repeatability to which its authors will commit.
While the first point will have the effect of shifting some funding from pure research to engineering and oversight, both are important because they ensure research results continue to benefit the academic community (and the public funding it) past the project end date. Here, we expand on the second point.
Sharing contracts. The sharing contract should be provided by the authors when a paper is submitted for publication (allowing reviewers to consider the expected level of repeatability of the work), as well as in the published version (allowing readers to locate research artifacts). The contract commits the author to making available certain resources that were used in the research leading up to the paper and commits the reader/reviewer to taking these resources into account when evaluating the contributions made by the paper.
Table 3 lists the data that should be part of a sharing contract, including the external resources that back up the results in the paper, the locations where these resources can be found or ways to contact the authors, and the level of technical support the authors will provide.
Resources can include code, data, and media. For each resource, the contract must state if it is accessible and at what cost, a deadline after which it might no longer be available, and whether it is available in source or binary form or accessible as a service. Code accessed as a service could be running as a Web service or be executed by the authors themselves on input data provided by the reader. We include an optional comment field to handle unusual situations.
Sharing is different from licensing. A sharing contract represents a commitment on behalf of the author to make resources available to the wider community for scrutiny. A license, on the other hand, describes the actions allowed on these resources (such as modification, redistribution, and reverse engineering). Since copyright bars reuse without permission of the author(s), both licensing and sharing specifications are necessary; for example, if a license prohibits reverse engineering, the community's ability to verify that the actions performed by the software are consistent with what is described in the publication is diminished. Likewise, benefaction is hampered by code that makes use of libraries whose license prohibits redistribution.
The contract must also specify the level of technical support the authors commit to provide, for how long they will provide it, and whether that support is free; Table 3 includes a non-exhaustive list of possible types of support.
In some situations authors will want to make their artifacts available under more than one sharing contract, where each contract is targeted at a different audience (such as academic and commercial).
Example of a sharing contract. Publishers must design a concrete syntax for sharing contracts that handles most common situations, balancing expressiveness and conciseness. For illustrative purposes, here is an example contract for the research we have presented, giving free access to source code and data in perpetuity and rudimentary support for free until at least the end of 2016:
code: access, free, source;
data: access, free, source, "sanitized";
support: installation, bug fixes, free, 2016-12-31;
In this research, we must sanitize email exchanges before sharing them. We express this in the comment field.
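The example contract is regular enough to be read by a program. The following Python sketch parses it under an assumed grammar inferred from the example alone: clauses end with `;`, a clause is `name: item, item, ...`, and a double-quoted item is a comment. The function name `parse_contract` and the output layout are hypothetical; a concrete syntax designed by publishers could well differ.

```python
EXAMPLE = '''
code: access, free, source;
data: access, free, source, "sanitized";
support: installation, bug fixes, free, 2016-12-31;
'''


def parse_contract(text):
    """Parse 'name: item, item, ...;' clauses into a dict.

    Each clause maps to {"terms": [...], "comments": [...]}, where quoted
    items are treated as comments. Commas or semicolons inside a quoted
    comment are not supported by this simple sketch.
    """
    contract = {}
    for clause in text.split(";"):
        clause = clause.strip()
        if not clause:
            continue  # skip blank text after the final ';'
        name, sep, body = clause.partition(":")
        if not sep:
            raise ValueError(f"missing ':' in clause {clause!r}")
        terms, comments = [], []
        for item in body.split(","):
            item = item.strip()
            if len(item) >= 2 and item.startswith('"') and item.endswith('"'):
                comments.append(item[1:-1])
            elif item:
                terms.append(item)
        contract[name.strip()] = {"terms": terms, "comments": comments}
    return contract


contract = parse_contract(EXAMPLE)
print(contract["data"])
# {'terms': ['access', 'free', 'source'], 'comments': ['sanitized']}
```

A machine-readable form like this would let a submission system or a repository check, for instance, that every paper declares a `code` clause, without constraining how authors word their comments.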
While there is certainly much room for novel tools for scientific provenance, licensing frameworks that reassure researchers they will be properly attributed, and repositories that can store research artifacts in perpetuity, we must acknowledge the root of the scientific-repeatability problem is sociological, not technological; when we do not produce solid prêt-à-partager artifacts or attempt to replicate the work of our peers it is because there is little professional glory to be gained from doing so. Nosek8 wrote, "Because of strong incentives for innovation and weak incentives for confirmation, direct replication is rarely practiced or published," and "Innovative findings produce rewards of publication, employment, and tenure; replicated findings produce a shrug."
The real solution to the problem of researchers not sharing their research artifacts lies in finding new reward structures that encourage them to produce solid artifacts, share these artifacts, and validate the conclusions drawn from the artifacts published by their peers.3 Unfortunately, in this regard we remain pessimistic, not seeing a near future where such fundamental changes are enacted.
In the near term, we thus propose two easily implementable strategies for improving this state of affairs: a shift in public funding to repeatability engineering and adopting sharing specifications. With required sharing contracts, authors (knowing reviewers are likely to take a dim view of a paper that says, up front, its results are not repeatable) will thus be incentivized to produce solid computational artifacts. Adjusting funding-agency regulations to encourage engineering for repeatability will provide them with the resources to do so.
We would like to thank Saumya Debray, Shriram Krishnamurthi, Alex Warren, and the anonymous reviewers for valuable input.
1. Collberg, C., Proebsting, T., and Warren, A.M. Repeatability and Benefaction in Computer Systems Research: A Study and a Modest Proposal. Technical Report TR 14-04. Department of Computer Science, University of Arizona, Tucson, AZ, Dec. 2014; http://repeatability.cs.arizona.edu/v2/RepeatabilityTR.pdf
3. Friedman, B. and Schneider, F.B. Incentivizing Quality and Impact: Evaluating Scholarship in Hiring, Tenure, and Promotion. Computing Research Association Best Practices Memo, Feb. 2015; http://archive2.cra.org/uploads/documents/resources/bpmemos/BP_Memo.pdf
5. Kovačević, J. How to encourage and publish reproducible research. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Volume IV (Honolulu, HI, Apr. 15–20). IEEE Computer Society, 2007, 1273–1276.
7. National Science Foundation. Grant Policy Manual 05-131. Arlington, VA, July 2005; http://www.nsf.gov/pubs/manuals/gpm05_131
9. Perianayagam, S., Andrews, G.R., and Hartman, J.H. Rex: A toolset for reproducing software experiments. In Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (Hong Kong, Dec. 18–21). IEEE Computer Society, 2010, 613–617.
10. Rozier, K.Y. and Rozier, E.W.D. Reproducibility, correctness, and buildability: The three principles for ethical public dissemination of computer science and engineering research. In Proceedings of the IEEE International Symposium on Ethics in Science, Technology, and Engineering (Chicago, IL, May 23–24). IEEE Computer Society, 2014, 1–13.
15. Stodden, V. The Scientific Method in Practice: Reproducibility in the Computational Sciences. Technical Report Working Paper 4773-10. MIT Sloan School of Management, Cambridge, MA, Feb. 2010; http://web.stanford.edu/~vcs/papers/SMPRCS2010.pdf
17. Stodden, V., Hurlin, C., and Perignon, C. RunMyCode.org: A novel dissemination and collaboration platform for executing published computational results. In Proceedings of the Eighth IEEE International Conference on E-Science (Chicago, IL, Sept. 15). IEEE Computer Society, 2012, 1–8.
19. Vitek, J. and Kalibera, T. Repeatability, reproducibility, and rigor in systems research. In Proceedings of the 11th ACM International Conference on Embedded Software (Taipei, Taiwan, Oct. 9–14). ACM Press, New York, 2011, 33–38.
20. Yale Law School Roundtable on Data and Code Sharing. Reproducible research: Addressing the need for data and code sharing in computational science. Computing in Science and Engineering 12, 5 (Sept./Oct. 2010), 8–13.
e. See Collberg et al.1 for a description of the process through which the study was carried out.
f. A group of independent researchers set out to verify our build results through a crowdsourced effort; http://cs.brown.edu/~sk/Memos/Examining-Reproducibility
Figure. Summary of the study's results. Blue numbers represent papers we excluded from the study, green numbers papers we determined to be weakly repeatable, red numbers papers we determined to be non-repeatable, and orange numbers represent papers for which we could not conclusively determine repeatability (due to our restriction of sending at most one email request per author).
©2016 ACM 0001-0782/16/03
The following letter was published in the Letters to the Editor of the May 2016 CACM (http://cacm.acm.org/magazines/2016/5/201586).
Christian Collberg and Todd A. Proebsting deserve our gratitude for their article "Repeatability in Computer Systems Research" (Mar. 2016) shining sunlight (the best kind of disinfectant, according to Supreme Court Justice Louis D. Brandeis) on the very real problem of lack of repeatability in computer science research. Without repeatability, there is no real science, something computer science cannot tolerate.