(Re)Use of Research Results (Is Rampant)

Prior pessimism about reuse in software engineering research may have been a result of using the wrong methods to measure the wrong things.

By Maria Teresa Baldassarre, Neil Ernst, Ben Hermann, Tim Menzies, and Rahul Yedida

Posted Feb 1 2023

According to Popper,²³ the ideas we can most trust are those that have been the most tried and tested. For that reason, many of us are involved in this process called “science,” which produces trusted knowledge by sharing one’s ideas and trying out and testing the ideas of others. Science and scientists form communities where people do each other the courtesy of curating, clarifying, critiquing, and improving a large pool of ideas.

Key Insights

Researchers must take back control over how our products are shared.
Results should not be locked away behind paywalls that block access to results and enshrine outdated views on science. Using open source tools, research communities can map the real patterns of inference between their work.
For example, in SE, researchers often share many artifacts, ranging from ideas to paper-based methods, datasets, and tools. While exact replication of results is rare, we can report just from one year that there are hundreds of cases where one researcher used some but not all of the artifacts from other work.
We ask others to join us in this effort to accurately record the reality of 21st century science.

According to this definition, one measure of a scientific community’s health is how much it reuses results. By that measure, the software engineering research community might seem to be very unhealthy. Da Silva et al. reported that from 1994 to 2010, only 72 studies had been replicated by 96 new studies.¹⁰ In February 2022, as a double-check for da Silva’s conclusion, we queried the ACM Portal for products from the International Conference on Software Engineering (ICSE), that field’s premier conference. Between 2011 and 2021, only 111 out of the 8,774 ICSE research entries were labeled as ‘available,’ 74 as ‘reusable,’ 24 as ‘functional,’ and none as ‘replicated’ or ‘reproduced’ (see Table 1 for a definition of those terms). Put another way, according to the ACM Portal, only 2.4% of the ICSE publications are explicitly associated with any kind of reuse. Worse still, according to that report, there were no replicated or reproduced results from ICSE in the last decade.
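The 2.4% figure can be checked with a few lines of arithmetic. The sketch below assumes the three badge counts do not overlap (the text does not say whether a single entry can hold more than one badge); the variable names are ours, chosen for illustration.

```python
# Badge counts reported above for ICSE entries, 2011-2021 (ACM Portal, Feb 2022).
badge_counts = {
    "available": 111,
    "reusable": 74,
    "functional": 24,
    "replicated": 0,
    "reproduced": 0,
}
total_entries = 8774

# Assumption: each badged entry is counted once (no entry holds two badges).
badged = sum(badge_counts.values())
share = badged / total_entries

print(f"{badged} of {total_entries} entries = {share:.1%}")
# prints: 209 of 8774 entries = 2.4%
```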

Table 1. Badges such as the ones shown in this table are currently awarded at conferences.² This table is based on ACM’s badge program; however, analogous badges are used at other conferences. Images used by permission of the Association for Computing Machinery.

We argue that the reuse “problem” is more apparent than real—at least in software engineering. We describe a successful approach to recording research reuse where teams of researchers from around the world read 170 recent (2020) conference papers from software engineering. This work generated the “reuse graph” in Figure 1, in which each edge connects papers to the prior work they are (re)using. As we will discuss, when compared to other community monitoring methods (for example, artifact tracks or bibliometric searches^5,12,19), these reuse graphs require less effort to build and verify. For example, it took around 12 minutes per paper for our team from Hong Kong, Canada, the U.S., Italy, Sweden, Finland, and Australia to apply this reuse graph methodology to software engineering.^a

Figure 1. The 1,635 arrows in this diagram connect reusers to the reused.

The rest of this article discusses how we generate and apply our reuse graphs, and why they are valuable. Before beginning, we offer the following introductory remark. This article is written as a protest, of sorts, against how we currently assess science and scientific output. This article’s authors have worked as researchers for decades, supervising graduate students and organizing prominent conferences and journals. Based on that experience, we assert that researchers do more than write papers. Rather, we are all engaged in long-term stewardship of ideas; as part of that stewardship, we generate more than just papers. Yet, of all our products, it is only our papers that are used, mostly in some annual bibliometric analysis of our worth. We view this as an inadequate way to measure what researchers do.

The problem, we think, is in the very term “bibliometric.” This term is heavily skewed toward publications and monographs and the kinds of things we can easily store in the repositories of our professional societies—for example, IEEE Xplore and ACM Portal. In fact, the term “bibliométrie” was first used by Paul Otlet in 1934²⁵ and was defined as “the measurement of all aspects related to the publication and reading of books and documents.”

Subsequent definitions tried to broaden that definition. For example, the anglicized version “bibliometrics” was first used by Alan Pritchard in his 1969 paper, “Statistical Bibliography or Bibliometrics?”, where he defined the term as “the application of mathematics and statistical methods to books and other media of communication.”²⁴ But what we are observing in 2022 is that “other media of communication” in software engineering (and other fields) is far broader than just the products stored in the repositories of our professional societies. For example, researchers might use the results of papers, follow guidance from one paper in their own work, or download data or code used in another paper (and then use it locally). We argue that all such downloads or guidance-following are examples of reuse, since all are examples of members of our research community reusing products from other research in their own work (for a more exact categorization of the types of reuse we are studying, please see our section Studying Reuse).

It is all too easy to propose a broader definition for how scholars reuse and communicate their products. Such a new definition is practically useless unless we can propose some method to collect data on that new definition. We suggest that our new definitions can be operationalized via crowdsourced methods.

Capturing Reuse

There are many methods to map the structure of SE research, such as (a) manual or automatic citation searches or (b) “artifact evaluation committees” (AECs) that foster the generation and sharing of research products. Such studies can lag significantly behind current work. For example, in our own prior citation analysis of SE,¹⁹ we only studied up to 2016. The study itself was conducted in 2017, but not fully published until 2018. Given the enormous effort required for that work, we have vowed never to do it again.

Reuse graphs, on the other hand, are faster to update since the work of any individual working on these graphs is minimal. Other reasons for favoring reuse graphs are that they are community comprehensible, verifiable, and correctable. All the data used for our reuse graphs is community-collected and can be audited at https://reuse-dept.org. If errors are detected, issue reports can be raised in our GitHub repository and then corrected. The same may not hold true for studies based on citation servers run by professional bodies and for-profit organizations (see Table 2). New data can be contributed by anyone, either by directly supplying data in our format or through a user interface on our website. The resulting issue report is then reviewed and, when necessary, corrected. After a third person successfully inspects the data, it is added to the reuse graph.

Table 2. Examples of errors in citation servers.

What is the value of a verified, continually updated snapshot of a current research area? Once our reuse graph covers several years (and not just 2020 conference publications), we foresee several applications:

Academics can check that their contributions to science are being properly recorded.
When applying for a promotion or new position, research faculty or industrial workers could document the impact of their work beyond papers, including tools, datasets, and innovative methods.
Graduate students could direct their attention to research areas that are both very new (nodes from recent years) and very productive (nodes with an unusually large number of edges attached).
Organizers of conferences could select their keynote speakers from that space of new and productive artifacts.
Growth patterns might guide federal government funding priorities or departmental hiring plans.
Venture capitalists could use these graphs to detect emergent technologies, perhaps even funding some of those.
Conference organizers could check if their program committees have enough members from currently hot topics.
Further, those same organizers could create new conference tracks and journal sections to service active research communities that are under-represented in current publication venues.
Journal editors could find reviewers with relevant experience.
Educators can use the graphs to guide their teaching plans.

Studying Reuse

In our reuse study, we targeted papers from the 2020 technical programs of six major international SE conferences: ICSE, Automated Software Engineering (ASE), Joint European Software Engineering Conference/Foundations of Software Engineering (ESEC/FSE), Software Maintenance and Evolution (ICSME), Mining Software Repositories (MSR), and Empirical Software Engineering and Measurement (ESEM). These conferences were selected using advice from Matthew et al.,¹⁹ but our vision is to expand, for example, by looking at all top-ranked SE conferences. GitHub issues were used to divide up the hundreds of papers from those conferences into “work packets” of 10 papers each. Reading teams were drawn from software engineering research groups around the globe: Hong Kong; Istanbul, Turkey; Victoria, Canada; Gothenburg, Sweden; Oulu, Finland; Melbourne, Australia; and Raleigh, NC, USA. Team members assigned themselves work packets and read the papers in search of the examples of reuse listed in the next paragraph. Once completed, a second person (from any of our teams) performed the same task and checked for consistency. Fleiss Kappa statistics were then computed to track the level of reader disagreement. GitHub issues^b were used to manage this in the open, but raters were asked not to examine previous results. A member of this article’s author team then performed a final check on disagreements before including the data in the graph.

Teams were asked to record six kinds of reuse:

Most papers must benchmark new ideas against some prior recent state-of-the-art paper. That is, they reuse old papers as steppingstones toward new results.
Statistical methods are often reused. Here we do not mean “we use a two-tailed t-test” or some other decades-old, widely used statistical method. Rather, we refer to statistical methods in recent papers that propose statistical guidance for the kinds of analysis seen in SE. Perhaps because this kind of analysis is very rare, this work is highly cited. For example:

A 2008 paper, “Benchmarking Classification Models for Software Defect Prediction,”¹⁸ has 1,178 citations.
A 2011 paper, “A Practical Guide for Using Statistical Tests to Assess Randomized Algorithms,”¹ has 778 citations.

Metrics and methodology descriptions that are specific to the research area, including software metrics such as CK metrics or flow metrics as well as research methods such as grounded theory or sampling criteria.
Datasets
Sanity checks, which justify why a particular approach works or is reasonable to avoid bad data—for example, why to avoid using GitHub stars to select repositories.¹⁶
Software packages of the kind currently being reviewed by SE conference AECs (tools and replications).

Figure 2 shows an example. Starting with a paper by Bernal-Cárdenas et al.,⁵ we find among others a reused dataset from Moran et al.,²² tool reuse of FFmpeg and TensorFlow object detection, and several reused methods, including the ConvNet approach described by Simonyan. Readers can follow the URL in the Figure 2 caption for more detailed information.

Figure 2. A detailed view of a section of the reuse graph from Figure 1.

We can report that it is not difficult to read papers to detect these kinds of reuse:

The six types of reuse noted above can be found quickly. Our graduate students report that reading their first paper might take up to an hour. But after two or three papers, the median reading time drops to approximately 12 minutes (see Figure 3a).

Figure 3. Reading time results, agreement scores, and yearly prevalence of reused papers.

When we compare the reuse reported by different readers, we get Figure 3b. In our current results, the median Fleiss Kappa score (for reviewer agreement) is 1—that is, very good.
The one caveat we would add is that graduate students involved in this activity need at least two years of active research experience in their area of study. We base this on the fact that when we tried data collection from a large intro-to-SE graduate subject, the resulting Kappa agreement scores were poor.
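To make the agreement score concrete, here is a minimal sketch of the standard Fleiss' kappa formula. It assumes each reader's judgment on a candidate reuse link has been reduced to one categorical label; the article does not describe how its own ratings were encoded, so the labels, the sample data, and the function name `fleiss_kappa` are illustrative only.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of labels (one per rater).

    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    counts = [Counter(item) for item in ratings]
    categories = sorted({label for item in ratings for label in item})

    # Observed agreement: share of rater pairs that agree, averaged over items.
    per_item = [
        (sum(c[k] ** 2 for k in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement: based on how often each category is used overall.
    p_category = [
        sum(c[k] for c in counts) / (n_items * n_raters) for k in categories
    ]
    p_expected = sum(p * p for p in p_category)

    if p_expected == 1.0:
        return float("nan")  # only one category was ever used: kappa undefined
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical data: two readers label four candidate links and always agree.
perfect = [["reuse", "reuse"], ["none", "none"],
           ["reuse", "reuse"], ["none", "none"]]
print(fleiss_kappa(perfect))  # prints: 1.0
```

A score of 1 means the readers agreed on every item; a score near 0 means agreement is no better than chance.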

The result of this data collection is a directed multi-graph of publications and other forms of dissemination of research artifacts. The edges of this graph are annotated with the type of reuse according to the list above. Reuse metrics for a specific publication (or other form, for example, a GitHub repository) are the in-degree and out-degree measures of the node that represents this publication. When accumulated for the originating authors, individual reuse metrics can be collected. Zooming into the graph on our website, reuse types are visibly annotated at the graph edges (see Figure 2). A filter allows a graph to be extracted for a single reuse type out of the multi-graph.
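The structure described above can be sketched as a small directed multi-graph with typed edges. This is not the project's actual data format or tooling; the class names, the reuse-type labels ("dataset", "tool", "method"), and the node names are assumptions made for illustration, with the sample edges taken from the Figure 2 example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReuseEdge:
    reuser: str  # the work doing the reusing
    reused: str  # the prior paper, dataset, tool, or method being reused
    kind: str    # reuse type annotation (labels here are illustrative)


class ReuseGraph:
    """Directed multi-graph: parallel edges between two nodes are allowed."""

    def __init__(self, edges=None):
        self.edges = list(edges or [])

    def add(self, reuser, reused, kind):
        self.edges.append(ReuseEdge(reuser, reused, kind))

    def in_degree(self, node):
        # How often this node is reused by others.
        return sum(1 for e in self.edges if e.reused == node)

    def out_degree(self, node):
        # How many things this node reuses.
        return sum(1 for e in self.edges if e.reuser == node)

    def filter_by_kind(self, kind):
        # Extract the sub-graph for a single reuse type.
        return ReuseGraph(e for e in self.edges if e.kind == kind)


graph = ReuseGraph()
paper = "Bernal-Cárdenas et al."
graph.add(paper, "Moran et al. dataset", "dataset")
graph.add(paper, "FFmpeg", "tool")
graph.add(paper, "TensorFlow object detection", "tool")
graph.add(paper, "ConvNet (Simonyan)", "method")

print(graph.out_degree(paper))                  # prints: 4
print(graph.in_degree("FFmpeg"))                # prints: 1
print(len(graph.filter_by_kind("tool").edges))  # prints: 2
```

Storing edges as a flat list keeps parallel edges of different types between the same pair of nodes, which a plain adjacency set would collapse.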

Of course, there are many more items being reused than just the six we have listed.^c It is an open question, worthy of future work, to check if those other items can be collected in this way and, indeed, to refine these categories as understanding changes.

Related Work

Apart from software engineering,²¹ many other disciplines are actively engaged in artifact creation, sharing, and reuse.^3,4 Artifacts are useful for building a culture of replication and reproducibility,^9,17 already acknowledged as important in SE.^8,10,15,27 Fields such as psychology have had many early results thrown into doubt due to a failure to replicate the original findings.²⁸ Sharing research protocols and data through replication packages and artifacts allows for other research teams to conduct severe tests of the original studies,²⁰ strengthening or rejecting these initial findings.

In medicine, drug companies are mandated to share the research protocols and outcomes of their drug trials, something that has become vitally important recently, albeit not without challenges.¹¹ In physics and astronomy, artifact sharing is so commonplace that large community infrastructures exist solely to ensure data sharing, not least because the governments which fund these costly experiments insist on it.

In more theoretical areas of CS, the pioneering use of preprint servers has enabled ‘reuse’ of proofs, which has been essential to progress. In machine learning, replication is focused on steppingstones, enabled by highly successful benchmarks such as ImageNet.²⁶ However, recent advances with extremely costly training regimens have called replicability into question.^d

In the specific case of SE research, prior to this paper, there was little recorded and verified evidence of reuse. Many researchers have conducted citation studies that find links to highly cited papers—for example, Matthew et al.¹⁹ As stated previously, such studies can lag the latest results. Also, recalling Table 2, we have cause to doubt the conclusions from such citation studies.

From a practical perspective, many conferences have recently introduced AECs to entice reuse and replication. Moreover, authors of accepted conference papers submit software packages that, in theory, let others re-execute that work.^8,9 These committees award badges, as shown in Table 1.

Artifact evaluation is something of a growth industry in the SE as well as the programming languages (PL) communities, as shown in Figure 4, which presents the increasing number of people evaluating artifacts between 2011 and 2019. One may conclude that such practices make the community more aware of which artifacts are available and reusable, and that each such artifact can therefore become a node of a reuse graph. As such, the source is explicitly made available to any other researcher willing to (re)use it.^8,14

Figure 4. Artifact evaluation committee sizes, 2011–2019.¹⁴

Now, the question to be asked is: Are all the people of Figure 4 making the best use of their time? Perhaps not. Most artifacts are assigned the badges requested by the authors, so it might be safe to ask some of the personnel from Figure 4 to, for example, spend less time evaluating conference artifacts and more time working on Figure 1.

But most importantly, it is not clear whether the artifact evaluation process is creating reused artifacts, and therefore, indirectly contributing to the reuse graph concept. Indeed, if we query ACM Portal for “software engineering” and “artifacts” between 2015 and 2020, we find that most of the recorded artifacts are not reused in replications or reproductions.^e Specifically, only 1/20 are reproduced and only 1/50 are replicated.

Perhaps it might be useful to reflect more on what is being reused (as we have done earlier in this article). This is what has motivated our research and led us to create the reuse graph.

Next Steps for Reuse Graphs

When discussing this work with colleagues, we are often asked if we have assessed it. We reply that, at this stage, this is like asking the inventors of kd-trees⁶ in 1975 how much that method has sped up commercial databases. Right now, we are engaged in community building and have shown that we can create the infrastructure needed to collect our data with very little effort and not much coding. While Figure 1 is a promising start, scaling up requires that we organize a larger reading population. Our goal is to analyze 200 papers in 2022, 2,000 in 2023, and 5,000 in 2024, by which time we would have covered most of the major SE venues in the last five years. After that, our maintenance goal would be to read around 500 papers per year to keep up to date with the conferences (then, we would move on to journals). Based on Figure 3a, and assuming each paper is read by two people, the maintenance goal would be achievable by a team of 20 people working two hours per month on this task. To organize this work, we have created the ROSE Initiative (see the sidebar: The ROSE Initiative for more information).
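The maintenance estimate can be reproduced with simple arithmetic. The sketch below assumes every reading takes the median 12 minutes from Figure 3a and ignores onboarding time and third-person checks, so it is a rough lower bound on the effort rather than a plan.

```python
papers_per_year = 500
readers_per_paper = 2
minutes_per_reading = 12  # median reading time reported for Figure 3a

# Reading effort needed per year, in hours.
required_hours = papers_per_year * readers_per_paper * minutes_per_reading / 60

team_size = 20
hours_per_person_per_month = 2
# Volunteer capacity per year, in hours.
available_hours = team_size * hours_per_person_per_month * 12

print(required_hours, available_hours)  # prints: 200.0 480
```

Under these assumptions the team's capacity is more than twice the required reading time, which leaves headroom for slower first readings and for resolving disagreements.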

If that work interests you, then there are many ways you can get involved:

Visit https://reuse-dept.org if you are a researcher and wish to check that we have accurately recorded your contribution.
If you want to apply reuse graphs to your community, please use our tools at https://github.com/bhermann/DoR/.
If you would like to join this initiative and contribute to an up-to-the-minute snapshot of SE research, then please take our how-to-read-for-reuse tutorial,^f and then visit the dashboard at the GitHub site (bhermann/DoR). Find an issue with no one’s face on it, and assign yourself a task.

We see this effort as one part of the broader open science effort, in addition to helping the community identify the state of the art—for example, patterns of growth in the reuse graph. Among the goals of open science are the desire to increase confidence in published results and an acknowledgment that science produces more types of artifacts than just publications: Researchers also produce method innovations, new datasets, and better tools. If we take an agile view of SE science, then as researchers we should focus on generating these artifacts and rapidly securing critique, curation, and clarification from our peers and the public.

Figure. Watch the authors discuss this work in the exclusive Communications video. https://cacm.acm.org/videos/reuse-of-research

Sidebar: The ROSE Initiative

The ROSE Initiative (Recognizing and Rewarding Open Science in Software Engineering) is an international, multi-conference workshop that will continually report updates to the software engineering reuse graphs.

Researchers who reuse the most from other papers will be applauded and awarded an “R-index” (reuse index).
Researchers who build the artifacts that are most reused will be applauded (even louder) and awarded an “R+-index,” indicating they are producing the artifacts that are most used by the rest of the community.
Between each conference, the ROSE Initiative will coordinate an international team of volunteers to incrementally update the SE reuse graph. This article’s authors have volunteered to serve as the group’s initial nucleus, and we will hold open elections to grow that steering committee using others from our community.
This reuse graph will be displayed at a publicly available website (reuse-dept.org), where individual researchers can browse, check their entries, and propose corrections and extensions.
All reuse reports will be double-checked, and disputed claims will then be triple-checked.
All the tools used to create the site will be freely available for download. Hence, if the SE community does not like how we are running these reuse graphs, they can use the code and data for a different application.
Also, researchers from other disciplines can apply our tools to their own community.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from permissions@acm.org or fax (212) 869-0481.

DOI

10.1145/3554976

February 2023 Issue

Published: February 1, 2023

Vol. 66 No. 2

Pages: 75-81
