When Data Is Not Enough

Massive datasets and digital processing are transforming and accelerating science, but there is growing concern that many scientific results may not be trustworthy. Scientific procedures developed over centuries to assure reliable knowledge are sometimes overwhelmed by new ways of generating and processing scientific information. As a result, the scientific community is implementing requirements that help independent researchers reproduce published results, a cornerstone of the scientific method.

For data, the revolution is well under way. Inspired by projects like the Human Genome Project, the National Institutes of Health has provided infrastructure (and funding) for massive repositories of genetic and other data. In this field and others, researchers are expected to make their data available to other researchers.

Yet there is a growing recognition that provisions must also be made for the data-analysis software that supports the conclusions. Policy discussions of reproducibility "almost always talk primarily about data," cautioned Victoria Stodden of the Graduate School of Library and Information Science at the University of Illinois at Urbana-Champaign. "This is a huge gap for computational science."

"There has been an enormous flip in the last 20 years from data collection to data analysis," said Roger Peng at the Johns Hopkins Bloomberg School of Public Health in Baltimore. In many fields, the critical challenges now reside in analysis of widely available big datasets, he said. This shift "has happened more quickly than I think most fields have been prepared to deal with."

A Reproducibility Crisis

Reproducibility is always central to science, but especially when data may be fudged. In a recent case, Duke University researchers described a gene "signature" to predict cancer-drug response, which generated enormous interest and even clinical trials at Duke. Excited clinicians at the University of Texas MD Anderson Cancer Center enlisted statisticians Keith Baggerly and Kevin Coombes to confirm the conclusions, but the pair found numerous problems in the data analysis. Eventually the trials were stopped, but only after misrepresentations were found on researcher Anil Potti’s résumé.

Because of the limited published information, this reanalysis required "about 30% of our time for several months," Baggerly said. "This is not a scalable solution." Based on this experience, he said, "It became obvious that improving the reproducibility of reports was an important problem in and of itself," including the posting of code as well as data.

"This was certainly an atypical case," Baggerly stressed, but encouraging reproducibility is critical even without the possibility of misconduct. For example, a widely noted 2012 comment in Nature said retesting by the pharmaceutical company Amgen confirmed only six of 53 "landmark" biomedical studies. A study published in June on "The Economics of Reproducibility in Preclinical Research" (http://bit.ly/1K7NeWL) estimated the costs of irreproducibility in life science at $28 billion per year in the U.S., attributing a quarter of that sum to "data analysis and reporting." Recognizing this challenge, systematic efforts are under way to validate important studies in biomedicine and also in psychology, which has been plagued by irreproducible results.

Documenting Science in Progress

Traditionally, the laboratory notebook was a standard part of experimental research, documenting procedures, measurements, and calculations that led to the conclusions.

As analysis has moved to computers, researchers may explore their data using simple but powerful tools like Excel spreadsheets, which leave no record of the calculations. "It’s a little bit crazy," Stodden said, since those programs are "not designed with scientific tracking and sharing needs in mind."

Reproducibility of code should force scientists to migrate to platforms that document their manipulations explicitly. Although there is not yet any standard to rival the lab notebook, "it’s gotten markedly easier," said Baggerly. "You’ve seen the advent of a whole bunch of tools which make tracking of computation and reconstruction easier."
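As a rough illustration of what documenting manipulations explicitly can look like, the following Python sketch records a fingerprint of the input data, the parameters, and the result of a single analysis step. The function names (`fingerprint`, `run_logged`, `trimmed_mean`), the record layout, and the sample measurements are hypothetical; they are not taken from any tool mentioned in this article.

```python
import hashlib
import json
import statistics


def fingerprint(values):
    """Return a SHA-256 hex digest of the input values (hypothetical helper)."""
    payload = json.dumps(values, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_logged(step_name, func, values, **params):
    """Run one analysis step and return (result, provenance record)."""
    result = func(values, **params)
    record = {
        "step": step_name,
        "input_sha256": fingerprint(values),
        "params": params,
        "result": result,
    }
    return result, record


def trimmed_mean(values, trim=0):
    """Mean after dropping the `trim` smallest and `trim` largest values."""
    ordered = sorted(values)
    kept = ordered[trim:len(ordered) - trim] if trim else ordered
    return statistics.mean(kept)


measurements = [4.1, 3.9, 4.0, 9.7, 4.2]  # made-up sample data
result, record = run_logged("trimmed_mean", trimmed_mean, measurements, trim=1)
print(json.dumps(record, indent=2))
```

Unlike a spreadsheet cell edited by hand, the printed record states which data went in, which settings were used, and what came out, so another researcher could check that a rerun matches.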

For code, GitHub is widely used to track rapidly evolving programs. This support is important for code, Stodden says, because unlike data, "well-used popular software is all forked and branched; it’s changing all the time."

Other tools are also available, but so far none is appropriate for everyone. To help computational scientists navigate this complex landscape, Peng and two colleagues offer an online course on reproducible research, and he, Stodden, and a third author also collaborated on a book, Implementing Reproducible Research. "Everyone involved in science will need to know some data analysis," said Peng, or at least "how to manage people who are doing these complex data analyses" as part of increasingly multidisciplinary teams.

Putting Up Resistance

There are many reasons researchers may not want to post their code, noted Yolanda Gil of the University of Southern California’s Information Sciences Institute, who also chairs the ACM Special Interest Group on Artificial Intelligence. She likens the task to commenting code, a thankless task that everyone endorses but which many programmers neglect. On top of the work involved in making their software usable by strangers, researchers "often believe it is not very high-quality," she said, or that "no one will use it."

If the code and data are used by others, that can create even more work, as climate scientists learned when outside skeptics pored over their work. "It’s easy to take data that someone has provided and do something to it and say that something fraudulent has occurred," said Peng. In principle, such scrutiny will improve the science, but the extra work can make researchers reluctant to share.

Publications

In practice, researchers rarely recreate experiments simply to confirm them, but they often retrace the key steps in order to extend the results. Scientific-article style evolved specifically to ensure readers have enough information to repeat the procedures.

An official publication is an ideal time to assess the reliability of the results and to deposit data and code in a public repository, and journals increasingly allow online posting of supporting material. They also frequently require authors to affirm that data has been placed in public repositories, although those requirements are not necessarily enforced. "The legacy publishing system hasn’t quite caught up to the fact that we need to have code and data and digital scholarly objects associated with our publications," said Stodden.

One exception to this neglect is the journal Mathematical Programming Computation, founded in 2009. Following the mathematics tradition of checking proofs before publication, this journal not only requires authors to submit their software but, whenever possible, has technical editors confirm that the software works as advertised. "It can sometimes be quite difficult," said founding editor William Cook, a professor of Combinatorics and Optimization at the University of Waterloo in Canada, who says this process has uncovered serious unintentional errors. "I am now skeptical about any computational results (in my own area of mathematical optimization) that have not gone through a review like we do."

Still, "the extra review is a great burden on authors," he admits. Some of them complain that they get no extra credit for publishing in this journal, "so they choose to go elsewhere."

As a demonstration for less-mathematical science, Gil has been shepherding The Geosciences Paper of the Future Initiative, sponsored by the National Science Foundation. For this project, a dozen or so teams of geoscientists have volunteered to submit articles that illustrate transparency and reproducibility, to be published in the next few months. Among other things, "We’re trying to incorporate these best practices as you do the work, not at the end," Gil stressed.

Providing Incentives

"Everyone agrees" about the need to address code reproducibility, Stodden said. "We’d have much better science if we resolved these issues." However, expecting early-career researchers to unilaterally devote effort to this end is not reasonable, she stressed. "They’re just not going to, until the incentives change."

As a step toward providing those incentives, last winter the non-profit Center for Open Science hosted a committee including scientific experts as well as representatives of journals and funding agencies. The group published a general framework for journals to codify their transparency requirements in each of eight areas, notably including "analytic methods (code) transparency." For each area, the framework specifies four tiers of rigor, ranging from vague support or silence to detailed validation of posted resources.

The standards had to be flexible, said executive director Brian Nosek, a psychologist at the University of Virginia in Charlottesville. "Different fields have different documentation needs," and some journals have limited resources. "There can’t be a one-size-fits-all solution." The graduated structure should make it easy for journals to get started and to increase their commitment over time.

Still, "just having a standard isn’t a solution in itself," Nosek acknowledged. Ultimately, requirements for deposition of data, code, and other research materials also will have to be enforced broadly throughout the scientific ecosystem, including funding agencies and committees making tenure decisions. "A focus on journals was our start," Nosek said. "It was low-hanging fruit."

Into the Future

Code used to derive scientific results is finally being recognized as a critical ingredient for reproducibility, but supporting tools are still primitive. "The only solution that we have now is to dump all the code onto you," said Peng, which he likens to trying to teach a musician a song with a bit-level sound file. Researchers are still trying to devise the code equivalent of sheet music that makes the significance of sounds humanly useful.

Simply making code available also fails to answer the key question: whether it actually does the calculations it is supposed to do. Such verification is burdensome and may be almost impossible for large programs like those used for climate, unless researchers can implement intrinsically robust techniques like those being developed for mathematical proofs. (See "A New Type of Mathematics?" Communications, February 2014.)

In addition, even a complete snapshot of the code used in the publication will miss "all the trails and avenues that scientists went down that didn’t pan out," Stodden warned. Although such exploration is an important part of science, it naturally selects pleasant surprises. The statistical significance of a clinical trial, for example, can only be assessed if there was a clear protocol on record beforehand. Tracking the dead ends would require a much more sophisticated software environment, one that does not yet exist. In principle, such a platform could document the entire history and workflow of a project. Unlike the minutiae of physical and chemical experiments, data and code can be completely and cheaply captured, so software tools could lead the way to improving reproducibility across the entire scientific enterprise.
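The selection effect described above can be made concrete with a short calculation. The sketch below assumes independent tests on data that contain no real effect, each judged at the conventional 0.05 threshold; the threshold, the function name, and the numbers of analyses are illustrative assumptions, not figures from the researchers quoted here.

```python
def chance_of_false_positive(num_analyses, alpha=0.05):
    """Probability that at least one of `num_analyses` independent tests
    on pure noise falls below the significance threshold `alpha`."""
    return 1.0 - (1.0 - alpha) ** num_analyses


for k in (1, 5, 20, 50):
    chance = chance_of_false_positive(k)
    print(f"{k:>3} analyses tried: {chance:.0%} chance of a spurious 'finding'")
```

Under these assumptions, trying 20 independent analyses gives roughly a 64 percent chance that at least one looks significant by luck alone, which is why the unreported dead ends matter when judging a published result.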

For now, however, advocates are just hoping to improve the transparency of code. "It’s a serious collective-action problem," Stodden said. Only the highest-impact journals can impose new requirements without the risk of alienating authors, for example. "The progress is very slow," Stodden said, but "we’re getting there."

December 2015 Issue

Communications of the ACM (CACM) is now a fully Open Access publication.
