
Communications of the ACM

BLOG@CACM

Research Data Sustainability and Access


Microsoft Research Director Daniel Reed

How recently have you mounted a 9-track open reel tape, hoping to access the irreplaceable data that was the foundation of your first research paper? At this point, you may not even remember if it was 800, 1600 or 6250 bits per inch (bpi), EBCDIC or ASCII, blocked or unblocked. You are not that old, you say?  What about your 5.25” or 3.5” floppy disks or DAT archive? Odds are you haven’t accessed the data because you can’t without seeking the services of a conversion company that specializes in data retrieval from obsolete media.

Have you ever been involved in a research project, either individually or as part of a multi-institutional team, that produced data intended for broader community use?  If so, then you probably placed it on the project web site, perhaps with the research software needed to decode and process the data. A decade later, is the data still accessible and does the software even compile or execute on current systems?

Personally, I still have some 9-track computer tapes, a punched card deck and a paper tape in an office desk drawer, saved for both reasons of pedagogy and nostalgia. I also have some data analytics tools originally designed for workstations now found in the Computer History Museum.

These examples may seem quaint, and perhaps they are, but each of us has some variant of this data obsolescence and inaccessibility experience. They are the analog (pun intended) of our previous consumer media experiences.  After all, have you played any of your 45s, 8-track tapes, or cassettes lately?

These are the symptoms of three bigger issues: the rapid obsolescence of specific storage technologies, the explosive growth of research data across all scientific and engineering disciplines, and the even more difficult task of sustaining data access past the end of research projects.

The first is a natural consequence of technological change, one that we have collectively managed across the history of modern digital computing. The second has made “big data” a hot topic of research and business innovation, with new tools and techniques appearing to extract insights from large volumes of unstructured or ill-structured data. The third is different: the social and economic challenges around research data preservation are profound and not yet resolved.

In the U.S., the Office of Science and Technology Policy (OSTP) on behalf of the National Science and Technology Council (NSTC) recently issued a request for information (RFI) on Public Access to Digital Data Resulting from Federally Funded Scientific Research.  In addition, the National Science Foundation has instituted a requirement that research proposals include a data management plan, which notes that “Investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants.”

Several issues are convolved in our desire and expectations for data sharing.  One is advancing discovery and innovation via the free flow of information, replicating and expanding experiments and sharing data from increasingly expensive national and international scientific instrumentation. This is central to the scientific process, though it brings the cultural disparity that exists across disciplines to the fore – astronomy, biology and computing are culturally quite different.

The second is the need for multidisciplinary sharing and cross-domain fertilization.  Increasingly, new insights emerge from fusing and analyzing data drawn from diverse sources. Such integration places a premium on metadata schemas, well documented data formats and service access protocols.  In turn, these require standards and coordination, both within and across disciplines.

The third is the distinction between research, which produces data, and data preservation, documentation and dissemination. In my experience, the skills and expertise, as well as reward metrics, are distinctly different for the two activities. This is an oblique way of saying that researchers will generally optimize for research advancement over data preservation when forced to choose between the two.  This is only natural, given our current reward structure.

Research and data preservation also differ markedly in their timescales, for data preservation and dissemination services often require decadal planning, with associated infrastructure and professional staffing, rather than the 3-5 year funding for principal investigators, graduate students and post-doctoral research associates that is typical of research grants and contracts.

Finally, data preservation and dissemination can be expensive, with costs rivaling or exceeding the initial research investment. Quite clearly, not everything can or should be preserved in perpetuity, but predicting the future value of data is both difficult and perilous.

This suggests that we need economic and social processes that more rigorously assess the present and future value of data. They might include combinations of commercial, fee-based models, where researchers and organizations vote with their funds for access to and retention of certain data (i.e., a cloud-based research data marketplace), government funded and managed repositories where key data is retained (e.g., NIH’s GenBank), or distributed but interconnected archives funded by multiple agencies and governments (e.g., the Worldwide LHC Computing Grid).

What is clear is that the dramatic growth of research data, the collaborative and competitive nature of international science and engineering research, expectations for economic returns from research investments and disciplinary differences all make this a pressing and difficult problem.  Our current, ad hoc approaches are inadequate and not sustainable.


Comments


Anonymous

Interestingly, this should be more of a problem for life and physical sciences, rather than computer science. In the latter, the presence of public benchmarks usually allows us to experiment on common datasets; in the case of individual synthetic data, all we need to preserve are scripts for data generation.

The key factor is necessarily going to be incentivizing researchers to make their data publicly available in an appropriately accessible format. When deciding to fund research proposals, why not take into account the investigator's past record of public offerings, compared with the number of research papers published? On the plus side, this should encourage people to put more effort into releasing their data, in exchange for future funding. On the negative side, it would take a long time to validate its effectiveness.


Anonymous

"before" webcitation.org ? Yes, all of those anecdotal examples were a real concern. Now? Meaning, now that webcitation.org exists? Now I am not so scared.

...Now it is the job of the (maintainers of the) content on the server, to worry about whether migration to a new underlying medium (or format) is "advisable", and to do the necessary copying, "if any" -- (if appropriate); so IT'S NOT MY WORRY! (any longer). (Thank goodness!)

But IMHO that also means that, the need for an item (blog post) like this, is less. Sure, it is still needed for old material, from more than a couple of years ago. But, not for new material; and even for OLD content, it only has to be put on the web ONE TIME, and then it can be "archived" at webcitation.org , and ("bingo") it can now be thought of as being OK -- "similar" to, recent content! Case closed.


Anonymous

Thanks for the comments to date.

Regarding computer science versus the life and physical sciences, I think it depends on the subdiscipline. People working in data intensive areas (e.g., search, databases, information visualization) have different needs and concerns than those in other, less data intensive areas.

Regarding media migration, it is somewhat less of an issue than it once was, but it still matters. The National Archives, for example, struggles with this issue daily, as do other organizations that archive historical or longitudinal data for long periods. We have another form of transition as well, namely the migration across logical data formats and tools that still plagues us.

Dan Reed

