There exists no snappy term—no perfect storm, no big bang, not even a Kuhnian paradigm shift—but computational approaches to analyzing the entire ecosystem around scientific discovery are converging upon what may be a new era of deep knowledge mining. And for one veteran researcher, the shift is happening none too soon.
“If we’re interested in science, the paper is nice, but the data is nicer,” says Michael Conlon, chief operating officer of the University of Florida’s Clinical and Translational Science Institute. “The publication processes have to change. The publication business was developed for paper, with certain economies and a lot of characteristics related to publishing a five-page paper, putting it in an envelope, and mailing it to people.”
Conlon is the principal investigator of VIVO, a cross-institutional, open source platform funded by a $12 million grant from the U.S. National Institutes of Health. VIVO is meant to connect the research community, via formal Semantic Web-compliant ontology and description technologies, by providing biographical information and links to research for scientists at participating institutions. It is just one of a very active cohort of research projects, ranging from language-based analytics of phrases that have proved influential over time to network-based analysis of formally classified paper topics that might help predict where disparate fields may merge in the future. And the utility of these resources is expected to extend well beyond the scientific research community itself, to the public funding agencies and philanthropic foundations betting on the next generation of breakthroughs.
Quantifying Data
Both Conlon and James Evans, associate professor of sociology at the University of Chicago, believe that enhancing computational resources around the community of scientists (that is, finding a way to quantify who is working on what projects, where they are doing it, and how their work may be related to similar projects elsewhere) will yield greater efficiency in the scientific process and, ultimately, better science itself.
Not only is there nuance behind the empiricism of science that cannot be captured in a published paper, Conlon says, there is also nuance beyond the paper, perhaps known only to a few members of a similar small community.
“I know the people who did this, I know their limitations, I know their physical setting—I know many things that aren’t in the paper,” he says. Conlon believes finding a way to quantify the data supporting published works, as well as the works themselves, and then somehow linking that data to similar work being done elsewhere, will be critical to advancing science.
“How can we expect to know things and pass on knowledge if you’re claiming so much of that knowledge is in your head?” asks Conlon.
Conlon says platforms such as VIVO can help scientists more easily and quickly gain access to that type of “in your head” data not only to read papers more accurately, but also to test reported research results.
“If somebody reports a result in a paper, can I get the same results? For science to work, you have to believe you can,” he says. “If you can’t get the same result, what kind of science is that?” Without Web-enabled cross-referencing, Conlon says trying to find that kind of data is “a bit of Russian doll problem. You take apart the paper and get down to some bare bones, and it’s not enough, and then say I have to call them, and you just keep going to get to the level of specificity needed to reproduce the science. And in the computational disciplines, this is particularly frustrating.”
As of January 2011, making supporting data available to other scientists is not just a good idea. In the U.S., the National Science Foundation (NSF) has adopted a mandatory data management plan that states “investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered in the course of the work under NSF grants.” The VIVO platform, Conlon says, is an ideal vehicle for this task, and now offers an easy-to-use bookmarklet tool called VIVO Searchlight. When a reader finds a pertinent passage in a VIVO-compliant paper, he or she can highlight the passage and click on the Searchlight icon in the bookmark bar. The application then shows, on the same page, links to other researchers working on similar topics.
Tools such as VIVO Searchlight are likely to accelerate the process by which the information surrounding published works, what Evans and co-author Jacob Foster called metaknowledge in a Science paper published in February 2011, is processed by machines and interpreted by several communities, including researchers, funding agencies, and philanthropic foundations. Ideally, the analysis may lead to a more empirical assessment of ideas and allocation of resources.
Part of what metaknowledge is about, Evans contends, is modeling the social context around science to find “diamonds in the rough”; currently, he says, those in the center of the system—that is, researchers at institutions with a reputation for scientific success—have ideas that get amplified much more quickly than those in smaller regional universities and private research foundations.
“And I think there is an interest among reviewers, and certainly among governments, to find ways to counter-balance the biases that are inherent to taking seriously the status order of academic science,” says Evans. “I think there’s a market for funding underdogs, but there’s not a method for doing it.”
Connecting Ideas
There are classification schemes beyond VIVO that can help scientists and funding agencies predict possible sweet spots of experimentation. For example, as part of a post-doctoral project, researchers Mark Herrera, David Roberts, and Natali Gulbahce built an “idea network” model using the American Physical Society’s Physics and Astronomy Classification Scheme (PACS) to see if they could predict where discrete fields may someday merge. By using the PACS scheme and a community-finding algorithm, they were able to build a model showing where papers in various fields of physics had built connections.
“The idea was predicting what fields are coming closer in this idea space, which fields are going to merge in the future,” Gulbahce says. She believes a model that allows this sort of insight might also give both researchers and funding agencies clues as to where high-payoff, “low-hanging fruit” concepts of these newly connected fields might be, as well as what may be complex problems that will take years to solve.
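The idea-network approach described above can be illustrated with a toy example. This is a minimal sketch, not the researchers' actual method: the classification codes are invented placeholders rather than real PACS entries, and connected components over sufficiently strong links stand in for whatever community-finding algorithm the team used. Two codes are linked whenever they appear on the same paper, and the link weight counts how many papers share them.

```python
from itertools import combinations

# Hypothetical toy data: each paper is the set of classification codes it carries.
# The codes are placeholders for illustration, not real PACS entries.
papers = [
    {"A1", "A2", "A3"},
    {"A1", "A2", "A3"},
    {"B1", "B2", "B3"},
    {"B1", "B2", "B3"},
    {"A3", "B1"},  # a single paper bridging the two groups
]


def build_idea_network(papers):
    """Link two codes whenever they share a paper; weight = number of shared papers."""
    weights = {}
    for codes in papers:
        for a, b in combinations(sorted(codes), 2):
            weights[(a, b)] = weights.get((a, b), 0) + 1
    return weights


def communities(weights, min_weight=1):
    """Group codes into connected components, following only links
    whose weight is at least min_weight (a simple stand-in for a
    real community-finding algorithm)."""
    neighbors = {}
    for (a, b), w in weights.items():
        # Register both codes as nodes even if their link is too weak to follow.
        neighbors.setdefault(a, set())
        neighbors.setdefault(b, set())
        if w >= min_weight:
            neighbors[a].add(b)
            neighbors[b].add(a)

    seen, groups = set(), []
    for start in sorted(neighbors):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            stack.extend(neighbors[node] - seen)
        groups.append(group)
    return groups


weights = build_idea_network(papers)
strong = communities(weights, min_weight=2)  # bridge ignored: two separate groups
weak = communities(weights, min_weight=1)    # bridge followed: one merged group
```

In this sketch the two groups stay separate while only one paper bridges them, and merge once the threshold is lowered. Tracking how such bridge weights grow from year to year is one plausible way to read fields "coming closer" in idea space.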
However, their work also exposed a current shortcoming—the near-total lack of inter-field metadata and even intra-field metadata disparities.
Gulbahce says her team intended to use the Inspec bibliographic database, which includes computer science and engineering topics, as well as physics, instead of PACS, but found it not hierarchical enough and “too messy” (keywords are not always thoroughly assigned, she says). The 1966 Aerospace Systems Conference record, for example, includes 132 conference papers, yet the entire record has just two keywords: C0000, “General and management topics,” and C3360L, “aerospace control.”
The PACS and Inspec schemes also depend on different approaches to indexing keywords; Inspec uses experts in the field and PACS uses authors to define them. This disparity, Gulbahce says, is indicative of the difficulty in creating comprehensive research databases for either reference or predictive uses.
One possible route to classifying those disparate databases and texts could be the Dublin Core Metadata Initiative’s (DCMI’s) 15-element resource description set, according to Jane Greenberg, professor of information and library science at the University of North Carolina, who has been active in DCMI’s Science and Metadata Community since its inception in 2008.
“That could be one role for Dublin Core, because it’s not a highly technical schema,” she says. “On the other hand, almost every formal metadata schema that exists for scientific data maps to the Dublin Core at some level.”
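Greenberg's point about mapping can be sketched in a few lines. The 15 element names below are those of the Dublin Core Metadata Element Set. Everything else is an assumption made for illustration: the two source schemas, their field names, and the sample records are hypothetical and do not describe the real PACS or Inspec record formats.

```python
# The 15 elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = (
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
)

# Hypothetical field mappings for two differently structured source schemas.
SCHEMA_A_MAP = {"authors": "creator", "paper_title": "title",
                "codes": "subject", "year": "date"}
SCHEMA_B_MAP = {"au": "creator", "ti": "title",
                "classification": "subject", "py": "date"}


def to_dublin_core(record, field_map):
    """Translate a source record into Dublin Core elements.

    Source fields with no mapping are dropped; every value is stored
    as a list of strings so repeated elements are handled uniformly.
    """
    out = {}
    for field, value in record.items():
        element = field_map.get(field)
        if element is None:
            continue  # no Dublin Core equivalent declared for this field
        if element not in DUBLIN_CORE_ELEMENTS:
            raise ValueError(f"{element!r} is not a Dublin Core element")
        values = value if isinstance(value, list) else [value]
        out.setdefault(element, []).extend(str(v) for v in values)
    return out


# Two hypothetical records describing the same paper in different schemas.
record_a = {"authors": ["Doe, J."], "paper_title": "Example",
            "codes": ["X1"], "year": 2010, "internal_id": 7}
record_b = {"au": "Doe, J.", "ti": "Example",
            "classification": ["X1"], "py": "2010"}

dc_a = to_dublin_core(record_a, SCHEMA_A_MAP)
dc_b = to_dublin_core(record_b, SCHEMA_B_MAP)
```

Once both records are expressed in the shared element set they compare as equal, which is the kind of cross-database matching a common, non-technical schema is meant to allow. Detail that has no Dublin Core equivalent, such as `internal_id` here, is lost in the translation.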
Topic Modeling
Researchers are increasingly interested in going beyond formal definitional data to deduce influential past texts and possible future directions of scientific fields. Natural language processing technology, for instance, is yielding insight into unstructured data in existing bodies of work. David Blei, associate professor of computer science at Princeton University, recently co-authored two papers using topic modeling in assessing the influence of ideas within the texts of scientific corpora. Blei found a definite correlation between terms within papers that influenced fields and the number of citations those papers garnered in subsequent research. While his existing work is retrospective—after all, he says, one cannot observe the language of future documents—Blei says he is working on extending his models into predictive ones.
“The next step to this model would be to build in some higher-level patterns (‘Some words take on this kind of envelope and some take on that kind of envelope’) and once you do that, you can recognize words that seem to be on the rise and those that seem to be on the fall,” Blei says. “That way you can think about second-order dynamics, and then maybe you can make a prediction. We’re working on that kind of thing now. It’s a simple change to the model, but a big change conceptually. We’re not going to claim we can predict the future of science.”
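The notion of words "on the rise" or "on the fall" can be made concrete with a deliberately simple sketch. It is not Blei's topic model: it assumes a word's frequency has already been counted for each time period, then uses finite differences, with the first difference as a slope and the second difference as a rough proxy for second-order dynamics. The function names, the tolerance parameter, and the sample counts are all hypothetical.

```python
def differences(series):
    """Return the successive differences of a numeric sequence."""
    return [b - a for a, b in zip(series, series[1:])]


def classify_trend(freqs, tol=1e-9):
    """Label a word's usage as rising, falling, or flat.

    freqs: the word's frequency in consecutive time periods.
    Returns (label, mean slope, mean curvature), where slope is the
    average first difference and curvature the average second difference.
    """
    if len(freqs) < 2:
        raise ValueError("need at least two time periods")
    first = differences(freqs)
    second = differences(first)
    slope = sum(first) / len(first)
    curvature = sum(second) / len(second) if second else 0.0
    if slope > tol:
        label = "rising"
    elif slope < -tol:
        label = "falling"
    else:
        label = "flat"
    return label, slope, curvature


# Hypothetical counts per 1,000 words over four periods.
rising = classify_trend([1, 2, 4, 7])   # grows faster each period
falling = classify_trend([9, 7, 6, 6])  # declines, but the decline is slowing
steady = classify_trend([3, 3, 3])
```

A positive curvature on a falling word, as in the second example, suggests the decline is levelling off. Extrapolating from such signals is exactly the step Blei is cautious about.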
Perhaps the most dynamic area in which Blei’s approach may find use is analyzing unstructured data, such as Twitter feeds about illness trends. Finding persistent linguistic patterns in such data may help scientists narrow their focus of more formal experimentation and response.
“Drawing conclusions from observational data is a dicey business,” says Blei. But he also believes persistent patterns revealed through this type of data may be helpful in narrowing down areas where more rigorous methods of analysis can be better put into place.
“You can use the text as a noisy signal for the truth,” he says. “You want to find nuggets of repeated patterns suggestive of something that could lead to the next controlled experiment.”
Further Reading
Bettencourt, L., Kaiser, D., Kaur, J., Castillo-Chávez, C., and Wojick, D.
Population modeling of the emergence and development of scientific fields, Scientometrics 75, 3, May 2008.
Evans, J.A. and Foster, J.G.
Metaknowledge, Science 331, 6018, Feb. 11, 2011.
Gerrish, S. and Blei, D.
A language-based approach to measuring scholarly impact, Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, June 21–24, 2010.
Herrera, M., Roberts, D.C., and Gulbahce, N.
Mapping the evolution of scientific fields, PLoS ONE 5, 5, May 2010.
Jensen, S. and Plale, B.
Trading consistency for scalability in scientific metadata, Proceedings of the 2010 IEEE International Conference on e-Science, Brisbane, Australia, Dec. 7–10, 2010.