The Web has emerged as the largest distributed information repository on the planet. Human knowledge is captured on the Web in various digital forms: Web pages, news articles, blog posts, digitized books, scanned paintings, videos, podcasts, lyrics, speech transcripts, and so forth. Over the years, services have emerged to aggregate, index, and enable the rapid searching of all this digital data, but the full meaning of that data may only be interpretable by humans. In the common case, machines are incapable of understanding or reasoning about the vast amounts of data available on the Web. They are not able to interpret or infer new information from the data, and this has been a topic of active research interest for decades within the artificial intelligence community. While the dream of artificial intelligence (machines capable of human-level reasoning and understanding) may still not be within our grasp, we believe semantic technologies hold the promise that machines will be able to meaningfully process, combine, and infer information from the world's data in the not-too-distant future (see Figure 1).
The Web ecosystem of simple formats and protocols is an example of how we can effectively manage, share, access, and represent large amounts of data. Companies like Microsoft and Google are building large-scale services (such as search and cloud services) leveraging the existing hardware and software infrastructures. Schema languages, XML, Entity Data Models, Microformats, RSS, Atom, RDF (see http://www.w3.org/RDF/), OWL (see http://www.w3.org/2007/OWL/), and other technologies are being used to capture the information in data, while machine learning, entity extraction, neural networks, clustering, and latent semantics are approaches to extracting information from that data and helping to reason about it. The field is an active area of research and experimentation and is still rapidly evolving (see the sidebar "Semantic Computing" vs. "Semantic Web").
At the center of our discussion is the concept of a "data mesh," a term we use to refer to the various information and knowledge representation techniques/technologies that have been developed over the years (see Figure 2). In its simplest form, a data mesh looks like a directed graph in which the nodes represent data/information captured in well-known formats and the edges capture a relationship, characterized by a predicate and perhaps other information, between the linked data. For example, "Jane listens to Santana every day" is a relationship in which "Jane" (the subject) and "Santana" (the object) are the nodes, "listens to" is the edge (the predicate), and "every day" is an attribute of the edge. Other tuples could add further information to the data mesh (for example, "Santana is an artist," "Santana plays the guitar," "Santana makes music," "Jane met Santana in 1995," and so forth). The Semantic Web's RDF is one, but not the only, technology that can be used to represent such graphs or knowledge bases. Indeed, Cyc,4 Semantic Networks, WordNet,7 and MultiNet6 are examples of other such technologies/approaches. Scaling to the same level as the Web remains a challenge for these approaches.
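The directed-graph view described above can be illustrated with a minimal sketch. The `Edge` and `DataMesh` names are hypothetical and chosen only for this illustration; they do not correspond to RDF or to any of the technologies named above. The sketch simply stores subject, predicate, and object together with optional attributes on the edge.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """A directed, labeled edge: subject --predicate--> object."""
    subject: str
    predicate: str
    obj: str
    # Optional qualifiers on the edge itself, e.g. (("frequency", "every day"),).
    attributes: tuple = ()


class DataMesh:
    """A toy in-memory directed graph of facts (illustrative only)."""

    def __init__(self):
        self._out = defaultdict(list)  # subject -> list of outgoing edges

    def add(self, subject, predicate, obj, **attributes):
        edge = Edge(subject, predicate, obj, tuple(sorted(attributes.items())))
        self._out[subject].append(edge)
        return edge

    def facts_about(self, subject):
        """All edges that start at the given subject node."""
        return list(self._out.get(subject, []))

    def objects(self, subject, predicate):
        """Object nodes reachable from subject through the given predicate."""
        return [e.obj for e in self.facts_about(subject) if e.predicate == predicate]


mesh = DataMesh()
mesh.add("Jane", "listens to", "Santana", frequency="every day")
mesh.add("Santana", "is a", "artist")
mesh.add("Santana", "plays", "guitar")
mesh.add("Santana", "makes", "music")
mesh.add("Jane", "met", "Santana", year=1995)

print(mesh.objects("Santana", "plays"))
print(mesh.facts_about("Jane"))
```

Keeping attributes on the edge, rather than on either node, is what lets "every day" qualify the relationship without saying anything about Jane or Santana individually.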
We believe there is an opportunity to involve users, who are now as much producers as consumers of information on the Web, and not just the very few experts, in producing structured data at Web scale. Recent success stories in the application of knowledge representation to specific domains, as in the myGrid (see http://www.mygrid.org.uk/) work in BioInformatics research,10 demonstrate the potential benefits of semantic computing technologies. Here, we use the term "data mesh" to encompass the various concepts and approaches that could be used or combined to support a semantics-rich ecosystem of research tools and services. It is not our intention to suggest there would be one single data mesh that would represent all human knowledge.
We expect a great number of vocabularies to emerge, many of which will overlap, for representing every aspect of a data mesh (such as geo-location, mood, reviews, personal information, domain-specific concepts and terms). Ontologies will support an evolving ecosystem of facts, vocabularies, and relationships in specific domains. We are already witnessing a plethora of emerging efforts to standardize on such vocabularies, such as microformats (http://www.microformats.org/), data portability (http://www.dataportability.org/), gene ontology (http://www.geneontology.org/), and others.
Programs will consume, combine, and correlate everything in the universe of structured information and help users reason over it. These programs will allow users to ask questions against this (global) collection of facts (information access policies permitting), such as "Which is the most popular book among my friends today?," "Who is the expert on aspect A of my business workflow inside my organization?," "Have Evelyne and Savas ever been at the same conference at the same time?," "What's the degree of separation in terms of citations between my paper and the seminal work by Jim Gray?" and so on.
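To make one of these questions concrete, the "degree of separation in terms of citations" query can be sketched as a breadth-first search over a citation graph. This is only an illustration under stated assumptions: the paper identifiers and the `cites` mapping below are made-up sample data, and the function name is hypothetical.

```python
from collections import deque


def degrees_of_separation(cites, start, target):
    """Length of the shortest citation chain from start to target.

    `cites` maps a paper to the list of papers it cites.
    Returns None when no chain of citations connects the two papers.
    """
    if start == target:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        paper, depth = queue.popleft()
        for cited in cites.get(paper, ()):
            if cited == target:
                return depth + 1
            if cited not in seen:
                seen.add(cited)
                queue.append((cited, depth + 1))
    return None


# Hypothetical citation data: each key cites the papers in its list.
cites = {
    "my-paper": ["survey-2007", "workshop-note"],
    "survey-2007": ["systems-paper"],
    "workshop-note": [],
    "systems-paper": ["seminal-paper"],
}

print(degrees_of_separation(cites, "my-paper", "seminal-paper"))
```

Breadth-first search visits papers in order of distance, so the first time the target is reached the chain is guaranteed to be a shortest one.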
While data mesh instances can be built in isolation (as in many of today's social networks), we believe the potential value of aggregating all of them and combining them in one huge network of facts is tremendous. This idea is similar to Tim Berners-Lee's more recent rhetoric around the 'Giant Global Graph of Facts' (see http://dig.csail.mit.edu/breadcrumbs/node/215). Please note that we are not suggesting there would be a single repository of facts or that there would even be universal agreement on what is represented. We do expect, however, to see machine-based technologies that would be able to reason, often using probabilistic techniques, over the diverse set of facts.
We are already witnessing the emergence of data mesh instances on the Web, especially as they relate to social networks. The Zune Social (http://social.zune.net/) is an example of how a social network can be combined with information about music preferences, recommendations, and an online marketplace. Facebook (http://www.facebook.com/) is another example of how connections between identities can help in aggregating user-oriented preferences and then inferring behavior and preference statistics. Finally, Powerset (http://www.powerset.com/) is an example of a search service that leverages existing structured information (for example, Freebase; http://www.freebase.com/) or generates it from unstructured sources (such as by applying natural language processing technologies to Wikipedia content) to improve the quality of the query results.
We believe that over time, a huge ecosystem of services and tools will emerge around data mesh instances. Such tools and services will allow us to move beyond the current practice of information management by incorporating more automation. Recommendation engines will be the norm and our interactions with computers will always be context-aware (for example, "since the topic of the paper being written is about botany, a query about 'bush' is unlikely to be about a person's name" or "the search about papers on orchids will take into consideration the opinion of people in the user's professional social network"). While today we can search for information over the global graph of linked Web pages consisting of predominantly unstructured data, in the future we will be able to search over all types of semantically enriched information, which will in turn enable a wide range of new applications to emerge, such as recommendation services, information management automation, information inferencing, and so forth.
We believe the research community will play a central role in supporting and further evolving the semantic computing vision. We should not only become early adopters of semantic computing technologies and infrastructure in our research projects, but we should also actively develop and evolve them. At Microsoft Research we are taking some first steps toward this vision by investing in projects that can demonstrate the benefits of semantic computing technologies in research. We are therefore attempting to build an ecosystem of research tools and services as demonstrations of these ideas and concepts.
We focus here on the role of the researcher as an "extreme information worker," meaning a technology user with expectations and requirements at a scale not yet required by the business community. We believe information representation, management, and processing tools in combination with automation technologies will greatly help researchers in their work. We are therefore taking small steps toward developing semantics-aware tools and services. Here, we describe some of the work we are doing in supporting the scholarly communications life cycle through semantic computing technologies.
Semantic Annotations and Metadata in Word. The authoring stage is perhaps the best time to capture an author's intentions and to record the meaning of the words as they are being written. Natural language may not always be adequate to convey the meaning of a word or an expression, especially in the scientific world. In many disciplines, domain-specific ontologies are therefore being created by experts to address this issue, but they have not so far been incorporated into productivity tools like Microsoft Office.
In collaboration with Phil Bourne and Lynn Fink at the University of California, San Diego, we worked toward a plug-in for Word 2007 (part of the BioLit project; http://biolit.ucsd.edu/) that allows authors to annotate words or sentences with terms from an ontology (for example, Gene Ontology; http://www.geneontology.org/). The annotations are stored as part of the Office Open XML (OOXML) representation of the document (OOXML has been accepted as an ISO standard; more information can be found at http://openxmldeveloper.org/). Tools and services can now extract the annotations by just opening the OOXML package, without human intervention and without even needing Word to be installed. As a result, documents can be better categorized, indexed, and searched, with the author's intent always closely associated with the text.
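The idea of extracting annotations by just opening the package can be sketched as follows. An OOXML package is a ZIP container, so standard ZIP and XML libraries are enough to read it without Word. Everything about the annotation format here is an assumption made for illustration: the `urn:example:annotations` namespace, the `annotation` element and its attributes, and the sample values are hypothetical, and the format used by the actual plug-in may differ. The sketch builds a tiny stand-in package in memory rather than a valid Word document.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# Hypothetical annotation namespace; the real plug-in's format may differ.
ANNOTATION_NS = "urn:example:annotations"


def build_sample_package():
    """Create a minimal ZIP container imitating the layout of an OOXML package."""
    annotations = (
        f'<annotations xmlns="{ANNOTATION_NS}">'
        '<annotation text="apoptosis" ontology="GO" term="GO:0006915"/>'
        "</annotations>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", "<document/>")
        package.writestr("customXml/item1.xml", annotations)
    buffer.seek(0)
    return buffer


def extract_annotations(package_file):
    """Collect (text, ontology, term) triples from XML parts under customXml/."""
    found = []
    with zipfile.ZipFile(package_file) as package:
        for name in package.namelist():
            if not (name.startswith("customXml/") and name.endswith(".xml")):
                continue
            root = ET.fromstring(package.read(name))
            for node in root.iter(f"{{{ANNOTATION_NS}}}annotation"):
                found.append((node.get("text"), node.get("ontology"), node.get("term")))
    return found


annotations = extract_annotations(build_sample_package())
print(annotations)
```

The same `extract_annotations` function could be pointed at a file path instead of an in-memory buffer, which is how an indexing service might process documents in bulk.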
The ability to easily annotate terms from within Word is a first step in producing documents that semantically relate to the body of knowledge in a domain. In this way, information can easily become part of a data mesh as it is being generated (see Figure 3). The source code for the plug-in is now available as open source (see http://ucsdbiolit.codeplex.com/) for the community to extend further or to use as the basis for a new generation of semantics-oriented authoring tools.
Chemistry in Word. We are investigating, in collaboration with Peter Murray-Rust, Jim Downing, and Joe Townsend from the University of Cambridge, the introduction of chemistry drawing functionality into Word documents (see http://research.microsoft.com/en-us/projects/chem4word/). Rather than just having images of chemical structures, we would like to preserve the chemistry-related semantics in a machine-processable manner. For that reason, we are using the Chemistry Markup Language (CML) in our investigations; instances of CML would be embedded inside OOXML documents. We believe an ecosystem of chemistry-related tools and services can then emerge to enable the automatic processing of documents, making the authoring process an easy but increasingly valuable part of the research life cycle.
As an example, consider the water molecule (H2O). In a Word document, it appears as a series of characters, one of which is a subscript. Through this project, it will be possible to also store the structured representation of water so that programs can discover it. Figure 4 shows how some part of a document can be identified as chemistry. The tool will automatically save the CML representation of the identified region (1D and 2D representations and authoring functionality will also be supported; see Figure 5). The use of a semantically rich data format to represent domain-specific information is another step in producing structured data that automatically becomes part of the data mesh.
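To illustrate what a machine-processable representation of water makes possible, here is a minimal sketch that reads a small hand-written CML fragment and recovers the atoms and bonds. The fragment is an assumption for illustration (it is not output from the tool described above), and the helper function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

CML_NS = "http://www.xml-cml.org/schema"

# A hand-written CML fragment for water, used only as sample input.
WATER_CML = f"""
<molecule xmlns="{CML_NS}" id="water">
  <atomArray>
    <atom id="a1" elementType="O"/>
    <atom id="a2" elementType="H"/>
    <atom id="a3" elementType="H"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a1 a3" order="1"/>
  </bondArray>
</molecule>
""".strip()


def element_counts(cml_text):
    """Count atoms per chemical element in a CML molecule."""
    root = ET.fromstring(cml_text)
    return Counter(atom.get("elementType") for atom in root.iter(f"{{{CML_NS}}}atom"))


def bonds(cml_text):
    """Return each bond as a pair of atom identifiers."""
    root = ET.fromstring(cml_text)
    return [tuple(b.get("atomRefs2").split()) for b in root.iter(f"{{{CML_NS}}}bond")]


print(element_counts(WATER_CML))
print(bonds(WATER_CML))
```

A program that sees only the characters "H2O" has to guess that they denote a molecule; with the structured form, the composition and connectivity are stated explicitly.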
Zentity: A Repository Platform. The need for quality, well-engineered, and documented software infrastructures to support institutional repositories, archives, and digital libraries is increasing, especially in the context of the global initiative toward Open Access.2,5 We have developed a platform, called Zentity, to support repository systems based on product-quality technologies like SQL Server, .NET 3.5, and the Entity Framework (see http://research.microsoft.com/en-us/projects/zentity/).
The Zentity platform supports a graph-based representation of the data in a repository. It provides an easy-to-use application programming interface that abstracts the use of the underlying relational system to manage digital resources and the relationships between them and creates a data mesh. Initially, the platform will be targeted toward the "research output repository" domain, offering a data model capable of capturing the research-related resources (for example, papers, reports, theses, presentations, and data sets) of an organization. However, Zentity has been designed to support the data models of arbitrary domains (for example, museums, art collections, research data, and so forth).
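The general pattern of a graph-style interface over a relational store can be sketched as follows. This is not the Zentity API, which is built on SQL Server and .NET; it is a toy illustration using an in-memory SQLite database, and the `Repository` class, table names, predicates, and sample titles are all hypothetical.

```python
import sqlite3


class Repository:
    """A toy graph-style facade over relational tables (not the Zentity API)."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.executescript(
            """
            CREATE TABLE resource (
                id INTEGER PRIMARY KEY,
                kind TEXT NOT NULL,
                title TEXT NOT NULL
            );
            CREATE TABLE relationship (
                subject_id INTEGER NOT NULL REFERENCES resource(id),
                predicate TEXT NOT NULL,
                object_id INTEGER NOT NULL REFERENCES resource(id)
            );
            """
        )

    def add_resource(self, kind, title):
        """Insert a node and return its identifier."""
        cursor = self.db.execute(
            "INSERT INTO resource (kind, title) VALUES (?, ?)", (kind, title)
        )
        return cursor.lastrowid

    def relate(self, subject_id, predicate, object_id):
        """Insert a directed, labeled edge between two resources."""
        self.db.execute(
            "INSERT INTO relationship VALUES (?, ?, ?)",
            (subject_id, predicate, object_id),
        )

    def related(self, subject_id, predicate):
        """Titles of resources linked from subject_id by the given predicate."""
        rows = self.db.execute(
            """
            SELECT r.title FROM relationship AS rel
            JOIN resource AS r ON r.id = rel.object_id
            WHERE rel.subject_id = ? AND rel.predicate = ?
            ORDER BY r.id
            """,
            (subject_id, predicate),
        )
        return [title for (title,) in rows]


repo = Repository()
paper = repo.add_resource("paper", "Orchid pollination survey")
data = repo.add_resource("data set", "Field observations")
slides = repo.add_resource("presentation", "Survey overview")
repo.relate(paper, "is supported by", data)
repo.relate(paper, "is presented in", slides)

print(repo.related(paper, "is supported by"))
```

Callers work with resources and relationships only; the SQL and the join stay hidden behind the facade, which is the kind of abstraction the paragraph above describes.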
Interoperability is a major focus of the project and we are implementing support for popular Semantic Web technologies, like RDF and RDFS. We are also building a number of tools and services to operate against the data mesh created and we hope that more will be developed by the community as we make the platform freely available.
In addition to supporting the repository community through product-quality technologies, our work on Zentity attempts to demonstrate some of the principles of data meshes. The data model employed for the implementation promotes the graph representation principles discussed earlier and illustrated in Figure 2.
Social Networking and Data Meshes. As a final example, we examine the relationship between social networks and other structured information. We consider the former a special case of a data mesh (in which the nodes are people and the edges between them represent human relationships, such as "friend" or "colleague"). A social network can provide context for the interactions between people (for example, "the botany domain-specific social network"); it can be used to infer information about a community (for example, "the botany community has been actively looking at orchids over the last week"); it can be used to provide recommendations (for example, "the most read article about orchids can be found at location X"); it can be used to change or supplement the way peer reviewing is done and merit is given (for example, "the article posted online on orchids last week has received great reviews from the experts of the botany community").
One can therefore imagine an ecosystem of tools and services that takes advantage of the relationships between researchers in the context of a particular research discipline, collaboration or research project and their activities, documents, and opinions. Typical research-related processes could be augmented or even completely supplanted. For example, researchers could automatically get recommendations of papers and contacts based on what they are currently doing; experts might be automatically identified in a domain based on discussions around their papers and blog entries; peer reviewing could evolve to take into consideration the new social media and Web-based interactions; and even 'impact factors' for institutions might incorporate electronic analysis of all types of information and not just citations to publications and research grants.
Our support of the myExperiment (http://www.myexperiment.org/) project is a demonstration of our belief that scientific collaboration and information sharing can be supported through social networking. The myExperiment project brings together social networks and workflows in a single information graph (a data mesh) that can be browsed, analyzed, and searched.
As researchers and scientific instruments can now produce and publish large amounts of data and information more easily than at any other point in history, there is an increasing requirement for automation tools to help manage and navigate the deluge of research data. For example, projects like Pan-STARRS (http://pan-starrs.ifa.hawaii.edu/) and the LHC (http://lhc.web.cern.ch/) will generate many petabytes of data. The emergence of folksonomies on the Web is one example of how user-driven categorization can help with information discovery. The need to deal with meaningful and relevant information within the context of one's actions is growing. There is an immense opportunity for the research community to bring its expertise and experience together in accelerating the development of semantic computing technologies. We need to invest significant resources in making the semantic computing vision a reality.
The discussion on data meshes shows the potential value of aggregating information in a (semi-)structured, machine-interpretable manner. We believe an ecosystem of desktop tools, cloud services, and data formats will emerge to support "information and knowledge management," namely, the (automatic) acquisition, representation, aggregation, indexing, discovery, consumption, correlation, management, and inference of information. Doing so at scale would significantly improve the way we discover and share information and how we collaborate.
We have described a representative set of investments we are making to ease the transition of researchers toward a world where information is produced and consumed in a structured and semantics-rich manner (more information about the work and research tools offered by Microsoft Research for scientists can be found at http://research.microsoft.com/en-us/collaboration/about/). However, this will not happen instantly. There is a lot of unstructured data out there already. Data-mining technologies are necessary to automatically extract as much semantically rich information as possible. For example, Microsoft's Live Labs has worked on machine learning-based technologies to extract entities from the unstructured Web (see http://livelabs.com/projects/entity-extraction/). The research world needs similar technologies to be deployed at scale that can aggregate, index, and mine research-related information.
We believe such an ecosystem of semantics-aware tools and services will ultimately become the norm in our day-to-day interactions with computers, constituting a global "smart cyberinfrastructure." However, if the big companies are to invest in implementing these ideas and technologies in their offerings (products and services), the research community must test and demonstrate their potential as part of the community's attempt to build a smart cyberinfrastructure for research. Ultimately, this vision of a data mesh and smart cyberinfrastructure will go some way toward realizing the visions of the early pioneers like Vannevar Bush3 and J.C.R. Licklider.8
3. Bush, V. As we may think. The Atlantic Monthly (1945); www.theatlantic.com/doc/194507/bush
4. Cycorp. Cyc Knowledge Base; www.cyc.com/
Figure 1. Data, information, knowledge: While we are good at data management at scale (for example, Google, Amazon) we are still far away from supporting information representation and reasoning. Knowledge management at scale is a great opportunity for innovation.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2009 ACM, Inc.