Let us begin this review by defining the subject matter. The term Semantic Web as used in this article refers to a field of research rather than a concrete artifact—in a similar way as, say, Artificial Intelligence denotes a field of research rather than a concrete artifact. A concrete artifact that may deserve to be called “The Semantic Web” may or may not come into existence someday, and indeed some members of the research field may argue that part of it has already been built. Sometimes the term Semantic Web technologies is used to describe the set of methods and tools arising out of the field, in an attempt to avoid terminological confusion. We will come back to all this in the article in some way; however, the focus here is to review the research field.
This review will be rather subjective, as the field is very diverse not only in methods and goals being researched and applied, but also because the field is home to a large number of different but interconnected subcommunities, each of which would probably produce a rather different narrative of the history and the current state of the art of the field. I therefore do not strive to achieve the impossible task of presenting something close to a consensus—such a thing still seems elusive. However, I do point out here, and sometimes within the narrative, that there are a good number of alternative perspectives.
The review is also very selective, because the Semantic Web is a rich field of diverse research and applications, borrowing from many disciplines within or adjacent to computer science. In a brief review like this, one cannot possibly be exhaustive or give due credit to all important individual contributions. I do hope I have captured what many would consider key areas of the Semantic Web field. For the reader interested in obtaining a more detailed overview, I recommend perusing the major publication outlets in the field: the Semantic Web journal,a the Journal of Web Semantics,b and the proceedings of the annual International Semantic Web Conference.c This is by no means an exhaustive list, but I believe it to be uncontroversial that these are the most central publication venues for the field.
Now that we understand that Semantic Web is a field of research, what is it about? Answers to this question are again necessarily subjective as there is no clear consensus on this in the field.d
One perspective is that the field is all about the long-term goal of creating The Semantic Web (as an artifact) together with all the tools and methods required for its creation, maintenance, and application. In this particular narrative, The Semantic Web is usually envisioned as an enhancement of the current World Wide Web with machine-understandable information (as opposed to most of the current Web, which is mostly targeted at human consumption), together with services—intelligent agents—utilizing this information. This perspective can be traced back to a 2001 Scientific American article,1 which arguably marks the birth of the field. Machine-understandable information is provided, in this case, by endowing data with expressive metadata. In the Semantic Web, this metadata generally takes the form of ontologies, or at least of a formal language with a logic-based semantics that admits reasoning over the meaning of the data. (Formal metadata is discussed later.) This, together with the understanding that intelligent agents would utilize the information, casts the Semantic Web field as having a significant overlap with the field of Artificial Intelligence. Indeed, most of the major artificial intelligence conferences held in the last 20 years ran explicit “Semantic Web” tracks.
An alternative and perhaps more recent perspective on the question of what the field is about rests on the observation that the methods and tools developed by the field have applications that are not tied to the World Wide Web, and that can provide added value even without intelligent agents utilizing machine-understandable data. Indeed, early industry interest in the field, which was substantial from the very outset, was aimed at applying Semantic Web technologies to information integration and management. From this perspective, one could argue the field is about establishing efficient (that is, low-cost) methods and tools for data sharing, discovery, integration, and reuse, and the World Wide Web may or may not be a data transmission vehicle in this context. This understanding of the field moves it closer to databases, or the data management part of data science.
A much more restrictive, but perhaps practically rather astute, delineation of the field may be made by characterizing it as investigating foundations and applications of ontologies, linked data, and knowledge graphs (all discussed later), with the W3C standardse RDF, OWL, and SPARQL at its core.
Perhaps each of these three perspectives has merit, and the field exists at their confluence: ontologies, linked data, and knowledge graphs are key concepts for the field; the W3C standards around RDF, OWL, and SPARQL constitute technical exchange formats that unify the field on a syntactic (and, to a certain extent, semantic) level; the application purpose of the field lies in establishing efficient methods for data sharing, discovery, integration, and reuse (whether for the Web or not); and the long-term vision that serves as a driver is the establishment of The Semantic Web as an artifact, complete with intelligent agent applications, at some point in the (perhaps distant) future.
In the rest of this article, I will lay out a timeline of the field’s history, covering key concepts, standards, and prominent outcomes. I will also discuss some selected application areas as well as the road and challenges that lie ahead.
A Subjective Timeline
Declaring any specific point in time as the birth of a field of research is of course debatable at best. Nevertheless, a 2001 Scientific American article by Berners-Lee et al.1 is an early landmark and has provided significant visibility for the nascent field. And, yes, it was around the early 2000s when the field was in a very substantial initial upswing in terms of community size, academic productivity, and initial industry interest.
But there were earlier efforts. The DARPA Agent Markup Language (DAML) programf ran from 2000 to 2006 with the declared goal of developing a Semantic Web language and corresponding tools. The European Union-funded On-To-Knowledge project,g running from 2000–2002, gave rise to the OIL language that was later merged with DAML, eventually giving rise to the Web Ontology Language (OWL) W3C standard. The more general idea of endowing data on the Web with machine-readable or “-understandable” metadata can be traced back to the beginnings of the World Wide Web itself. For example, a first draft of the Resource Description Framework (RDF) was published as early as 1997.h
Our story of the field commences in the early 2000s, and I group the narrative into three overlapping phases, each driven by a key concept; that is, under this reconstruction, the field has shifted its main focus at least twice. From this perspective, the first phase was driven by ontologies and spans the early to mid 2000s; the second phase was driven by linked data and stretches into the early 2010s; the third phase was, and still is, driven by knowledge graphs.
Ontologies. For most of the 2000s, work in the field had the notion of ontology at its center, a notion which, of course, has much older roots. According to an oft-cited source from 1993,5 an ontology is a formal, explicit specification of a shared conceptualization—though one may argue that this definition itself needs interpretation and is rather generic. In a more precise sense (and perhaps a bit post-hoc), an ontology is really a knowledge base (in the sense of symbolic artificial intelligence) of concepts (that is, types or classes, such as “mammal” and “live birth”) and their relationships (such as, “mammals give live birth”), specified in a knowledge representation language based on a formal logic. In a Semantic Web context, ontologies are a main vehicle for data integration, sharing, and discovery, and a driving idea is that ontologies themselves should be reusable by others.
In 2004, the Web Ontology Language OWL became a W3C standard (the revision OWL 211 was established in 2012), providing further fuel for the field. OWL at its core is based on a description logic, that is, on a sub-language of first-order predicate logici using only unary and binary predicates and a restricted use of quantifiers, designed in such a way that logical deductive reasoning over the language is decidable.12 Even after the standard was established, the community continued to discuss whether description logics were the best paradigm choice, with rule-based languages being a major contender.28 The discussion eventually settled, but the Rule Interchange Format RIF,25 which was later established as a rule-based W3C standard, gained relatively little traction.j
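To give a flavor of what such logical content looks like, the “mammals give live birth” example from above can be rendered as a description logic axiom (this rendering is my own illustration, not a quote from any particular ontology):

$$\mathit{Mammal} \sqsubseteq \exists\, \mathit{givesBirth}.\mathit{LiveBirth}$$

read as: every instance of the class Mammal stands in the givesBirth relation to some instance of the class LiveBirth. OWL can express axioms of exactly this kind, and the decidability of the underlying description logic guarantees that a reasoner can, for example, decide whether such an axiom logically follows from a set of others.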
Also in 2004, the Resource Description Framework (RDF) became a W3C standard (the revision RDF 1.132 was completed in 2014). In essence, RDF is a syntax for expressing directed, labeled, and typed graphs.k RDF is more or lessl compatible with OWL: one can use OWL to specify an ontology of types and their relationships, then use these types to type the nodes of an RDF graph and the relationships to label its edges. From this perspective, an OWL ontology can serve as a schema (or a logic of types) for the RDF (typed) graph.m
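The following is a minimal sketch of this interplay, written with the rdflib Python library; all IRIs and names are invented for illustration.

```python
# A small RDF graph typed by an OWL-style vocabulary, built with rdflib.
# All IRIs are invented examples.
from rdflib import Graph, Literal, Namespace, OWL, RDF, RDFS

EX = Namespace("http://example.org/onto#")
g = Graph()
g.bind("ex", EX)

# Schema part (the "ontology"): an OWL class and an object property.
g.add((EX.Mammal, RDF.type, OWL.Class))
g.add((EX.givesBirth, RDF.type, OWL.ObjectProperty))
g.add((EX.givesBirth, RDFS.domain, EX.Mammal))

# Data part (the RDF graph): a typed node and labeled edges.
g.add((EX.dolly, RDF.type, EX.Mammal))        # node typed by the OWL class
g.add((EX.dolly, RDFS.label, Literal("Dolly the sheep", lang="en")))
g.add((EX.dolly, EX.givesBirth, EX.bonnie))   # relationship used as an edge

print(g.serialize(format="turtle"))           # Turtle is one RDF syntax
```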
A W3C standard for an RDF query language, called SPARQL, followed in 2008 (with an update in 2013,36 which also made it more fully compatible with OWL). Additional standards in the vicinity of RDF, OWL, and SPARQL have been, or are being, developed, some of which have gained significant traction, for example, ontologies such as the Semantic Sensor Network ontology7 or the Provenance ontology,20 or the SKOS Simple Knowledge Organization System.24
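Continuing the sketch from above, a SPARQL query over that small graph could look as follows; rdflib ships with an in-memory SPARQL engine, so the query can be run directly (again, all IRIs are invented examples).

```python
# Querying the graph from the previous sketch with SPARQL via rdflib.
query = """
PREFIX ex: <http://example.org/onto#>
SELECT ?parent ?child WHERE {
    ?parent a ex:Mammal ;
            ex:givesBirth ?child .
}
"""
for row in g.query(query):
    print(row.parent, row.child)   # prints the dolly/bonnie pair
```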
With all these key standards developed under the W3C, basic compatibility between them and other key W3C standards has been maintained. For example, XML serves as a syntactic serialization and interchange format for RDF and OWL. All W3C Semantic Web standards also use IRIs as identifiers: for labels in an RDF graph, for OWL class names, and for datatype identifiers, among others.
The DARPA DAML program ended in 2006, and subsequently there were few if any large-scale funding lines for fundamental Semantic Web research in the U.S. As a consequence, much of the corresponding research in the U.S. moved either to application areas such as data management in healthcare or defense, or into adjacent fields altogether. In contrast, the European Union Framework Programmes, in particular FP 6 (2002–2006) and FP 7 (2007–2013), provided significant funding for both foundational and application-oriented Semantic Web research. This divergence in funding priorities is still mirrored in the composition of the Semantic Web research community, which remains predominantly European. The size of the community is difficult to assess, but since the mid-2000s, the field’s key conference—the International Semantic Web Conference—has drawn over 600 participants on average each year.n Given the interdisciplinary nature and diverse applications of the field, it should also be noted that much Semantic Web research is published in venues of adjacent research or application fields.
Industry interest has been significant from the outset, but it is next to impossible to reconstruct reliable data on the precise level of related industry activity. University spin-offs applied state-of-the-art research from the outset, and graduating Ph.D. students—in particular, the significant number produced in Europe—were finding corresponding industry jobs. Major and smaller companies have been involved in large-scale foundational or applied research projects, in particular under EU FP 6 and 7. Industry interest has changed focus with the research community, and we will come back to this throughout the narrative.
Some large-scale ontologies, often with roots predating the Semantic Web community, matured during this time. For example, the Gene Ontology35 had its beginnings in 1998 and is now a very prominent resource. Another example is SNOMED CT,o which can be traced back to the 1960s but is now fully formalized in OWL and widely used for electronic health records.33
As is so often the case in computer science research, initial over-hyped expectations of massive short-term breakthroughs gave way, around the mid-2000s, to a more sober perspective. Ontologies as mostly developed during this time—often based on ad-hoc modeling, since methodologies for ontology development were being researched but had not yet led to tangible results—turned out to be difficult to maintain and reuse. This, combined with the considerable up-front cost at that time of developing good ontologies,p paved the way for a shift in attention by the research community, a shift that can be understood as perhaps antithetical to the strongly ontology-based approach of the early 2000s.
Linked Data. The year 2006 saw the birth of “linked data” (or “linked open data” if the emphasis is on open, public, availability under free licenses). Linked data3 would soon become a major driver for Semantic Web research and applications and persist as such until the early 2010s.
What is usually associated with the term “linked data” is that linked data consists of a (by now rather large) set of RDF graphs that are linked in the sense that many IRI identifiers in the graphs also appear in other, sometimes multiple, graphs. In a sense, the collection of all these linked RDF graphs can be understood as one very big RDF graph.
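As a minimal sketch of such a link (the DBpedia IRI is real; everything else is invented for illustration), two independently published RDF graphs can be bridged by an owl:sameAs statement and then merged into one graph:

```python
# Sketch: two independently published RDF graphs sharing an IRI via an
# owl:sameAs bridge. The DBpedia IRI is real; the rest is invented.
from rdflib import Graph, Literal, Namespace, OWL, RDFS

DBR = Namespace("http://dbpedia.org/resource/")
MY = Namespace("http://example.org/people#")

# A graph published by one party...
g1 = Graph()
g1.add((DBR.Kofi_Annan, RDFS.label, Literal("Kofi Annan", lang="en")))

# ...and a graph published independently by another party, linking its
# local identifier to the DBpedia one.
g2 = Graph()
g2.add((MY.annan, OWL.sameAs, DBR.Kofi_Annan))
g2.add((MY.annan, MY.employer, Literal("United Nations")))

# Merging yields one larger graph in which the shared IRI connects
# the two datasets.
merged = g1 + g2
print(merged.serialize(format="turtle"))
```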
The number of publicly available linked RDF graphs has grown significantly, in particular during the first decade, as shown in Figure 1; the data is from the Linked Open Data Cloud website,q which does not account for all RDF datasets on the Web. A 2015 paper29 reports on “more than 37 billion triplesr from over 650,000 data documents,” which is also only a selection of all RDF graph triples that can be freely accessed on the World Wide Web. Large data providers, for example, often provide only a query interface based on SPARQL (a “SPARQL endpoint”), or use RDF for internal data organization but expose it only via human-readable Web pages. Datasets in the Linked Open Data Cloud cover a wide variety of topics, including geography, government, life sciences, linguistics, media, scientific publications, and social networking.
Figure 1. Number of RDF graphs in the Linked Open Data Cloud over time.
One of the most well-known and used linked datasets is DBpedia,22 which is a linked dataset extracted from Wikipedia (and, more recently, also Wikidata). The April 2016 releases covers about six million entities and about 9.5 billion RDF triples. Due to its extensive topic coverage (essentially, everything in Wikipedia) and the fact that it was one of the very first linked datasets to be made available, DBpedia plays a central role in the Linked Open Data Cloud of interlinked datasets: many other datasets link to it, so that it has become a kind of hub for linked data.
There was significant industry interest in linked data from the outset. For example, the BBCt was one of the first significant industry contributors to the Linked Data Cloud, and the New York Times Company31 and Facebook40 were early adopters. However, industry interest seemed mostly to be about utilizing linked data technology for data integration and management, often without it being visible on the open World Wide Web.
During the Linked Data era, ontologies played a much less prominent role. They were often used as schemas in that they informed the internal structure of RDF datasets; however, the information in RDF graphs in the Linked Data Cloud was shallow and relatively simplistic compared to the overpromises, and the depth of research, of the Ontologies era. The credo sometimes voiced during this time was that ontologies cannot be reused, and that a much simpler approach based mainly on utilizing RDF and links between datasets held much more realistic promise for data integration, management, and applications on and off the Web. It was also during this time that RDF-based data organization vocabularies with little relation to ontologies, such as SKOS,24 were developed.
It was also during this time (2011) that schema.org appeared on the scene.6 Initially driven by Bing, Google, and Yahoo!—and slightly later joined by Yandex—schema.org made public a relatively simple ontologyu and suggested that website providers annotate (that is, link) entities on their sites with the schema.org vocabulary. In return, the Web search engine providers behind schema.org promised to improve search results by utilizing the annotations as metadata. Schema.org saw considerable initial uptake: in 2015, Guha et al.6 reported that over 30% of pages have schema.org annotations.
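For illustration, here is what such an annotation might look like in JSON-LD, the format in which schema.org markup is commonly embedded in Web pages; the schema.org types below are real, the concrete values are my own example, and parsing assumes an rdflib version with built-in JSON-LD support (6.0 or later).

```python
# A schema.org annotation in JSON-LD, of the kind embedded in Web pages
# inside <script type="application/ld+json"> elements. The types are
# real schema.org terms; the values are an invented example.
from rdflib import Graph

snippet = """
{
  "@context": "https://schema.org",
  "@type": "Person",
  "name": "Kofi Annan",
  "jobTitle": "Secretary-General of the United Nations",
  "birthPlace": {"@type": "Place", "name": "Kumasi"}
}
"""

g = Graph()
# Resolving the remote @context requires network access.
g.parse(data=snippet, format="json-ld")
print(g.serialize(format="turtle"))  # the same content as RDF triples
```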
Another prominent effort launched in 2012 is Wikidata,39 which started as a project at Wikimedia Deutschland funded among others by Google, Yandex, and the Allen Institute for AI. Wikidata is based on a similar idea as Wikipedia, namely, to crowdsource information. However, while Wikipedia provides encyclopedia-style texts (with human readers as the main consumers), Wikidata is about creating structured data that can be used by programs or in other projects. For example, many other Wikimedia efforts, including Wikipedia, use Wikidata to provide some of the information they present to human readers. As of this writing, Wikidata has over 66 million data items, has seen over one billion edits since project launch, and has over 20,000 active users.v Database downloads are available in several W3C standard formats, including RDF.
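As a sketch of programmatic access, the following queries the public Wikidata SPARQL endpoint with the SPARQLWrapper Python library; P26 is Wikidata's “spouse” property, and the query looks the entity up by its English label rather than by a hard-coded identifier (network access required).

```python
# Sketch: querying the public Wikidata SPARQL endpoint. P26 is
# Wikidata's "spouse" property; the endpoint predefines the rdfs: and
# wdt: prefixes used below.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper(
    "https://query.wikidata.org/sparql",
    agent="semantic-web-review-demo/0.1",  # polite custom user agent
)
endpoint.setQuery("""
SELECT ?spouseLabel WHERE {
  ?person rdfs:label "Kofi Annan"@en ;
          wdt:P26 ?spouse .
  ?spouse rdfs:label ?spouseLabel .
  FILTER (lang(?spouseLabel) = "en")
}
""")
endpoint.setReturnFormat(JSON)
results = endpoint.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["spouseLabel"]["value"])
```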
During the early 2010s, the initial hype about linked data began to give way to a more sober perspective. While there were indeed some prominent uses and applications of linked data, it turned out that integrating and utilizing it took more effort than some had initially expected. Arguably, the shallow, non-expressive schemas often used for linked data were a major obstacle to reusability,16 and initial hopes that interlinks between datasets would somehow compensate for this weakness did not really materialize. This observation should not be understood as demeaning the significant advances linked data has brought to the field and its applications: just having data available in some structured format that follows a prominent standard means it can be accessed, integrated, and curated with available tools, and then made use of—and this is much easier than if the data is provided in syntactically and conceptually more heterogeneous forms. But the quest for more efficient approaches to data sharing, discovery, integration, and reuse was of course as important as ever, and it continues.
Knowledge Graphs. In 2012, a new term appeared on the scene when Google launched its Knowledge Graph. Pieces of the Google Knowledge Graph can be seen, for example, by searching for prominent entities on google.com: next to the search results linking to Web pages, a so-called infobox is displayed that shows information from the Google Knowledge Graph. An example of such an infobox is given in Figure 2—this was retrieved by searching for the term Kofi Annan. One can navigate from this node to other nodes in the graph by following one of the active hyperlinks, for example, to Nane Maria Annan, who is listed with a spouse relationship to the Kofi Annan node. After following this link, a new infobox for Nane Maria Annan is displayed next to the usual search results for that term.
Figure 2. Google Knowledge Graph node as shown after searching on google.com for the term “Kofi Annan.”
While Google does not provide the Knowledge Graph for download, it does provide an API to access contentw—the API uses standard schema.org types and is compliant with JSON-LD,34 which is essentially an alternative syntax for RDF standardized by the W3C.
Knowledge graph technology has found a prominent place in industry, including leading information technology companies other than Google, such as Microsoft, IBM, Facebook, and eBay.27 However, given the history of Semantic Web technologies, and in particular of linked data and ontologies discussed earlier, it seems that knowledge graph is mostly a new framing of ideas coming directly out of the Semantic Web field,x with some notable shifts in emphasis.
One of the differences concerns openness: as the term Linked Open Data has suggested from the very beginning, the linked data efforts of the Semantic Web community mostly had open sharing of data for reuse as one of their goals, which means that linked data is mostly made freely available for download or via SPARQL endpoints, and the use of non-restrictive licenses is considered important in the community. Wikidata, as a knowledge graph, is likewise unowned and open. In contrast, the more recent activities around knowledge graphs are often industry-led, and the prime showcases are not really open in this sense.27
Another difference is one of central control versus bottom-up community contributions: the Linked Data Cloud is in a sense the largest currently existing knowledge graph, but it is hardly a coherent entity. Rather, it consists of loosely interlinked individual subgraphs, each of which is governed by its very own structure, representation schema, and so on. Knowledge graphs, in contrast, are usually understood to be much more internally consistent, and more tightly controlled, artifacts. As a consequence, the value of external links—that is, links to external graphs without tight quality control—is put into doubt,y while the quality of content and/or the underlying schema comes more into focus.
The biggest difference is probably the transition from academic research (which mostly drove the linked data effort) to use in industry. As such, recent activities around knowledge graphs are fueled by the strong industrial use cases and their demonstrated or perceived added value, even though there is, to the best of my knowledge, no published formal evaluation of their benefits.
Yet many of the challenges and issues concerning knowledge graphs remain the same as they were for linked data; for example, all items on the list of current challenges listed in Noy et al.27 are very well-known in the Semantic Web field, many with substantial bodies of research having been undertaken.
Selected Relationships to Other Fields and Disciplines
As we discussed, the Semantic Web field is not primarily driven by certain methods inherent to the field, which distinguishes it from some other areas such as machine learning. Rather, it is driven by a shared vision,z and as such it borrows from other disciplines as needed.aa
For example, the Semantic Web field has strong relations to knowledge representation and reasoning as a sub-discipline of artificial intelligence, as knowledge graph and ontology representation languages can be understood as—and are closely related to—knowledge representation languages, with description logics, the logics underpinning the Web Ontology Language OWL, playing a central role. Semantic Web application requirements have also driven or inspired description logic research, as well as investigations into bridging between different knowledge representation approaches such as rules and description logics.19
The field of databases is clearly closely related: topics such as (meta)data management and graph-structured data have a natural home there but are also of importance for the Semantic Web field. However, Semantic Web research is strongly focused on the conceptual integration of heterogeneous sources, for example, on how to overcome different ways of organizing data; in Big Data terminology, the Semantic Web emphasis is primarily on the variety aspect of data.17
Natural language processing as an application tool plays an important role, for example, for knowledge graph and ontology integration, for natural language query answering, as well as for automated knowledge graph or ontology construction from texts.
Machine learning, and in particular deep learning, are being investigated as to their capability to improve hard tasks arising in a Semantic Web context, such as knowledge graph completion (in the sense of adding missing relations), dealing with noisy data, and so on.4,10 At the same time, Semantic Web technologies are being investigated as to their potential to advance explainable AI.10,21
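To make the knowledge graph completion task concrete with one well-known example from the literature (named here for illustration, and not specific to the works cited above): embedding-based approaches such as the TransE model learn a vector for each entity and relation and score a candidate triple $(h, r, t)$ by how well the relation vector translates the head entity to the tail entity,

$$f(h, r, t) = -\lVert \mathbf{h} + \mathbf{r} - \mathbf{t} \rVert,$$

so that high-scoring triples not present in the graph can be proposed as missing relations.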
Some aspects of cyber-physical systems and the Internet of Things are being researched using Semantic Web technologies, for example, in the context of smart manufacturing (Industry 4.0), smart energy grids, and building management.30
Some areas in the life sciences already have a considerable history of benefiting from Semantic Web technologies, for example, the previously noted SNOMED CT and Gene Ontology. Generally speaking, biomedical fields were early adopters of Semantic Web concepts. Another prominent example is the development of ICD-11, which was driven by Semantic Web technologies.38
Other current or potential application areas for Semantic Web technologies can be found wherever there is a need for data sharing, discovery, integration, and reuse, for example, in geosciences or in digital humanities.15
Some of the Road Ahead
Undoubtedly, the grand goal of the Semantic Web field—be it the creation of The Semantic Web as an artifact, or the provision of solutions for data sharing, discovery, integration, and reuse that make these tasks easy and painless—has not yet been achieved. This does not mean that intermediate results are not of practical use or even industrial value, as the discussions about knowledge graphs, schema.org, and the life science ontologies demonstrate.
Yet, to advance toward the larger goals, further advances are required in virtually every subfield of the Semantic Web. For many of these, discussions of the most pressing challenges can be found, for example, in Bernstein et al.,2 in the contributions to the January 2020 special issue of the Semantic Web journal,ab in Noy et al.27 for industrial knowledge graphs, in Thieblin et al.37 for ontology alignment, in Martinez-Rodriguez et al.23 for information extraction, in Höffner et al.13 for question answering, and in Hammer et al.9 for ontology design patterns, among others. Rather than repeat or recompile these lists, let us focus on the challenge that I personally consider the major short-term roadblock for the field at large.
There is a wealth of knowledge—hard and soft—in the Semantic Web community and its application communities about how to approach issues around efficient data management. Yet new adopters often find themselves confronted with a cacophony of voices pitching different approaches, little guidance as to the pros and cons of these approaches, and a bag of tools ranging from crude, unfit-for-practice research prototypes to well-designed software for particular subproblems, again with little guidance as to which tools and which approaches will best help them achieve their particular goals.
Thus, what the Semantic Web field most needs at this stage is consolidation. And as the field is inherently application-driven, this consolidation will have to happen across its subfields, resulting in application-oriented processes that are well-documented as to their goals and pros and cons, and that are accompanied by easy-to-use and well-integrated tools supporting the whole process. For example, some of the prominent and popular software available, such as the Protégé ontology editor,26 the OWL API,14 Wikibase, which is the engine underlying Wikidata,ac or the ELK reasoner,18 are powerful and extremely helpful, but in some cases fall far short of working easily with each other, even though they all use RDF and OWL for serializations.
Who could be the drivers of such consolidation? For academics, there is often limited incentive to develop and maintain stable, easy-to-use software, as academic credit—mostly measured in publications and in the sum of acquired external funding—often does not align well with these activities. Likewise, complex processes are inherently difficult to evaluate, which means that top-tier publication options for such kinds of work are limited. Writing high-quality introductory textbooks as a means to consolidate a field is very time-consuming and returns very little academic credit. Yet, the academic community does provide a basis for consolidation, by developing solutions that bridge between paradigms, and by partnering with application areas to develop and materialize use-cases.
Consolidation of sorts is already happening in industry, as witnessed by the adoption of Semantic Web technologies in start-ups and multinationals. However, the technical details underlying this adoption, not to speak of in-house software (for example, in the case of the industrial knowledge graphs discussed in Noy et al.27), are usually not shared, presumably to protect competitive advantages. If this is indeed the case, then it may only be a matter of time before corresponding software solutions become more widely available.
Conclusion
Within its first approximately 20 years of existence, the Semantic Web field has produced a wealth of knowledge regarding efficient data management for data sharing, discovery, integration, and reuse. The contributions of the field are best understood by means of the applications they have given rise to, including schema.org, industrial knowledge graphs, Wikidata, and ontology modeling applications, among others discussed throughout this article.
It is natural to also ask about the key scientific discoveries that have provided the foundations for these applications; however, this question is much more difficult to answer. What I hope has become clear from the narrative is that advances in the pursuit of the Semantic Web theme require contributions from many computer science subfields, and that one of the key quests is finding out how to piece together contributions, or modifications thereof, in order to provide applicable solutions. In this sense, the applications (including those mentioned herein) showcase the major scientific progress of the field as a whole.
Of course, many of the contributing fields have individually made major advances in the past 20 years, and sometimes central individual publications have decisively shaped the narrative of a subfield. Reporting in more detail on such advances would be a worthwhile endeavor, but it would constitute a separate piece in its own right. The interested reader is encouraged to follow up on the references given, which in turn point to the key individual technological contributions that led to the existing widely used standards, the landmark applications reported herein, and the current discussion of open technical issues in the field.
The field is seeing mainstream industrial adoption, as laid out in the narrative. However, the quest for more efficient data management solutions is far from over and continues to be a driver for the field.
Acknowledgment. This work was supported by the National Science Foundation under award OIA-2033521.