Research and Advances
Computing Applications Contributed articles

Database and Information-Retrieval Methods For Knowledge Discovery

Comprehensive knowledge bases would tap the Web's deepest information sources and relationships to address questions beyond today's keyword-based search engines.
  1. Introduction
  2. Harvesting, Searching, Ranking the Web
  3. YAGO for Large-Scale Semantic Knowledge
  4. Searching and Ranking YAGO with NAGA
  5. Conclusion
  6. Acknowledgments
  7. References
  8. Authors
  9. Footnotes
  10. Figures
illustrated matrix of data elements

Our aim here is to advocate for the integration of database systems (DB) and information-retrieval (IR) methods to address applications that are emerging from the ongoing explosion and diversification of digital information. One grand goal of such an endeavor is the automatic building and maintenance of a comprehensive knowledge base of facts from encyclopedic sources and the scientific literature. Facts should be represented in terms of typed entities and relationships and allow expressive queries that return ranked results with precision in an efficient and scalable manner. We thus explore how DB and IR methods might contribute toward this ambitious goal.

DB and IR are separate fields in computer science due to historical accident. Both investigate concepts, models, and computational methods for managing large amounts of complex information, though each began almost 40 years ago with very different application areas as motivations and technology drivers; for DB it was accounting systems (such as online reservations and banking), and for IR it was library systems (such as bibliographic catalogs and patent collections). Moreover, these two directions and their related research communities emphasized very different aspects of information management; for DB it was data consistency, precise query processing, and efficiency, and for IR it was text understanding, statistical ranking models, and user satisfaction.

There were attempts at integration (late 1990s), most notably the probabilistic datalog and probabilistic relational-algebra models,13,14 the proximal node model,19 and the WHIRL approach to similarity joins.9 But it is only in the past few years that mission-critical applications have emerged with a compelling need for integrated DB and IR methods and platforms. From an IR perspective, digital libraries of all kinds are becoming rich information repositories, with documents augmented by metadata and annotations captured in semistructured data formats (such as XML); enterprise search on intranet data represents a variant of this theme.

From a DB point of view, application domains (such as customer support, product and market research, and health-care management) reflect data growth in terms of both structured and unstructured information. Web 2.0 applications (such as social networks) require support for structured and textual data, as well as ranking and recommendation in the presence of uncertain information of highly diverse quality (see Figure 1). The Figure categorizes information systems along two dimensions: how the data is to be managed and how the data is to be searched. The first divides the world of digital data into structured data (such as like schema-oriented records with numerical, categorical, and short-string attributes) and unstructured data (such as natural-language text and multimodal information, including speech and video) and loose collections of heterogeneous records. The second dimension distinguishes sophisticated query languages that express logical conditions from simple keyword search as the prevalent way of posing queries to search engines. Since the late 1960s DB and IR systems have resided in two totally separate quadrants in the Figure, while it seemed as though the other two were useless or unoccupied.

Since the late 1990s, DB and IR researchers have explored these previously blank quadrants (middle of the Figure). IR-style keyword search over structured data (such as relational databases) makes sense when the structural data description—the schema—is so complex that information needs cannot be concisely or conveniently expressed in a structured query. As an example of this difficulty, consider a social-network database with tables of users, friends, and posted items (such as photos, videos, and recommended books or songs), as well as ratings and comments. Assume a user wants to find the connections shared by Alon, Raghu, and Surajit with respect to the Semantic Web. Answers might be that the three co-authored a book on the Semantic Web, two edited a book, one commented on it, or the three are friends and one posted a video called “Semantic Web Saga.” With structured querying, where each value (such as “Alon”) refers to a particular attribute (such as User.Name and Friend.Name), the combinatorial options lead to very complex queries with many joins and unions. Much simpler is to state five keywords—”Alon, Raghu, Surajit, Semantic, Web”—and let the system compute the most meaningful answers in a relational graph. This relaxed attitude toward the schema (which value should occur in which attribute) naturally entails IR-style ranking.

Conversely, linguistic and learning-based information-extraction techniques have been applied in order to augment textual sources with structured records and enable expressive DB-style querying over originally unstructured data. Consider an information request about “the life of the scientist Max Planck” to be evaluated over an XML-based digital library, perhaps an extended form of Wikipedia. A simple approach would be to formulate a keyword query like “life scientist Max Planck.” Unfortunately, the results would be dominated by information about the Max-Planck Institutes (approximately 80 in Germany) in the area of life sciences. Structured query languages (such as SQL and XPath Full-Text) allow professional users to specify more precisely what they are interested in, possibly in the form of attribute name-value conditions (such as Name = “Max Planck”) and XML structure-and-content conditions (such as

  • //Article [Person ftcontains “Max Planck”]
  • [Category ftcontains “science”] //Biography).

These search predicates yield much more precise answers but may require approximate matching (to counter overspecified queries) and result ranking2,25 and related references.

Meanwhile, the initially pure quadrants for DB and IR systems have been substantially enhanced by new methods for digital libraries, enterprise search and analytics, text extensions for database engines, and ranking capabilities for SQL and XQuery. The boundaries between quadrants are blurring; the DB and IR fields are increasingly fertilizing each other. The “Future” part of Figure 1 envisions convergence toward an integrated solution, though only the future alone can reveal if this goal is feasible and how it might be achieved.

Many applications must be able to manage both structured and unstructured data. Consider a health-care scenario involving, say, relational tables with the following schemas (attribute names and types, unique keys underlined):

  • Disease (DId int; Name char[50]; Category int; Pathogen char[50]; …)
  • Patient (PId int; …; Age int; Treated-DId int; ResponsibleHId int; Timestamp date; Report longtext; …)
  • Hospital (HId int; Address char[200]; …)

Some of the information, especially foreign-key references between relations (such as a patient record referring to a disease identifier), is suitable for structured queries. But long text fields, often containing valuable latent information, are amenable to only keyword and text-similarity search. Moreover, some of the attributes (such as Category) may refer to external taxonomies and ontologies (such as the Unified Medical Language System).

To illustrate the nature of querying such data, consider an information request (such as “Find young patients in central Europe who have been reported, in the past two weeks, to have symptoms of tropical virus diseases and an indication of anomalies”). Computing relevant answers requires evaluating structured predicates (such as range conditions on Age and joins with additional ontology tables for identifying the values of Disease—Name, Category, Pathogen—referring to tropical virus diseases). This computation also involves fuzzy predicates on Report to test the “anomaly” condition. Moreover, hospital Address must be matched against geographic taxonomies with some inherent vagueness (such as about, say, whether England and Italy are part of central Europe). Finally, with such fuzzy matching and similarity tests it is absolutely necessary for the query engine to be able to provide meaningful rankings of the results. For example, a case with strong evidence of anomalies that may be at the border or just outside of central Europe or happened just outside the two-week time window should rank higher than a case that satisfies the structured (and spatio-temporal) conditions very well, though little evidence in the Report suggests it was particularly anomalous.

Structured and unstructured search conditions are combined in a single query, and the query results must be ranked. The queries must be evaluated over very large data sets that exhibit high update rates as new cases of diseases are added. A programmer can build such an application through two separate platforms—a DB system for the structured data and an IR search engine for the textual and fuzzy-matching issues. But this widely adopted approach is a challenge to application developers, as many tasks are not covered by the underlying platforms and must be addressed in the application code. An integrated DB/IR platform would greatly simplify development of the application and largely reduce the cost of maintaining and adapting it to future needs.

Abstracting from this application-centric discussion, we have identified several compelling motivations for bringing IR concepts to DB systems and vice versa, leading to the following DB and IR concepts and methods a developer would find useful:

Approximate matching and record linkage. Adding text-matching functionality to DB systems often entails approximate matching (such as due to spelling variants) and when text fields refer to named entities lead to record linkage for matching entities. For example, the strings “William J. Clinton” and “Bill Clinton” likely denote the same person, and the names “M-31” and “NGC 224” should be reconciled to denote the Andromeda galaxy. Approximate matching by similarity measures requires IR-style ranking.

Too-many-answers ranking. Preference search of, say, travel portals and product catalogs often poses a too-many-answers problem. Narrowing the query conditions may overshoot by producing too few or even no results; interactive reformulation and browsing is time-consuming and might irritate users. Large result sets inevitably require ranking based on data and/or workload statistics, as well as on user profiles;

Schema relaxation and heterogeneity. In the DB world, the norm is that applications access multiple databases, often with a run-time choice of the data sources. Even if each source contains structured data records and comes with an explicit schema, there is no unified global schema unless a breakthrough could be achieved to magically perform perfect on-the-fly data integration. So the application program must be able to cope with the heterogeneity of the underlying schema names, XML tags, and Resource Description Framework (RDF) properties, and queries must be schema-agnostic or at least tolerant to schema relaxation;

Information extraction and uncertain data. Textual information contains named entities and relationships in natural-language sentences that can be made explicit through information-extraction techniques (pattern matching, statistical learning, and natural-language processing). However, this approach can lead to large knowledge bases with facts that exhibit uncertainty; querying extracted facts thus entails ranking.

Entity search and ranking. Recognizing entities in text sources allows entity-search queries about, say, electronics products, travel destinations, and movie stars, boosting search capabilities on intranets, portals, news feeds, and the business- and entertainment-oriented parts of the Web. Extracting binary relations between entities, as well as place and time attributes, could pave the way toward semantic IR on digital libraries (such as PubMed), news, and blogs and also aid natural-language question answering and searching the deep, or hidden, Web.

Back to Top

Harvesting, Searching, Ranking the Web

The Web has the potential for being the world’s most comprehensive knowledge base, but we are still far from exploiting it. Valuable scientific and cultural content is all mixed up with huge amounts of noisy, low-quality, unstructured text and media. The challenge is how to extract the important facts from the Web and organize them into an explicit knowledge base that captures entities and semantic relationships among them. Imagine a formally structured Wikipedia with the same scale and richness as Wikipedia itself but that offers a precise and concise representation of knowledge that enables expressive and precise querying.

Figure 2 outlines what such a knowledge base might look like, depicting an excerpt from our own Yet Another Great Ontology (YAGO) knowledge base,24 a typed entity-relationship graph that can be represented in the RDF or Owl-Lite data models. Building and maintaining it in a largely automated manner is not only difficult but an opportunity for computer science to contribute toward high-value assets for science, culture, and society. DB and IR methods could indeed have the potential to play major roles in this endeavor.

With a knowledge base that sublimates valuable content from the Web, we could address difficult questions beyond the capabilities of today’s keyword-based search engines. For example, a user might ask for a list of drugs that inhibit proteases and obtain a fairly comprehensive list of drugs for this HIV-relevant family of enzymes. Such advanced information requests are posed by knowledge workers, including scientists, students, journalists, historians, and market researchers. Although it is possible to find relevant answers, the process is laborious and time-consuming, as it often requires rephrasing queries and browsing through many potentially promising but ultimately useless result pages. The following example questions illustrate this complexity:

Which German Nobel laureate survived both world wars and outlived all four of his children? The answer is Max Planck. The bits and pieces needed to answer are not difficult to locate: lists of Nobel prize winners, birth and death dates of the relevant people, the names of family members extracted from biographies, and dates associated with the various children. Gathering and connecting these facts is straightforward for a human but could take them days of manually inspecting Web pages.

Which politicians are also accomplished scientists? Today’s search engines fail on such questions because they match words and return pages rather than identify entities (such as persons) and test their relationships. Moreover, the question entails a difficult ranking problem. Wikipedia alone contains hundreds of names listed in the categories “Politicians” and “Scientists.” An insightful answer must rank important people first, say, the German chancellor Angela Merkel, who has a doctoral degree in physical chemistry, and Benjamin Franklin, who made scientific discoveries and was a founding father of the U.S.

How are Max Planck, Angela Merkel, Jim Gray, and the Dalai Lama related? All four have doctoral degrees from German universities (honorary doctorates for Gray and the Dalai Lama). Discovering interesting facts about multiple entities and their connections on the Web is virtually impossible due to the sheer amount of interconnected pages about these four famous people.

Note that even though the questions are asked in natural language, they would remain equally difficult to answer even if expressed in a formal language. Conversely, a rich knowledge base of entities and relationships would enable much more effective natural-language question answering.

Information organization and search on the Web are being augmented with increasingly sophisticated structure, context awareness, and semantic favor in the form of faceted search, vertical-domain search, entity search, and deep-Web search. All major search engines recognize a large fraction of worldwide product names, have built-in knowledge about geographic locations, and return high-precision results for popular queries about consumer interests, travel, and entertainment. Information-extraction and entity-search methods are clearly at work. But these efforts focus only on specific domains. Generalizing the approach toward a universal methodology for knowledge harvesting requires bolder steps, and three major research avenues promise to contribute to this goal:

Semantic-Web-style knowledge repositories (such as ontologies and taxonomies). Included are general-purpose ontologies and thesauri (such as SUMO, OpenCyc, and WordNet), as well as domain-specific ontologies and terminological taxonomies (such as GeneOntology and UMLS in the biomedical domain);

Large-scale information extraction (IE) from text sources in the spirit of a Statistical Web. IE methods—entity recognition and learning relational patterns—are increasingly scalable and less dependent on human super-vision1,10,21; and

Social tagging and Web 2.0 communities that constitute the social Web. Human contributions are abundant in the form of semantically annotated Web pages, phrases in pages, images, and videos, together providing “wisdom of the crowds.” Freebase and other such endeavors collect structured data records from human communities. Wikipedia is another example of the Social Web paradigm, including semistructured data (such as infoboxes) that can be augmented with explicit facts.4,24,27

Research projects often combine elements of the semantic, statistical, and social approaches. Here, we discuss several interesting projects, highlighting YAGO results:

Libra. Aiming to support entity search on the Web, the Microsoft Research Lab in Beijing has developed comprehensive technology for information extraction, including pattern-matching algorithms tailored to typical Web-page layouts and trained learning of patterns using advanced models (such as hierarchical conditional random fields28). A particularly fruitful focus is to extract entities and their attributes from product-related pages with HTML tables and lists. These methods and tools are being used to build and maintain several vertical-domain portals, including product search and the Libra portal for scholarly search on extracted records about authors, papers, conferences, and communities.

Once the facts are gathered and organized into searchable form, a typical IR issue arises concerning how a system should rank the results of an entity-centric query. To this end, an advanced statistical language model (LM) has been extended from the form of document-oriented bags of words to the form of structured records.20

Libra is an example of the Statistical-Web approach.

Cimple/DBLife. The Cimple project,11,22 being carried out jointly by the University of Wisconsin and Yahoo! Research, is similar to Libra, aiming to generate and maintain community-specific portals with structured information gathered from Web sources. However, it applies a number of methods to achieve this goal, as we illustrate by discussing its flagship application: the DBLife portal.

DBLife features automatically compiled “super-homepages” of researchers with bibliographic data, as well as facts about community services (such as PC work), colloquium lectures, and more. For gathering and reconciling these facts, Cimple has a suite of DB-style extractors based on pattern matching and dictionary lookups. The extractors are combined into execution plans and periodically applied to a carefully selected set of relevant Web sources, including prominent sites like DBLP and the Dbworld archive and important conference and university pages that are selected semi-automatically. While the overall approach makes use of IR concepts like tf*idf-based ranking and Web-graph link analysis, Cimple emphasizes a more DB-oriented toolkit for declarative extraction programs, using Datalog as a query-language framework and DB rewriting techniques for query optimization.17

Cimple leans more toward the Semantic-Web approach and less toward a Statistical-Web approach. In addition, it contains Social-Web elements, most notably, a Wiki-based mechanism for users to provide feedback about incorrect facts they identify on community portals.

KnowItAll/TextRunner. Both Libra and Cimple operate on the basis of one page at a time, then aim to extract as many facts as possible from the given page. A dual view is to focus on one or more entity types or relationship types, aiming to populate them by inspecting many pages and exploiting their redundancies. For example, a user might want to find all cities on planet Earth, along with all scientists, guitar players, and other unary relations (entity types). For binary relations, a user might consider gathering all CEOs of all companies, all (city, river) pairs where a city is located on a river, or the answers to questions like: Who discovered what? and Which enzyme triggers which biochemical process?

The challenge is how to extract the important facts from the Web and organize them into an explicit knowledge base that captures entities and semantic relationships among them.

The KnowItAll project5,6,12 at the University of Washington in Seattle has pursued this goal, using techniques that combine pattern matching, linguistic analysis, and statistical learning. KnowItAll starts with a set of seeds: the instances of the relation of interest (such as a set of cities or a set of (city, river) pairs).12 This is the only “training input” needed by KnowItAll, which automatically funds sentences on the Web with the seeds, extracts linguistic patterns surrounding the seeds, performs statistical analyses to identify strong patterns, and finally identifies the most useful patterns to obtain extraction rules. For example, the phrase templates “located in downtown $x” and “$x is located on the banks of $y” may be determined to be good rules for extracting cities and (city, river) pairs, respectively. Now these rules can be applied to newly seen Web pages, yielding facts or fact candidates, some in turn considered as new, additional seeds. Needed are statistical inferences to identify good rules and assess the confidence in the harvested facts.

The TextRunner tool5 pays special attention to scalability and simplifies the entire fact-gathering pipeline. It has a completely unsupervised bootstrapping phase for identifying simple patterns, just enough to identify, with high confidence, noun phrases and verbal patterns. When TextRunner sees a new Web page, it aggressively extracts all potentially meaningful instances of all possible binary relation types from the page text; Banko et al.5 refers to this processing mode as open information extraction, or “machine reading.”

KnowItAll and TextRunner are examples of Statistical-Web methods for large-scale knowledge acquisition.

Back to Top

YAGO for Large-Scale Semantic Knowledge

Our YAGO project23,24 shares the KnowItAll and TextRunner goal of large-scale knowledge harvesting but emphasizes high accuracy and consistency rather than high recall (coverage). YAGO is best characterized as a Semantic-Web approach, gathering its knowledge by (primarily) integrating information from Wikipedia and WordNet. It also employs text-mining-based techniques. YAGO contains close to two million entities and about 20 million facts about them, where facts are instances of binary relations. Extensive sampling has shown that YAGO accuracy is at least 95%, and many of its errors (false positives) are due to incorrect entries in Wikipedia itself. YAGO is publicly available at

Two Wikipedia assets—infoboxes and the category system—are almost structured data. Infoboxes are collections of attribute name-value pairs often based on templates and reused for important types of entities (such as countries, companies, scientists, music bands, and sports teams). For example, the infobox for Max Planck delivers such data as birth_date = April 23, 1858, birth_place = Kiel, death_date = October 4, 1947, nationality = Germany, and alma_mater = Ludwig-Maximilians-Universität München. As for the category system, the Max Planck article is manually placed in such categories as German_Nobel_laureates, Nobel_laureates_in_physics, quantum_ physics, and University_of_Munich_ alumni. All give YAGO clues about instanceOf relations, so it can infer that the entity Max Planck is an instance of the classes GermanNobelLaureates, NobelLaureatesInPhysics, and UniversityOfMunichAlumni. But YAGO must be careful, as the placement in category quantum_physics does not mean that Max Planck is an instance of QuantumPhysics. The YAGO extractors employ linguistic processing (noun phrase parsing) and mapping rules to achieve high accuracy in harvesting the categories information.

These examples of YAGO information extraction indicate that relying solely on Wikipedia infoboxes and categories may result in a large but incoherent collection of facts. For example, we may know that Max Planck is an instance of GermanNobelLaureates but be unable to automatically infer that he is also an instance of Germans and of Nobel Laureates. Likewise, the fact that he was a physicist does not automatically tell us he was a scientist. To address these shortcomings, YAGO makes intensive use of the WordNet thesaurus (lightweight ontology), integrating the facts it harvests from Wikipedia with the taxonomic backbone provided by WordNet.

While WordNet knows many abstract classes and the “is-a” and “part-of” relationships among them, it has only sparse information about individual entities that would populate its classes. The wealth of entities in Wikipedia complements WordNet nicely; conversely, the rigor and extensive coverage of WordNet’s taxonomy compensate for the gaps and noise in the Wikipedia category system. Each individual entity YAGO discovers must be mapped into at least one existing YAGO class. If this fails, the entity and its related facts are not admitted into the knowledge base. Analogously, classes derived from Wikipedia category names (such as GermanNobelLaureates) must be mapped with a subclass relationship to one or more superclasses (such as NobelLaureates and Germans). These procedures ensure that YAGO maintains a consistent knowledge base, where consistency eliminates dangling entities and classes and guarantees that the subclass relation is acyclic.

Kylin/KOG. The “Intelligence in Wikipedia” project also extracts information from Wikipedia through its tools Kylin27 and Kylin Ontology Generator (KOG).26 Whenever an infobox type includes an attribute in some articles but the attribute has no value for a given article, Kylin analyzes the full text of the article to derive the most likely value. Like KnowItAll and TextRunner (but unlike Libra, Cimple, and YAGO), Kylin pursues open extraction by considering all potentially significant attributes, even if they occur only sparsely in the entire Wikipedia corpus. KOG builds on Kylin’s output, unifies attribute names, derives type signatures, and (like YAGO) maps these entities onto the WordNet taxonomy through statistical relational learning.15 KOG goes beyond YAGO by discovering new relationship types. It builds on the class system of both YAGO and DBpedia,4 along with the entities in each class, to train its learning algorithms for generating the subsumption graph among classes.

The Kylin/KOG project combines all three knowledge-gathering paradigms: Semantic-Web-oriented by being targeted at infoboxes; Social-Web-based by leveraging the input of the large Wikipedia community; and Statistical-Web-style through learning methods.

Back to Top

Searching and Ranking YAGO with NAGA

The query language we designed for YAGO adopts concepts from the standardized SPARQL Protocol and RDF Query Language for RDF data but extends them through more expressive pattern matching and ranking.3,18 The prototype system that implements these features is called NAGA (for Not Another Google Answer, Viewing the knowledge base as a graph, users and programmers alike can construct a query with the help of subgraph templates; Figure 3 outlines three examples related to the question scenarios discussed earlier.

The leftmost query in the Figure about politicians who are also scientists shows two nodes matched by the desired results and one node (labeled $x) denoting a variable for which the query must find all bindings. The edge labels denote relationships and need to be matched by the results. Here, “isa” is shorthand notation for a composition of two connected edges that correspond to the relationships instanceOf between an entity and a class and subclass between two classes. This way the user also finds people who belong to the classes “mayor” and “physicist.”

The query in the middle of Figure 3 (a simpler variant of the German-Nobel-laureate question) generalizes the point about labels referring to compositions of relations. The label (bornIn|livesIn|citizenOf).locatedIn* is a regular expression that allows users to avoid overspecifying their information demand. We may be generous when we call a person German, and the locatedIn relationship often reflects geographical hierarchies (such as with cities, counties, states, and countries).

The rightmost query of Figure 3 is a broad relatedness query that looks for commonalities or other connections among several entities. Here again, users or programmers would use regular expressions as edge labels in the query’s graph template.

NAGA queries often return too many results and so must rank these results. For example, a query like “What is known about Einstein?,” which may be phrased as a single-edge graph pattern isa(Einstein, $y), returns dozens if not hundreds of results, including many uninteresting ones like isa(Einstein, Entity), isa(Einstein, Organism), and isa(Einstein, Colleague). Ranking models for such results is much more difficult than for traditional search engines, as the system must consider the graph structure in both queries and results. Three general criteria must be accommodated:

Informativeness. Users prefer informative answers—salient or interesting facts, as opposed to overly generic facts or facts that are trivially known already. In the example query about Einstein the user would prefer the answers isa(Einstein, Physicist) or isa(Einstein, NobelLaureate) over the near-trivial results or a nontrivial but still less important fact like isa(Einstein, Vegetarian). However, when asking a different query about noteworthy vegetarians, the latter fact should be one of the highest-ranked results. Informativeness is a query-dependent (and potentially user-and-situation-dependent) notion;

Confidence. Users may occasionally find uncertain, dubious, or false statements in the YAGO knowledge base. Each fact is annotated with a confidence value derived from the fact’s original data sources and the extraction methods YAGO used. The quality of the sources tapped by YAGO can be quantified in terms of authority and trust measures in the spirit of Google’s PageRank, and the quality of different extractors can be empirically assessed. Various ways are available for combining these measures into a single confidence value,7,8 and high-confidence answers are preferred; and

Compactness. Whenever a query returns paths or graphs rather than individual nodes, we are interested in compact graphs and short paths. For example, a query about how Einstein and Bohr are related should return a short answer path that says something like “both are physicists,” rather than a convoluted answer like “Einstein was a vegetarian like Tom Cruise who was born in the same year Bohr died.”

A good ranking function is needed to combine all three criteria. Here, we sketch our approach for informativeness. We developed for NAGA a new kind of statistical LM for graph-structured data and queries. The parameters of the model are estimated from corpus or workload statistics. Consider the simple queries isa(Einstein; $y) about Einstein and bornIn($y; Frankfurt) about people born in Frankfurt. For the Einstein query YAGO estimates conditional co-occurrence probabilities


to compare and rank two possible answers. For the Frankfurt query YAGO computes and compares


clearly favoring Goethe as a top result.

This LM-based ranking allows NAGA to rank politicians who are also scientists with high informativeness, with Benjamin Franklin, Angela Merkel, and other prominent figures showing up in the top ranks. So while the YAGO knowledge base is primarily a Semantic-Web endeavor, the ranking for its search engine is built on Statistical-Web assets.

Despite good progress, these approaches face three notable challenges:

Scalable harvesting. Most new knowledge is produced in textual form (such as in scientific publications). Methods for natural-language IE face inherent trade-offs regarding training effort vs. easy deployment and precision vs. recall of the results. Scaling up the IE machinery for higher throughput without sacrificing quality is a formidable problem. For example, can IE tools process all blog postings on the planet at the same rate they are produced, without missing relevant facts or producing too many false positives?;

Expressive ranking. The LM-based ranking models pursued by Libra and NAGA should be extended to better capture the context of the user and the data. User context requires personalized and task-specific LMs that consider current location, time, short-term history, and intention in the user’s digital traces. Data context calls for LMs for entity-relationship graphs, aiming to better model complex patterns beyond single facts (edges) and consider types; and

Efficient search. Evaluating complex query predicates over graphs is computationally difficult. Moreover, the need for ranking suggests that the system should avoid materializing overly large numbers of results and better aim for solely computing the top-k results in a more efficient way.16

On a grander scale is the question of which is the most appropriate paradigm. The three avenues toward comprehensive knowledge harvesting—Semantic, Statistical, and Social Web—are by no means mutually exclusive. The projects outlined here combine aspects of several of these directions. Deeper understanding of feedback between and synergies from the three paradigms is an overriding theme of great potential value to researchers. Semantic-Web sources can be powerful bootstrap tools for large-scale Statistical-Web mining. Statistical-Web tools may produce many false hypotheses, but they can be assessed by Social-Web platforms with large communities of users that engage in human-computing tasks. Social-Web endeavors in turn are often grassroots catalysts for developing high-value knowledge repositories that eventually become Semantic-Web assets; examples are Wikipedia and derived knowledge bases (such as YAGO and DBpedia).

Back to Top


We have presented motivations for and approaches toward integrating the historically separated DB and IR methodologies. While deep DB/IR integration may be wishful thinking, at least for the time being, we observe strong trends toward adopting IR concepts in the DB world and vice versa. In addition to applications that must be able to manage structured and unstructured data or highly heterogeneous information sources, we also see increasing interest and success in extracting entities and relationships from text sources. The envisioned path toward automatically building and growing comprehensive knowledge bases with expressive search and ranking capabilities may take a long time to mature. In any case, it is an exciting and rewarding challenge that should appeal to and benefit from innovation in several research communities, most notably DB and IR.

Back to Top


Our work on knowledge harvesting is supported by the Excellence Cluster “Multimodal Computing and Interaction” ( funded by the German Science Foundation.

Back to Top

Back to Top

Back to Top

Back to Top


F1 Figure 1. Past, present, and future of DB and IR methods.

F2 Figure 2. Excerpt from the YAGO knowledge base.

F3 Figure 3. Example queries for the YAGO knowledge base.

Back to top

    1. Agichtein, E. Scaling information extraction to large document collections. IEEE Data Engineering Bulletin 28, 4 (Dec. 2005), 3–10.

    2. Amer-Yahia, S, and Lalmas, M. XML search: Languages, INEX, and scoring. ACM SIGMOD Record 35, 4 (Mar. 2006), 16–23.

    3. Anyanwu, K., Maduko, A., and Sheth, A. SPARQ2L: Towards support for subgraph extraction queries in RDF databases. In Proceedings of the 16th International Conference on World Wide Web (Banff, Canada, May 8–12). ACM Press, New York, 2007, 797–806.

    4. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., and Ives, Z. DBpedia: A nucleus for a Web of open data., In Proceedings of the Sixth International Semantic Web Conference (Pusan, Korea, Nov. 11–15). Springer, Berlin/Heidelberg, 2007, 722–735.

    5. Banko, M., Cafarella, M., Soderland, S., Broadhead, M., and Etzioni, O. Open information extraction from the Web. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (Hyderabad, India, Jan. 6–12, 2007), 2670–2676;

    6. Cafarella, M., Re, C., Suciu, D., and Etzioni, O. Structured querying of Web text data: A technical challenge. In Proceedings of the Third Biennial Conference on Innovative Data Systems Research (Asilomar, CA, Jan. 7–10, 2007), 225–234;

    7. Chakrabarti, S. Dynamic personalized PageRank in entity-relation graphs. In Proceedings of the 16th International Conference on World Wide Web (Banff, Canada, May 8–12). ACM Press, New York, 2007, 571–580.

    8. Cheng, T., Yan, X., and Chang, K. Entity Rank Searching entities directly and holistically. In Proceedings of the 33rd International Conference on Very Large Data Bases (Vienna, Austria, Sept. 23–27). ACM Press, New York, 2007, 387–398.

    9. Cohen, W. Integration of heterogeneous databases without common domains using queries based on textual similarity. In Proceedings of the ACM SIGMOD International Conference on Management of Data (Seattle, June 2–4). ACM Press, New York. 1998, 201–212.

    10. Cunningham, H. An introduction to information extraction. In Encyclopedia of Language and Linguistics, Second Edition, K. Brown et al., Eds., Elsevier, Amsterdam, 2005.

    11. DeRose, P., Shen, W., Chen, F., Doan, A.-H., and Ramakrishnan, R. Building structured Web community portals: A top-down, compositional, and incremental approach. In Proceedings of the 33rd International Conference on Very Large Data Bases (Vienna, Austria, Sept. 23–27). ACM Press, New York. 2007, 399–410.

    12. Etzioni, O., Cafarella, M., Downey, D., Popescu, A.-M., Shaked, T., Soderland, S., Weld, D., and Yates, A. Unsupervised named-entity extraction from the Web: An experimental study. Artificial Intelligence 165, 1 (June 2005), 91–134.

    13. Fuhr, N. and Rölleke, T. A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Transactions on Information Systems 15, 1 (Jan. 1997), 32–66.

    14. Fuhr, N. Probabilistic datalog: A logic for powerful retrieval methods. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Seattle. July 9–13). ACM Press, New York 1995, 282–290.

    15. Getoor, L. and Taskar, B., Eds. Introduction to Statistical Relational Learning, MIT Press, Cambridge, MA, 2007.

    16. Ilyas, I., Beskales, G., and Soliman, M. A survey of top-k query-processing techniques in relational database systems. ACM Computing Surveys 40, 1 (Oct. 2008), 1–58.

    17. Ipeirotis, P., Agichtein, E., Jain, P., and Gravano, L. Towards a query optimizer for text-centric tasks. ACM Transactions on Database Systems 32, 4 (Nov. 2007).

    18. Kasneci, G., Suchanek, F., Ifrim, G., Ramanath, M., and Weikum, G. NAGA: Searching and ranking knowledge. In Proceedings of the 24th International Conference on Data Engineering (Cancun, Mexico, Apr. 7–12). IEEE Computer Society, Washington, D.C., 2008, 953–62.

    19. Navarro, G. and Baeza-Yates, R. Proximal nodes: A model to query document databases by content and structure. ACM Transactions on Information Systems 15, 4(1997), 400–435.

    20. Nie, Z., Ma, Y., Shi, S., Wen, J.-R., and Ma, W.-Y. Web object retrieval. In Proceedings of the 16th International Conference on World Wide Web (Banff, Canada, May 8–12). ACM Press, New York, 2007, 81–90.

    21. Sarawagi, S. Information extraction. Foundations and Trends in Databases 1, 3 (2008), 261–377.

    22. Shen, W., Doan, A.H., Naughton, J., and Ramakrishnan, R. Declarative information extraction using datalog with embedded extraction predicates. In Proceedings of the 33rd International Conference on Very Large Databases (Vienna, Austria, Sept. 23–27). ACM Press, New York, 2007, 1033–1044.

    23. Suchanek, F., Kasneci, G., and Weikum, G. YAGO: A large ontology from Wikipedia and WordNet. Journal of Web Semantics 6, 3 (2008), 203–217.

    24. Suchanek, F., Kasneci, G., and Weikum, G. YAGO: A core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web (Banff, Canada, May 8–12). ACM Press, New York. 2007, 697–706.

    25. Theobald, M., Bast, H., Majumdar, D., Schenkel, R., and Weikum, G. TopX: Efficient and versatile top-k query processing for semistructured data. VLDB Journal 17, 1 (Jan. 2008), 81–115.

    26. Wu, F. and Weld, D. Automatically refining the Wikipedia infobox ontology. In Proceedings of the 17th International Conference on World Wide Web (Beijing, Apr. 21–25). ACM Press, New York, 2008. 635–644.

    27. Wu, F. and Weld, D. Autonomously semantifying Wikipedia. In Proceedings of the 16th ACM Conference on Information and Knowledge Management (Lisbon, Nov. 6–10). ACM Press, New York, 2007, 41–50.

    28. Zhu, J., Nie, Z., Wen, J.-R., Zhang, Bo, and Ma, W.-Y. Simultaneous record detection and attribute labeling in Web data extraction. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Philadelphia. PA, Aug. 20–23). ACM Press, New York, 2006, 494–503.


Join the Discussion (0)

Become a Member or Sign In to Post a Comment

The Latest from CACM

Shape the Future of Computing

ACM encourages its members to take a direct hand in shaping the future of the association. There are more ways than ever to get involved.

Get Involved

Communications of the ACM (CACM) is now a fully Open Access publication.

By opening CACM to the world, we hope to increase engagement among the broader computer science community and encourage non-members to discover the rich resources ACM has to offer.

Learn More