Using an Ontology to Simplify Data Access

Posted Jan 1 2003

When we turn to government for information, we expect it to be timely, thorough, and above all accurate. However, we also demand that government be split in many different ways: federal, state, and local; executive, judicial, and legislative; tax management separated from pensions and health. Moreover, government data is collected at different times by different people. The resulting heterogeneity (especially incompatible data resources) places special demands on government software systems. A good example of the data problem appears in the more than 70 U.S. Federal Statistics (FedStats) agencies that collect information about all aspects of life in the U.S. Collectively, these agencies have tens of thousands of databases, stored in numerous formats (database software, Web pages, typewritten tables, and so on), with new ones being added every day. Frequently, portions of this data overlap or are semi-complementary (an individual in one database may be part of a family in another, for example). Often, the classes of data are related, near-identical, or even identical (what is termed “salary” in one database, for example, might be exactly the same as “income” in another and “wages” in a third, but might be quite different from what some other agency means by “salary”).

Some mechanism is required to standardize data types, enable sharing, and facilitate perusal of others’ data. Ideally, such a mechanism should:

Include specialized domain terminology to support the detailed representation of fine-grained technical distinctions in the data, which allows experts to use the system;
Also include lay terminology, to enable nonexperts to quickly locate information without having to know the expert word usage; and
Support automatic inference, so that computer programs can help users find and merge information, find correspondences across data sets in the same domain, and (semi-) automatically extend the mechanism to incorporate new domains.

These desiderata make up a very tall order, for which a simple and wholly complete solution may never be found. But recent work has made some progress. We describe one variant of such a mechanism, the ontology. We define an ontology simply as a taxonomized set of terms, ranging from very general terms at the top (which give nonexpert users access points) down to very specialized ones at the bottom (which can be connected to specific columns in databases). For example, the general term “wage” may head a small subtaxonomy including Agency1-wage, Agency2-salary, and so on, each one with its own definition, associated Web pages, and of course associated data.
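As a rough illustration of this definition, a taxonomized set of terms can be modeled as a tree whose general nodes serve as entry points and whose specialized leaves point at database columns. The sketch below is not the EDC implementation; the `Term` class, the `leaf_columns` helper, and the column names such as `agency1_db.wage` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Term:
    """One node in a toy term taxonomy (all names are illustrative)."""
    name: str
    definition: str = ""
    column: Optional[str] = None  # hypothetical "database.column" reference
    children: List["Term"] = field(default_factory=list)


def leaf_columns(term: Term) -> List[str]:
    """Collect the database columns reachable beneath a term, depth-first."""
    found = [term.column] if term.column else []
    for child in term.children:
        found.extend(leaf_columns(child))
    return found


# A general, lay-level term heading two agency-specific terms.
wage = Term(
    "wage",
    "general entry point for nonexpert users",
    children=[
        Term("Agency1-wage", "Agency 1's definition", column="agency1_db.wage"),
        Term("Agency2-salary", "Agency 2's definition", column="agency2_db.salary"),
    ],
)

columns = leaf_columns(wage)
print(columns)
```

Starting from the general term and walking downward is what lets a nonexpert reach agency-specific data without knowing each agency's vocabulary.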

Our ontology is being used in a prototype system called the Energy Data Collection (EDC) system [1]. This system is being built by members of the Digital Government Research Center (DGRC; www.dgrc.org), which consists of faculty, staff, and students at the Information Sciences Institute (ISI) of the University of Southern California (USC) and Columbia University’s Computer Science Department and its Center for Research on Information Access.

The EDC project was started in the NSF’s Digital Government Program in 1999. We are working with representatives from the Census Bureau, the Bureau of Labor Statistics, the Department of Energy’s Energy Information Administration (EIA), and the California Energy Commission. For example, the EIA’s www.eia.doe.gov provides extensive monthly energy data to the public. This site receives hundreds of thousands of hits a month, even though most of its information is available only in standardized HTML Web pages or prepared PDF documents, and only for the last few years. The current facility thus supports only limited access to this very rich data source.

In order to support more dynamic yet homogeneous access to multiple energy databases, the EDC system includes three principal components detailed as follows:

Interface. The interface allows users to construct data requests by ontology browsing, natural language type-in, or cascaded menus. A completed request is dispatched to the query processor, which returns data tables and graphs to the interface for display.

Information access planner. The query processor employs USC/ISI’s SIMS system [2], which decomposes data requests into database queries according to the content and nature of the data sources, retrieves data from them, and reassembles the results appropriately. Since we have incorporated over 50,000 energy-related data tables of various kinds, SIMS uses a model of the data that identifies and describes their contents. This domain model, which unifies the various databases’ metadata descriptions, forms the lowermost portion of the ontology. A fragment of the EDC model (about 500 nodes, manually defined) is shown in Figure 1. A typical query includes some type of gasoline (chosen from the top-right cluster), some quality grade (bottom right), some area of interest (bottom left), and so on.
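The decompose, retrieve, and reassemble cycle described here can be sketched in a few lines. This is a toy illustration only and does not reproduce SIMS, which is a far more capable planner; the source names, the `describes` field, the `plan` and `execute` functions, and every data value are invented for the example.

```python
from typing import Dict, List

# Toy sources; every name and number here is invented for illustration.
SOURCES: Dict[str, dict] = {
    "source_a": {
        "describes": {"product": "gasoline", "measure": "price"},
        "rows": [
            {"area": "California", "grade": "regular", "value": 1.0},
            {"area": "Texas", "grade": "regular", "value": 2.0},
        ],
    },
    "source_b": {
        "describes": {"product": "gasoline", "measure": "price"},
        "rows": [{"area": "California", "grade": "premium", "value": 3.0}],
    },
    "source_c": {
        "describes": {"product": "diesel", "measure": "price"},
        "rows": [{"area": "California", "grade": "regular", "value": 4.0}],
    },
}

MODEL_KEYS = ("product", "measure")  # terms matched against source descriptions


def plan(request: Dict[str, str]) -> List[str]:
    """Select the sources whose description matches the request's model terms."""
    wanted = {k: request[k] for k in MODEL_KEYS if k in request}
    return [
        name
        for name, src in SOURCES.items()
        if all(src["describes"].get(k) == v for k, v in wanted.items())
    ]


def execute(request: Dict[str, str]) -> List[dict]:
    """Run one filtered sub-query per selected source and merge the rows."""
    filters = {k: v for k, v in request.items() if k not in MODEL_KEYS}
    merged = []
    for name in plan(request):
        for row in SOURCES[name]["rows"]:
            if all(row.get(k) == v for k, v in filters.items()):
                merged.append({"source": name, **row})
    return merged


request = {"product": "gasoline", "measure": "price", "area": "California"}
result = execute(request)
print(result)
```

The point of the sketch is the division of labor: the source descriptions (a stand-in for the domain model) decide which sources are consulted, and only then are the remaining constraints pushed down to each source as filters.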

Ontology as metadata. It is not simple to unify different databases’ metadata and/or domain terms, and to create a coherent domain model. As Figure 1 shows, the various clusters represent independent and quite different concepts: gasoline type and quality, geographic region, units of measurement, and so on. In order to place these concepts in a single coherent framework, which will also facilitate the future addition of domains and databases dealing with very different information, we used as the overarching ontology USC/ISI’s 70,000-node terminological taxonomy SENSUS [8]. SENSUS is a rearrangement and extension of Princeton’s WordNet 1.6 [4], retaxonomized under USC/ISI’s Penman Upper Model [3] (built to support natural language processing). SENSUS can be accessed using the browser DINO at edc.isi.edu:8011/dino or its predecessor Ontosaurus at mozart.isi.edu:8003/sensus/sensus_frame.html.

What makes our approach new is an attempt to automate much of the ontology and domain model construction. In order to facilitate model building and ontologization, we have developed algorithms that:

Identify and extract from data sources terms likely to be important for domain modeling;
Cluster them into mini-taxonomies [6]; and
Create mappings/alignments of terms and clusters into the ontology [5, 6].

In addition, our Columbia University partners have investigated extracting and analyzing terms from online glossaries [7].
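To make the alignment step concrete, the following sketch proposes candidate ontology concepts for a domain term using plain string similarity from Python's standard `difflib` module. It is a deliberately simple stand-in and not the alignment algorithms of [5, 6]; the concept list, the `suggest_alignments` function, and the threshold value are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

# A handful of concept names standing in for a large ontology (illustrative).
ONTOLOGY_CONCEPTS = [
    "price",
    "cost",
    "gasoline",
    "geographic region",
    "unit of measurement",
]


def suggest_alignments(
    term: str,
    concepts: List[str],
    top_n: int = 3,
    threshold: float = 0.5,
) -> List[Tuple[float, str]]:
    """Rank concepts by name similarity to the term, best candidates first.

    The output is a list of suggestions for a person to review,
    not a set of confirmed links.
    """
    scored = [
        (SequenceMatcher(None, term.lower(), concept.lower()).ratio(), concept)
        for concept in concepts
    ]
    kept = [pair for pair in scored if pair[0] >= threshold]
    return sorted(kept, reverse=True)[:top_n]


candidates = suggest_alignments("Price", ONTOLOGY_CONCEPTS)
print(candidates)
```

Name similarity alone will miss synonyms and produce spurious matches, which is one reason suggestions of this kind are better treated as candidates for review than as final mappings.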

The ontology for the EDC project has the structure shown in Figure 2. To create it, we identified the principal domain terms, manually defined the domain model of approximately 500 nodes to represent the concepts present in the EDC gasoline domain, and linked these domain concepts into SENSUS using the semiautomated alignment algorithms.

We needed two types of links for this work. The links between data sources and domain model terms express logical equivalences, as required to ensure the correctness of SIMS reasoning. They must therefore be checked manually. To connect concepts in the upper ontology and the domain model, we defined a new type of link called “generally-associated-with” (GAW). GAW links let the user, while browsing, proceed rapidly from high-level concepts to the domain model terms associated with real data in the databases. In contrast to domain model links, the semantics of GAW links are intentionally vague. This vagueness allows us to link a specifically defined domain model term (such as price) to very disparate (though still thematically related) SENSUS concepts (such as price, cost, money, charge, dollar, amount, fee, payment, paying, and so on). Clearly, these links cannot support automated inference. They can, however, help the nonexpert user to start browsing or forming queries using whatever terms are most familiar. In addition, the vague semantics have a fortunate side effect: they facilitate automated alignment of concepts from the domain model to SENSUS. Since the alignment techniques are still not very accurate, we cannot employ them without considerable manual intervention where logically strict equivalence links are required. For GAW links, however, they are quite well suited.
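The distinction between the two link types can be illustrated with two separate mappings: a loose one used only for browsing and a strict one that ties domain model terms to actual data. This is a minimal sketch under assumed names; `GAW`, `EQUIVALENT`, the `dm:price` identifier, and `source_a.price_column` are hypothetical and do not come from the EDC system.

```python
from collections import defaultdict
from typing import Dict, List

# GAW links: upper-ontology concept -> domain model terms (loose, browsing only).
GAW = defaultdict(set)
# Equivalence links: domain model term -> data source columns (strict, hand-checked).
EQUIVALENT = defaultdict(set)

# Many familiar words lead to the same precisely defined domain model term.
for concept in ("price", "cost", "money", "fee", "payment"):
    GAW[concept].add("dm:price")

# The strict link to real data is kept in a separate structure.
EQUIVALENT["dm:price"].add("source_a.price_column")


def browse(concept: str) -> List[str]:
    """Follow GAW links from a familiar word to domain model terms."""
    return sorted(GAW.get(concept, ()))


def columns_for(domain_term: str) -> List[str]:
    """Follow equivalence links from a domain model term to data columns."""
    return sorted(EQUIVALENT.get(domain_term, ()))


def entry_points(concept: str) -> Dict[str, List[str]]:
    """Show which domain terms, and which data, a familiar word leads to."""
    return {term: columns_for(term) for term in browse(concept)}


reachable = entry_points("fee")
print(reachable)
```

Keeping the two mappings apart means that any reasoning over the strict links never has to trust a GAW link, while browsing can still start from whichever word the user happens to know.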

Figures

Figure 1. Fragment of the EDC domain model.

Figure 2. Ontology and domain models.


Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

DOI

10.1145/602421.602447

January 2003 Issue

Published: January 1, 2003

Vol. 46 No. 1

Pages: 47-49
