Research and Advances / Artificial Intelligence and Machine Learning / Ontology Applications and Design

Making Ontologies Work For Resolving Redundancies Across Documents

Producing normalized representations from different ways of expressing the same idea.
  1. Introduction
  2. System Overview
  3. Ontological Design for Representing Natural Language
  4. Normalizing Comparisons
  5. Conclusion
  6. References
  7. Authors
  8. Footnotes
  9. Figures

Knowledge management efforts over the past decade have produced many document collections focused on particular domains. As such systems scale up, they become unwieldy and ultimately unusable if obsolete and redundant content is not continually identified and removed.

We are working with such a knowledge-sharing system at Xerox, focused on the repair of photocopiers. Called Eureka, it now contains about 40,000 technician-authored free text documents, in the form of tips on issues not covered in the official manuals. Figure 1 shows a pair of similar tips from this corpus. Our goal is to build a system that can identify such conceptually similar documents, regardless of how they are written; identify the parts of two documents that overlap; and identify parts of the documents that stand in some relation to each other, such as expanding on a particular topic or being in mutual contradiction. Such a system will enable the maintenance of vast document collections by identifying potential redundancies or inconsistencies for human attention.

This task requires extensive knowledge of language and of the world, as well as a rich representation language. Moreover, assessing similarity imposes conflicting requirements on the underlying ontology: the representations must capture enough of the nuances of natural language to be sufficiently discriminating, yet the ontology must support the normalization of differing representations of similar content to enable the detection of similarities.

At this point in our research, we have built a prototype system that embodies the knowledge necessary to analyze a test set of 15 pairs of similar tips. Although this system is far from complete, in the course of our work we have developed several design criteria for ontologies that support comparisons of natural language texts. In [2], we discuss the need for reified contexts to handle the representation of nonexistent situations and objects, and how reasoning with types and their instantiations can help. In this article, we focus on ways to produce normalized representations in our ontology from a wide range of different ways of expressing the same idea. We then describe a particular mechanism for normalizing frequently occurring comparative constructions, such as x is deeper than y and y is shallower than x, to a common representation.


System Overview

To create useful representations of natural language text, we first obtain a compact representation of the syntactic and semantic structures for each sentence, using the Xerox Linguistic Environment, a deep parser based on Lexical Functional Grammar theory [3]. From these sentence structures, we automatically construct conceptual representations of the text based on our ontology.

A constrained graph matching of these representations determines the overall degree of conceptual similarity between two texts, while also identifying areas of overlap and conflict. For matching, we use the Structure Mapping Engine (SME) [6], an implementation of the structure mapping theory of analogy [7]. SME anchors its matching process in identical elements that occur in the same structural positions in the base and target representations. From this, it builds a correspondence subject to two constraints: preservation of one-to-one correspondences between base and target elements, and identicality of the aforementioned anchors. The larger the structure that can be recursively constructed in this manner, the greater the similarity score.
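
To make the anchoring and one-to-one constraints concrete, here is a minimal sketch in Python; it is not the actual SME implementation, and the expression structure, predicate names, and flat scoring scheme are illustrative assumptions.

```python
# A minimal sketch of identicality-anchored structure matching (not the actual
# SME algorithm). Expressions are nested tuples: (predicate, arg1, arg2, ...).

def match(base, target, mapping):
    """Recursively align base and target, requiring identical predicates
    (the anchors) and one-to-one entity correspondences; the returned score
    grows with the size of the aligned structure."""
    if not isinstance(base, tuple) or not isinstance(target, tuple):
        prior = mapping.get(base)
        if prior is not None and prior != target:
            return 0                      # base entity already mapped elsewhere
        if prior is None and target in mapping.values():
            return 0                      # target entity already taken
        mapping[base] = target
        return 1
    if base[0] != target[0] or len(base) != len(target):
        return 0                          # anchor fails: non-identical relations
    score = 1
    for b_arg, t_arg in zip(base[1:], target[1:]):
        sub = match(b_arg, t_arg, mapping)
        if sub == 0:
            return 0                      # a relation matches only if all args do
        score += sub
    return score

base = ("causes", ("ShortCircuit", "wire"), ("Burn", "gear"))
target = ("causes", ("ShortCircuit", "harness"), ("Burn", "cog"))
print(match(base, target, {}))  # → 5: the full causal structure aligns
```

The larger the recursively aligned structure, the higher the score, mirroring SME's preference for deep, systematic correspondences.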


Ontological Design for Representing Natural Language

Reasoning systems that receive well-specified input can utilize carefully constrained ontologies that capture exactly the set of concepts necessary for the task at hand. In contrast, our ontology must be more expressive, in order to accept arbitrary natural language input (albeit with strong expectations about the nature of the content). At one extreme, we could define a separate concept for each sense of each word. For example, we might define BreakDamage, BreakInterrupt, and BreakRecuperate to represent the meaning of break in the following sentences:

  • (1) The left cover broke in half.
  • (2) The sheet of paper breaks the light beam.
  • (3) Before doing step 3, you might want to break for coffee.

This approach is tantamount to modeling the language in which the tips are written. It would require a vast ontology, as we would also have to represent equivalence classes, to match synonymous words. WordNet [5], a lexical database that is grounded in cognitive theories of human memory, provides an example of this approach. It contains on the order of 110,000 synsets, or classes of synonymous words.

However, even this fine-grained coverage is inadequate for our purposes. For example, none of the seven senses in WordNet for the word remove captures the use of the word to denote a cleaning event in the sentence

  • (4) Use a soft brush to remove any toner from the rollers.

This is not a shortcoming of WordNet, because there is no sense of remove per se that denotes cleaning. The inference of a cleaning activity arises from world knowledge applied to the entire sentence. In this case, what is being removed, toner, is a form of dirt, and removing dirt is plausibly an abstract description of a cleaning event.

WordNet maps individual words onto synsets, not ontological concepts. For our purposes, such a mapping is inadequate. We need a richer relational structure that only an ontology can support and the means to compose concepts dynamically. (However, there are interesting applications of WordNet for semiautomatically constructing ontologies, such as in [4].)

Natural language texts generally contain descriptions of causally or sequentially related events in which entities play particular roles. An ontology that supports the representation of textual content must therefore at least comprise events and entities. Events have a richer structure than objects, including critical role relations for their participants, so an adequate ontology must also include such relations. For example, a cleaning event has roles for the agent doing the cleaning, the object being cleaned, the instrument for accomplishing the cleaning, and possibly the dirt or other contamination that is being removed.

Having these thematic roles in our ontology enables us to abstract away from grammatical relations in our representations of natural language texts. This enables us to compare events described at differing levels of specificity. For example, cleaning object y with instrument x is a more specific kind of event than cleaning object y or cleaning with instrument x, which are in turn more specific event types than cleaning. A description logic approach (see, for example, [1]) is well suited for capturing such distinctions systematically and economically, by enabling the composition of new subconcepts within the ontology.
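
As a sketch of how role restrictions compose subconcepts, consider the following; the concept and role names are hypothetical, and the actual system uses a description logic rather than Python.

```python
# Hypothetical sketch of description-logic-style composition: a subconcept is
# a base event concept plus a set of (role, filler) restrictions.

from dataclasses import dataclass

@dataclass(frozen=True)
class Concept:
    name: str
    roles: frozenset = frozenset()

def restrict(concept, role, filler):
    """Compose a more specific subconcept by adding one role restriction."""
    return Concept(concept.name, concept.roles | {(role, filler)})

def subsumes(general, specific):
    """general subsumes specific if its restrictions are a subset."""
    return general.name == specific.name and general.roles <= specific.roles

Cleaning = Concept("Cleaning")
CleanRollers = restrict(Cleaning, "object", "rollers")        # cleaning object y
CleanRollersWithBrush = restrict(CleanRollers, "instrument", "soft-brush")

print(subsumes(Cleaning, CleanRollersWithBrush))      # True: more general
print(subsumes(CleanRollersWithBrush, CleanRollers))  # False: more specific
```

Composition keeps the ontology economical: the specific event types needed for a given text are built on demand rather than enumerated in advance.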

The question still remains of which types of events to represent, or more generally, the appropriate resolution of the ontology. For example, consider the event of damage or incapacitation of some sort. English contains around 40 verbs for events of incapacitation. Some, such as trample, cut, and rip, indicate the means, but sanction no inference about the resulting extent of the damage. Others, such as ruin, raze, and destroy, indicate damage of uncertain means but ultimate extent. Yet others, such as splinter, shatter, and crack, indicate the final state and possibly the nature of the material in question.

Figure 2 presents one possible representation of such events. The leaves of this tree are concepts that correspond to the Incapacitation sense of their label (so, for instance, we exclude the golfing sense of slice from consideration). Which details are relevant, and thus should enter the representation, ultimately depends on the nature of the document collection being represented. For example, an event in which a car door is dented is plausibly mundane, perhaps a parking lot fender-bender. In contrast, an event in which a car door is torn off is probably the result of a severe accident.

Such inferences arise from combinations of events with particular entity types in role relationships. Representing the underlying knowledge would require thousands of axioms. Rather than focus on this aspect of the problem, our approach is to start with a basic ontology and add detail to improve the system’s performance. We conjecture that complex inferences, such as those about the car door previously mentioned, will often arise more directly from other parts of the text.

Notice that even without representing the difference between tear and dent, Figure 2 presents three levels of abstraction, from Incapacitate down to ForceDestroy and its sibling concepts. Consider the following, to illustrate how simple differences in expression can drive representations far apart in this ontology:

  • (5) The short circuit burns the nearby drive gear.
  • (6) The heat from the short circuit cracks the drive gear.

With the ontology depicted in Figure 2, the drive gear in (5) would participate in a Destroy event, whereas the gear in (6) would be the object of state change in a ForceDamage event.

However, each of these sentences contains a short causal sequence:

  • (7) ShortCircuitEvent → HeatingEvent¹ → Burn(Gear)
  • (8) ShortCircuitEvent → HeatingEvent → Crack(Gear)

Clearly, these sentences are similar, despite the different representation of the damage to the gear. However, matching (7) and (8) will require a more complex comparison operator than identicality-anchored correspondences. One approach is minimal ascension in the ontological hierarchy. In this case, Burn and Crack share a common ancestor, Damage, so at a higher level of abstraction, representations of these sentences match.

Traversing subsumption relations in search of a common ancestor will not always result in an accurate similarity assessment. Obviously, anything can be matched to anything else at a sufficiently high level of abstraction (such as Thing or Event). One way to address this problem would be to assign diminishing weights to matches in proportion to the number of taxonomic links traversed.
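
A sketch of this weighting scheme follows, assuming a toy parent map loosely modeled on Figure 2 and an arbitrary decay factor.

```python
# Sketch of minimal ascension with link-count discounting. The parent map is a
# toy stand-in for the taxonomy of Figure 2; the decay factor is arbitrary.

parent = {
    "Burn": "ThermalDamage", "Melt": "ThermalDamage",
    "Crack": "ForceDamage", "Tear": "ForceDamage",
    "ThermalDamage": "Damage", "ForceDamage": "Damage",
    "Damage": "Incapacitate", "Destroy": "Incapacitate",
}

def ancestors(concept):
    """Chain from a concept up to the taxonomy root."""
    chain = [concept]
    while concept in parent:
        concept = parent[concept]
        chain.append(concept)
    return chain

def ascension_score(a, b, decay=0.5):
    """Match score diminishes with the number of taxonomic links traversed
    to reach the nearest common ancestor."""
    chain_a, chain_b = ancestors(a), ancestors(b)
    for i, concept in enumerate(chain_a):
        if concept in chain_b:
            links = i + chain_b.index(concept)
            return decay ** links
    return 0.0

print(ascension_score("Burn", "Burn"))   # → 1.0: identical concepts
print(ascension_score("Burn", "Crack"))  # → 0.0625: ancestor Damage, 4 links
```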

This may help, but will not resolve the problem. At issue is the nature of the taxonomic relation, which can vary considerably in degree and type over the hierarchy. For example, the distinction made in Figure 2 between ForceDamage and ThermalDamage is one of means, but the distinction between Damage and Destroy is one of extent. Even worse, the Disable event may introduce a dimension of intentionality. In some contexts, a Crack event may be more similar to a Ruin event than to a Tear event, even though Crack and Tear are siblings and Ruin is at a different level in the hierarchy, five links away from Crack.

Description logic formalisms allow for arbitrary binary relations, or roles, to hold between concepts. We can take advantage of this mechanism to minimize the number of distinct concepts, thus reducing or eliminating the need to traverse subsumption links. The role relationships enable us to retain the resolution we lose in reducing the number of concepts by making the remaining concepts richer. For example, we can define a single Damage concept that has four roles: Extent, Material, EndState, and Means, as shown in Figure 3. The representation of a Melt event becomes a Damage event that has a Flammable Material, a Deformed EndState, a Means of Heating, and either a Partial or a Total Extent.

The ontology can express more specific concepts, so Melt may well be reified, particularly if it occurs frequently in the domain. However, there is a middle level of abstraction, where concepts are broad enough to maximize the likelihood of matching yet specific enough to minimize the likelihood of spurious matches. Damage is an example of such a concept, and therefore has the value of MidLevel for the metaproperty CategoryLevel.² It is a design criterion for our ontology that it can support the use of such metaproperties.

The matching process starts by looking for matches between MidLevel concepts. Failure here is strong evidence of dissimilarity. Success at the MidLevel, however, requires either a match between LowLevel categories (such as Melt), or a more expensive comparison of the properties of the base and target events, to ensure that there are no conflicts (for example, our knowledge base may mark an EndState of Pierced as incompatible with an EndState of Torn). Since most texts are different, we only incur this greater cost for promising candidates.
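
The staged comparison can be sketched as follows, with hypothetical role fillers and an assumed incompatibility table standing in for the knowledge base.

```python
# Sketch of the staged comparison: match MidLevel categories first, then check
# role fillers for conflicts. Fillers and the incompatibility table are assumed.

def make_event(midlevel, **roles):
    return {"category": midlevel, **roles}

Melt = make_event("Damage", Material="Flammable", EndState="Deformed",
                  Means="Heating", Extent="Partial")
Pierce = make_event("Damage", EndState="Pierced", Means="Force")

INCOMPATIBLE = {frozenset({"Pierced", "Torn"}),      # e.g., the KB marks these
                frozenset({"Pierced", "Deformed"})}  # EndStates as conflicting

def compatible(base, target):
    """MidLevel mismatch is strong evidence of dissimilarity; otherwise
    compare shared roles, failing on any known conflict."""
    if base["category"] != target["category"]:
        return False
    for role in (base.keys() & target.keys()) - {"category"}:
        if frozenset({base[role], target[role]}) in INCOMPATIBLE:
            return False
    return True

print(compatible(Melt, Pierce))  # False: Deformed vs. Pierced EndStates clash
```

The cheap category test filters out most pairs, so the more expensive role-by-role comparison runs only on promising candidates.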

The description logic approach to the construction of our ontology provides advantage here by exposing all the properties of events to the similarity algorithm, thus enabling a fine-grained comparison. Plausible reasoning algorithms can efficiently produce similarity assessments that are explicitly based on the particular properties of the events.


Normalizing Comparisons

Matching based on midlevel concepts and role comparisons is one mechanism for assessing similarity, but we need others to handle some common linguistic constructions. For example, comparisons occur frequently in our domain, and require additional representational machinery, which we now describe.

The tip on the left in Figure 1 states that removing the plastic sheath from the cable makes it more flexible, which prevents it from breaking. In the tip on the right, we find the plastic makes the cable too stiff, which causes it to snap. In both cases, the underlying situation is the same, the rigidity of the cable being too high for normal operation, and the end result is similar: the cable breaks. At issue is how we are to determine that the descriptions, one containing more flexible and the other too stiff, are in fact similar.

Natural language often contains two terms for a given dimension, such as high/low for height, deep/shallow for depth, or hot/cold for temperature, where one term implicitly encodes high values and the other low values along the dimension. To enable matching, the ontology must reify dimensions. This places a greater burden on the mechanism for transforming natural language text into our representation.

Specifically, we must represent a unique role for dimension concepts, a polarity, or normative direction of comparison. Consider the dimension of Rigidity, which we can describe qualitatively as a scalar value that has a range from Low to High. Making a cable more flexible results in a decrease in Rigidity, a movement along the dimension toward the Low end. In contrast, a cable that is too stiff has a High degree of Rigidity. The polarity of Rigidity, therefore, explicitly marks the High extreme of the scale as positive.

We also need to represent knowledge of the polarity implicit in dimensional adjectives. For example, High flexibility is equivalent to Low Rigidity, so we represent flexible as a negative-polarity predication. This knowledge about the dimension and its associated adjectives enables the transformation mechanism to produce a normalized representation of comparisons.
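
A sketch of this normalization follows, assuming a small hand-built lexicon of dimensional adjectives; the entries and modifier handling are illustrative.

```python
# Sketch of polarity normalization. Each dimensional adjective maps to a
# dimension and a polarity; +1 marks the High end of the scale.

LEXICON = {
    "stiff":    ("Rigidity", +1),
    "flexible": ("Rigidity", -1),   # High flexibility == Low Rigidity
    "deep":     ("Depth",    +1),
    "shallow":  ("Depth",    -1),
}

def normalize(adjective, modifier):
    """Return (dimension, direction): +1 for movement toward the High end."""
    dimension, polarity = LEXICON[adjective]
    sign = +1 if modifier in ("more", "too", "very") else -1  # "less" flips it
    return (dimension, polarity * sign)

# "more flexible" and "less stiff" yield the same normalized representation:
print(normalize("flexible", "more") == normalize("stiff", "less"))  # True
```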

The choice of which extreme of a dimension to consider positive is arbitrary, so long as all representations adhere to the same convention. In many cases, language usage provides information to guide the choice. For example, (9) is felicitous, whereas (10) is not:

  • (9) That hole is five feet deep.
  • (10) That hole is five feet shallow.

Taking this cue from the language to define large values of depth as positive, we reduce the chance of knowledge engineering mistakes stemming from intuitive understanding of the predicates. Ultimately, we hope to use such evidence to make inferences about the author’s intent.

The next step after dimensional normalization is the representation of the comparison itself. There are three types of comparisons: to another entity; to a quantity, either a numerical value or a landmark value, such as the boiling point of water; and to an extreme degree. In the case of comparisons across entities, the desired representation for both

  • (11) The upper socket is deeper than the lower socket

and

  • (12) The lower socket is shallower than the upper socket

is

  • (13) (GreaterThan (Depth UpperSocket) (Depth LowerSocket))
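
A sketch of how sentences (11) and (12) normalize to the single form (13), reusing an assumed adjective-polarity lexicon:

```python
# Sketch: both orderings of an entity comparison normalize to one canonical
# GreaterThan structure. The lexicon entries are illustrative.

LEXICON = {"deep": ("Depth", +1), "shallow": ("Depth", -1)}

def normalize_comparison(subject, adjective, obj):
    """Map 'subject is <adjective>-er than obj' to a canonical form."""
    dimension, polarity = LEXICON[adjective]
    greater, lesser = (subject, obj) if polarity > 0 else (obj, subject)
    return ("GreaterThan", (dimension, greater), (dimension, lesser))

# Sentences (11) and (12) produce the identical representation (13):
print(normalize_comparison("UpperSocket", "deep", "LowerSocket"))
print(normalize_comparison("LowerSocket", "shallow", "UpperSocket"))
```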

Comparison to a norm is particularly common in texts concerning repairs. Consider this sentence, from the left side of Figure 1:

  • (14) The left cover safety cable is breaking, allowing the left cover to pivot too far, breaking the cover.

Pivoting is a normal function of the left cover; the problem arises only because it has pivoted too far, that is, beyond its normal functional range. The representation for this is along the lines of:

  • (15) (ExcessiveHighAmount (AngularDistance RotationEvent))
         (ObjectRotating RotationEvent LeftCover)

Note that in this case the system must infer the relevant dimension, AngularDistance, from the RotationEvent and the too far comparison. Reified dimensions enable the system to represent this as the dimension role for this RotationEvent.

Finally, the representation of extreme degree, as in

  • (16) The cable is very flexible

is along the lines of

  • (17) (ExtremeLowAmount (Rigidity Cable)).

In contrast to representations of excessive amounts, an extreme amount does not sanction an inference of abnormal function or failure. For example, a very flexible cable might be highly desirable, which will not be the case for a cable that is too flexible.
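
This distinction can be encoded directly in the predicate vocabulary; here is a toy sketch, assuming the predicate-naming convention of (15) and (17).

```python
# Toy sketch: only Excessive* predicates license an inference of malfunction;
# Extreme* predicates do not. The naming convention mirrors (15) and (17).

def sanctions_malfunction(representation):
    predicate = representation[0]
    return predicate.startswith("Excessive")

too_far = ("ExcessiveHighAmount", ("AngularDistance", "RotationEvent"))
very_flexible = ("ExtremeLowAmount", ("Rigidity", "Cable"))

print(sanctions_malfunction(too_far))        # True: beyond the normal range
print(sanctions_malfunction(very_flexible))  # False: may be desirable
```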

Conclusion

We have discussed two design criteria for ontologies to support our task of finding similarities and redundancies across documents: the use of metaproperties in an ontology to support tasks such as identifying the appropriate level of abstraction in representation, and the normalization of dimensions for comparatives. In general, we are balancing adequacy in expressiveness against complexity in similarity reasoning.

Our system currently normalizes the dimensional comparisons that occur in our test set of 15 similar pairs of documents, and exploits mid-level categories to make similarity assessments. We expect that these criteria, along with others that will emerge as our research progresses, will form the basis for defining a powerful yet tractable ontology for large-scale knowledge extraction from documents.

Figures

Figure 1. Example of Eureka tips.

Figure 2. A direct word-to-concept ontology for the concept Incapacitate.

Figure 3. A description logic-based ontology for Damage.

References

    1. Brachman, R.J., McGuinness, D.L., Patel-Schneider, P.F., and Borgida, A. Reducing CLASSIC to practice: Knowledge representation theory meets reality. Artificial Intelligence 114, 1–2 (1999), 203–237.

    2. Condoravdi, C., Crouch, R., Everett, J.O., de Paiva, V., Stolle, R., Bobrow, D., and van den Berg, M. Preventing existence. In Proceedings of the Second International Conference on Formal Ontology in Information Systems. ACM Press, New York, NY, 2001, 162–173.

    3. Dalrymple, M. Syntax and Semantics: Lexical Functional Grammar. Vol. 34. Academic Press, San Diego, CA, 2001.

    4. Fabriani, P., Missikoff, M., and Velardi, P. Using text processing techniques to automatically enrich a domain ontology. In Proceedings of the Second International Conference on Formal Ontology in Information Systems (Ogunquit, ME). ACM Press, New York, NY, 2001, 270–284.

    5. Fellbaum, C. ed. WordNet: An Electronic Lexical Database. The MIT Press, Cambridge, MA, 1998.

    6. Forbus, K.D., Falkenhainer, B., and Gentner, D. The structure mapping engine: Algorithm and examples. Artificial Intelligence 41, 1 (1989), 1–63.

    7. Gentner, D. Structure-mapping: A theoretical framework for analogy. Cog. Sci. 7 (1983), 155–170.

    8. Rosch, E. Principles of categorization. In E. Rosch and B.B. Lloyd, Eds., Cognition and Categorization. Erlbaum, Hillsdale, NJ, 1978.

Footnotes

    1. Note that the system must infer the existence of the HeatingEvent in (7) from world knowledge about ShortCircuitEvents and BurnEvents.

    2. This is consistent with a cognitive account of category formation that stresses the primacy of categories such as chair over the more general furniture and the more specific loveseat. See [8].
