Evaluating semantic similarity of concepts is a problem that has been extensively investigated in the literature in different areas, such as Artificial Intelligence, Cognitive Science, Databases, and Software Engineering. Currently, it is growing in importance in different settings, such as digital libraries, heterogeneous databases, and, in particular, the Semantic Web. In such contexts, very often concepts are organized according to a taxonomy (or a hierarchy) and, in addition, are associated with structures (also referred to as feature vectors). With this regard, in general, the concept similarity measures proposed in the literature have not been conceived to address both these levels of information (taxonomy and structure), i.e., there exist contributions focusing on the similarity of hierarchically related concepts, and other proposals conceived to compare concept feature vectors. In this article, a method for evaluating similarity of concepts is presented, where both concept taxonomy and concept structures are considered. In particular, such a method has been defined by combining and revisiting (i) the information content approach,9 with regard to the comparison of concepts within the taxonomy, and (ii) a method inspired by the maximum weighted matching problem in bipartite graph,4 with regard to the comparison of feature vectors. The proposed approach is then compared with two among the most representative similarity measures defined in the literature, and a small data set shows how the proposed measure allows us to reduce the gap existing between them.
Let us consider a very general knowledge base (KB), essentially defined by a set of concepts that are organized according to a generalization (ISA) hierarchy, where each concept may be associated with a structure, or a feature vector, containing the properties describing the concept. Note that the proposed method can be suitably refined to deal with specific KBs, such as Object-Oriented KBs, Geographical KBs, XML-Schemas, GML-schemas, etc. (see, for instance, Formica3). In this article, a very simple and abstract data model has been addressed in order to present the essence of the method, without dealing with aspects that pertain to specific contexts.
A concept has a name, an enumerated set of super-concepts (taxonomic information), and a tuple of typed propertiesa (structural information). For instance, consider the following set of concepts:
where SSN, EIN, and VIN stand for Social Security Number, Employer Identification Number, and Vehicle Identification Number, respectively. Therefore, a concept has a left-hand side, defined by the name of the concept, and a right-hand side containing the hierarchical and/or structural information. For instance, in the case of the concept of name person, on the right-hand side only a structural information is present, i.e., two typed properties, namely name and SSN, both of type string. In the case of student, in addition to the structural information (the property college of type string), we also have a taxonomic information, expressed by the ISA construct. In particular, student has a super-concept, namely person, whose typed properties will be inherited. Inheritance is a well-known problem that has been extensively investigated in the literature. In the case of the set of concepts above, after inheritance, the following holds:
where all the ISA constructs have been removed, and concepts are expressed in terms of their typed properties only. Note that properties can be typed with atomic types (such as string or integer) or concepts names, as in the case of the property owner of railcar, that is typed with person.
In order to keep trace, after inheritance, of the ISA relations defined in the original KB, the previous set of concepts is coupled with the concept hierarchy of Figure 1. Such a hierarchy contains, for instance, an arc between student and person, since person was a super-concept of student. Note that it also includes properties, such as college or salary, that are concepts as well, i.e., their names can be associated withpossibly emptysets of super-concepts and typed properties. Furthermore, suppose it also contains arcs among VIN, SSN, EIN, and id_number (Identification number). The root of the concept hierarchy is labeled by Top which represents the most general concept. Dotted lines stand for paths of arbitrary length.
Traditionally, in order to evaluate the semantic similarity of hierarchically related concepts, the edge-counting approach is adopted. This approach is essentially based on the distance between nodes of the taxonomy, i.e., the shorter the path, the more similar the concepts these nodes represent. However, the main drawback of this approach is that links in the taxonomy do not represent uniform distances. In fact, the level of refinement between, for instance, the concept safety valve and valve is not comparable to the level of refinement between, for instance, knitting machine and machine.9 For this reason, a different approach, referred to as the information content approach, has been proposed in the mentioned paper, and successively refined,6 which does not depend on path lengths. It is based on the association of probabilities with the concepts of the hierarchy. In particular, the probability of a concept c, p(c), is defined as
where freq(c) is the frequency of the concept c estimated using noun frequencies from large text corpora, as for instance the Brown Corpus of American English, and M is the total number of observed instances of nouns in the corpus. In our example, probabilities have been assigned according to the SemCor project,2 which labels subsections of the Brown Corpus to senses in the WordNet lexicon.10 According to SemCor, the total number of observed instances of nouns is 88, 312. Below, the definitions of some of the previous concepts and the related frequencies (the numbers in parenthesis) are given:
Once probabilities have been associated with concepts, we obtain the weighted concept hierarchy of Figure 2, where Top, the most general concept, has probability 1. Note that each concept that occurs in the corpus is counted as an occurrence of each more abstract concept up in the ISA hierarchy. For instance, in Figure 2, an occurrence of worker is counted toward the frequency of worker and person (for further details, see Resnik9).
By following the standard argumentation of information theory, the information content of a concept c is defined as
that is, as the probability of a concept increases, the informativeness decreases, therefore the more abstract a concept, the lower its information content. For instance, in our example, the probability of the concept person is greater than the probability of student, therefore the information content of person is less than that of student (in fact, in our KB student has the additional property college).
In line with Lin,6 we define the notion of information content similarity (ics) of two concepts c1, c2 as follows:
where c is the concept providing the maximum information content shared by c1 and c2 in the taxonomy, i.e., the more information the two concepts share, the more similar they are. Note that c is the upper bound of c1, c2 in the taxonomy whose information content is maximum, i.e., when defined, the least upper bound.
For instance, consider the concepts student and worker. The maximum information content shared by these concepts is provided by person, i.e., their least upper bound in the concept hierarchy. Therefore, their ics is the following:
Analogously, in the case of SSN and EIN, the following holds:
and, if we consider the concepts student and railcar, the ics is null since
Note that, given a concept, a property, or a type c, ics(c, c) = 1, whereas in the case of two types such as, for instance, integer and string, ics(string, integer) = 0, since we assume that the maximum information content shared by integer and string in the concept taxonomy is Top. Furthermore, given two concepts c1, c2 that are synonyms (for instance, according to WordNet), ics(c1, c2) = 1 (we assume they label the same node in the hierarchy).
Concept structures are compared according to a method inspired by the maximum weighted matching problem in bipartite graphs that has been revisited in Formica.4 In essence, given two concept feature vectors, we have to identify one or more sets of pairs of typed properties that maximize the sum of the products of the ics of the properties and the ics of the related types. Note that such a sum is evaluated within all the sets of pairs of typed properties such that in the set there are no two pairs sharing an element. For instance, given two concepts, assume that their sets of typed properties represent a set of boys and a set of girls, respectively, a selected set of pairs is a possible set of marriages, when polygamy is not allowed.
In our example, consider the concepts student and worker defined above. As shown in Figure 3(a), two sets of pairs can be defined such that the above mentioned sum is maximal. These sets are obtained by pairing name with name, since they are both of type string, SSN with SSN, both typed again with string, whereas they differ for the third pair that, in one set, is defined by college and EIN and, in the other set, by college and salary, since in both cases their ics are null (in fact, according to the taxonomy of Figure 2, the least upper bound between college and EIN is Top, and the same holds in the case of college and salary).
Note that, as mentioned before, the maximal sum is computed by multiplying the ics of the properties with the ics of the related types. For this reason, the role of typing is fundamental. Suppose for instance that worker has the property SSN of type integer, rather than of type string. In this case, the two sets of pairs with maximal sum are obtained by pairing name with name as above, whereas SSN of student is paired with EIN of worker (we have seen that their ics is equal to 0.67 and their types are both string), and college is paired with SSN, in one set, and with salary, in the other set (since in both cases they have null ics).
Once identified one set of pairs of typed properties such that the sum of the ics is maximal, the sum is normalized. For instance, in our example, it is divided by 4 (that is the highest between the cardinalities of the sets of properties of student and worker).
Therefore, we have the notion of feature vector similarity (fvs). In particular, if we consider student and worker, the following holds:
Let us now compare student and railcar. As shown in Figure 3(b), name is paired with name; SSN is paired with VIN since according to the taxonomy of Figure 2 their ics is equal to 0.67 (it is computed similarly to the pair SSN, EIN shown before) and, as in the previous case, college can be paired with maker or owner indifferently. Therefore, in this case:
where 4 is the cardinality of the set of properties of railcar, that is greater than that of student.
The ics and fvs are combined to obtain the notion of concept similarity (Sim). Given two concepts c1, c2, Sim(c1, c2) is essentially given by a weighted average between ics and fvs, as defined below:
where w is a weight that is established by the domain expert according to the cases, such that 0 w 1. For instance, in our example, by assuming w = 1/2, we have
Regarding the other proposals, we have to distinguish the approaches aiming at evaluating semantic similarity of (i) hierarchically organized concepts (hierarchy-based approaches) and (ii) concept structures (feature-based approaches).
In the case of hierarchy-based approaches, we have already mentioned that the drawback of the traditional edge-counting approach has been overcome by the information content approach proposed by Resnik,9 further refined by Lin.6 In particular, it is important to note that by following the Resnik's approach, concept similarity is given by the maximum information content shared by the comparing concepts, whereas the Lin's approach takes also into account the information contents of the comparing concepts. Other approaches have been proposed in the literature as, for instance, that of Wu and Palmer.11 Here the Lin's proposal has been selected since it is widely acknowledged in the literature that it shows a higher correlation with human judgment with respect to other proposals, including that of Resnik.5
Regarding feature-based approaches, most of the contributions defined in the literature adopt the Dice's function.7 Essentially, with respect to the fvs proposed here, Dice's function allows concept similarity to be computed without explicitly considering the similarity degree of properties (that in our case is represented by the ics). In particular, given two concepts c1, c2, each described by a feature vector, say F(c1), F (c2), respectively, their similarity (Dice(c1, c2)) is defined as follows:
and Aff is the set of pairs of concepts showing affinity such that each feature in F(c1) and F(c2) can participate in one affinity pair at most (in the case a feature participates in more than one pair, a pair with maximal affinity value is selected and the remaining pairs are discarded), and |A(c1, c2)|, |F(c1)|, and |F(c2)| are the cardinalities of the sets A(c1, c2), F(c1), and F(c2), respectively.
Consider, for instance, student and railcar. The number of pairs of concepts (properties) showing affinity is 2, and in particular they are (name, name) and (SSN, VIN). Therefore, we have
It is important to note that, according to Dice's approach, the affinity of the pair (name, name) is equivalent to the affinity of the pair (SSN, VIN). This does not hold in the case of the fvs proposed in this article since, as shown before, it is defined on the basis of the ics of the pairs of properties, i.e., ics(name, name) = 1, and ics(SSN, VIN) = 0.67.
It is worth mentioning that, within feature-based approaches, vector-based measures have been defined, such as the cosine or the Jaccard measures.1 These are mainly used in Information Systems to deal with documents, and are based on vector space models for which semantic similarity of documents is represented by proximity in a vector space. Furthermore, it is worth recalling that in Ziegler,12 the SOQA-SimPack Toolkit has been presented, aiming at evaluating similarities within a given ontology, and between concepts of different ontologies. In particular, it provides a generic and extensible library of ontological similarity measures in order to capture various notions of similarity.
In Table 1, the similarity scores of some of the pairs of the concepts of our running example have been given according to the approaches of Lin, Dice, and Sim. It is possible to see how Sim is a combination of the Lin's approach that has been conceived to evaluate semantic similarity of hierarchically organized concepts, without addressing their structures, and the Dice's function that focuses on concept feature vectors, ignoring the taxonomic information. For instance, consider again the similarity between student and railcar. It is null according to Lin, since the maximum information content shared by the concepts in the taxonomy is Top, whereas, according to Dice, student and railcar show a discrete affinity (0.57) due to their structures. By taking into account both the taxonomy and the structure, on the basis of the Sim measure, we obtain an approximately average value, i.e., Sim(student, railcar) = 0.21.
In order to evaluate the Sim measure, some considerations have to be done, starting with the problem of "ideal values." In general, ideal values are established according to human judgment and, in the literature, often the similarity scores assigned by human subjects in the Miller and Charles experiments are addressed (where 28 selected pairs of concepts have been analyzed and one scoreon a scale of 0 to 4has been given for each pair).8 By using the Miller and Charles scores, it has already been shown in Lin,6 and also by other authors,5 that Lin's method shows a higher correlation with human judgment than other hierarchy-based approaches, e.g., Resnik9 and Wu and Palmer.11 However, in different contexts, such as conceptual schema analysis, information sharing from multiple heterogeneous sources, intelligent information integration, and, more recently, within Web service discovery, the feature-based approach is preferred which takes into account the heterogeneity of the concept features. In particular, the Dice's function recalled above is adopted, whose similarity scores, in most of cases, are significantly different from the ones obtained according to Lin, as shown in Table 1.b The Sim measure has been proposed in order to reduce the gap existing between the two approaches, leaving maximum flexibility to the domain expert in assigning specific weights.
6. Lin, D. An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning (ICML) (Madison, Wisconsin, USA, July 2427, 1998), Morgan Kaufmann, 296304.
9. Resnik, P. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI) (Montréal, Québec, Canada, August 2025, 1995), Morgan Kaufmann, 448453.
11. Wu, Z., Palmer, M.S. Verb semantics and lexical selection. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL) (Las Cruces, New Mexico, June 2730, 1994), Morgan Kaufmann, 133138.
12. Ziegler, P., Kiefer, C., Sturm, C., Dittrich, K.R., Bernstein, A. Detecting similarities in ontologies with the SOQA-SimPack toolkit. In Proceedings of the International Conference on Extending Database Technology (EDBT) (Munich, Germany, LNCS 3896, March 2631, 2006), Springer, 5976.
a. Typeless data models are also noteworthy in order to evaluate this proposal. In this paper, typing has been addressed to provide a method that is more easily adaptable to Semantic Web languages, as for instance XML-Schemas.
b. Note that the simple data set shown in Table 1 does not want to provide any evaluation of the method. It has been defined to show an example about the gap existing between the hierarchy- and feature-based approaches.
©2009 ACM 0001-0782/09/0300 $5.00
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2009 ACM, Inc.
No entries found