Evaluating the semantic similarity of concepts is a problem that has been extensively investigated in the literature in different areas, such as Artificial Intelligence, Cognitive Science, Databases, and Software Engineering. It is currently growing in importance in different settings, such as digital libraries, heterogeneous databases, and, in particular, the Semantic Web. In such contexts, concepts are very often organized according to a taxonomy (or a hierarchy) and, in addition, are associated with structures (also referred to as *feature vectors*). In this regard, the concept similarity measures proposed in the literature have generally not been conceived to address both these levels of information (taxonomy and structure): there exist contributions focusing on the similarity of hierarchically related concepts, and other proposals conceived to compare concept feature vectors. In this article, a method for evaluating the similarity of concepts is presented, where both the concept taxonomy and the concept structures are considered. In particular, the method has been defined by combining and revisiting (i) the *information content* approach,^{9} with regard to the comparison of concepts within the taxonomy, and (ii) a method inspired by the *maximum weighted matching* problem in bipartite graphs,^{4} with regard to the comparison of feature vectors. The proposed approach is then compared with two of the most representative similarity measures defined in the literature, and a small data set shows how the proposed measure allows us to reduce the gap existing between them.

### 1. The Knowledge Base

Let us consider a very general knowledge base (KB), essentially defined by a set of concepts that are organized according to a generalization (*ISA*) hierarchy, where each concept may be associated with a structure, or a feature vector, containing the properties describing the concept. Note that the proposed method can be suitably refined to deal with specific KBs, such as Object-Oriented KBs, Geographical KBs, XML-Schemas, GML-schemas, etc. (see, for instance, Formica^{3}). In this article, a very simple and abstract data model is adopted in order to present the essence of the method, without dealing with aspects that pertain to specific contexts.

A concept has a *name*, an enumerated set of *super-concepts* (taxonomic information), and a tuple of *typed properties*^{a} (structural information). For instance, consider the following set of concepts:

where *SSN, EIN*, and *VIN* stand for *Social Security Number, Employer Identification Number*, and *Vehicle Identification Number*, respectively. Therefore, a concept has a left-hand side, defined by the name of the concept, and a right-hand side containing the hierarchical and/or structural information. For instance, in the case of the concept of name *person*, only structural information is present on the right-hand side, i.e., two typed properties, namely *name* and *SSN*, both of type *string*. In the case of *student*, in addition to the structural information (the property *college* of type *string*), we also have taxonomic information, expressed by the *ISA* construct. In particular, *student* has a super-concept, namely *person*, whose typed properties will be *inherited*. *Inheritance* is a well-known problem that has been extensively investigated in the literature. In the case of the set of concepts above, after inheritance, the following holds:

where all the *ISA* constructs have been removed, and concepts are expressed in terms of their typed properties only. Note that properties can be typed with atomic types (such as *string* or *integer*) or with concept names, as in the case of the property *owner* of *railcar*, which is typed with *person*.

In order to keep track, after inheritance, of the *ISA* relations defined in the original KB, the previous set of concepts is coupled with the *concept hierarchy* of Figure 1. Such a hierarchy contains, for instance, an arc between *student* and *person*, since *person* was a super-concept of *student*. Note that it also includes properties, such as *college* or *salary*, that are concepts as well, i.e., their names can be associated with possibly empty sets of super-concepts and typed properties. Furthermore, suppose it also contains arcs among *VIN, SSN, EIN*, and *id_number* (*Identification number*). The root of the concept hierarchy is labeled by *Top*, which represents the most general concept. Dotted lines stand for paths of arbitrary length.

### 2. Information Content Similarity

Traditionally, in order to evaluate the semantic similarity of hierarchically related concepts, the edge-counting approach is adopted. This approach is essentially based on the distance between nodes of the taxonomy, i.e., the shorter the path, the more similar the concepts these nodes represent. However, the main drawback of this approach is that links in the taxonomy do not represent uniform distances. In fact, the level of refinement between, for instance, the concepts *safety valve* and *valve* is not comparable to the level of refinement between, for instance, *knitting machine* and *machine*.^{9} For this reason, a different approach, referred to as the *information content* approach, has been proposed in the mentioned paper, and subsequently refined,^{6} which does not depend on path lengths. It is based on the association of probabilities with the concepts of the hierarchy. In particular, the probability of a concept *c, p*(*c*), is defined as
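
*p*(*c*) = *freq*(*c*)/*M*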

where *freq*(*c*) is the *frequency* of the concept *c* estimated using noun frequencies from large text corpora, such as the *Brown Corpus of American English*, and *M* is the total number of observed instances of nouns in the corpus. In our example, probabilities have been assigned according to the *SemCor* project,^{2} which maps subsections of the *Brown Corpus* to senses in the *WordNet* lexicon.^{10} According to *SemCor*, the total number of observed instances of nouns is 88,312. Below, the definitions of some of the previous concepts and the related frequencies (the numbers in parentheses) are given:

- *person*—a human being (7229)
- *student*—a learner who is enrolled in an educational institution (68)
- *worker*—a person who works at a specific occupation (289)
- *machine*—any mechanical or electrical device that transmits or modifies energy to perform or assist in the performance of human tasks (600)
- *railcar*—a wheeled vehicle adapted to the rails of a railroad (24)
- *identification number* (*id_number*)—a numeral or string of numerals that is used for identification (45)
- *VIN*—Vehicle Identification Number (1)
- *SSN*—Social Security Number (1)
- *EIN*—Employer Identification Number (1)
- *salary*—… (10)

Once probabilities have been associated with concepts, we obtain the weighted concept hierarchy of Figure 2, where *Top*, the most general concept, has probability 1. Note that each concept that occurs in the corpus is counted as an occurrence of each more abstract concept up in the *ISA* hierarchy. For instance, in Figure 2, an occurrence of *worker* is counted toward the frequency of *worker* and *person* (for further details, see Resnik^{9}).
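
For instance, with the frequencies listed above, the count for *person* includes, in addition to its own 7229 occurrences, the 68 occurrences of *student* and the 289 occurrences of *worker*.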

By following the standard argumentation of information theory, the information content of a concept *c* is defined as
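
−log *p*(*c*)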

that is, as the probability of a concept increases, its informativeness decreases; therefore, the more abstract a concept, the lower its information content. For instance, in our example, the probability of the concept *person* is greater than the probability of *student*, and therefore the information content of *person* is lower than that of *student* (in fact, in our KB, *student* has the additional property *college*).

In line with Lin,^{6} we define the notion of *information content similarity* (*ics*) of two concepts *c*_{1}, *c*_{2} as follows:
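
*ics*(*c*_{1}, *c*_{2}) = 2 log *p*(*c*) / (log *p*(*c*_{1}) + log *p*(*c*_{2}))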

where *c* is the concept providing the maximum information content shared by *c*_{1} and *c*_{2} in the taxonomy, i.e., the more information the two concepts share, the more similar they are. Note that *c* is the upper bound of *c*_{1}, *c*_{2} in the taxonomy whose information content is maximum, i.e., when defined, the least upper bound.

For instance, consider the concepts *student* and *worker*. The maximum information content shared by these concepts is provided by *person*, i.e., their least upper bound in the concept hierarchy. Therefore, their *ics* is the following:
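
*ics*(*student*, *worker*) = 2 log *p*(*person*) / (log *p*(*student*) + log *p*(*worker*))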

Analogously, in the case of *SSN* and *EIN*, the following holds:
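
*ics*(*SSN*, *EIN*) = 2 log *p*(*id_number*) / (log *p*(*SSN*) + log *p*(*EIN*)) = 0.67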

and, if we consider the concepts *student* and *railcar*, the *ics* is null since
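
2 log *p*(*Top*) / (log *p*(*student*) + log *p*(*railcar*)) = 0,

given that *Top*, the only upper bound shared by *student* and *railcar* in the hierarchy, has probability 1, and log 1 = 0.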

Note that, given a concept, a property, or a type *c, ics*(*c, c*) = 1, whereas in the case of two types such as, for instance, *integer* and *string, ics*(*string, integer*) = 0, since we assume that the maximum information content shared by *integer* and *string* in the concept taxonomy is *Top*. Furthermore, given two concepts *c*_{1}, *c*_{2} that are synonyms (for instance, according to *WordNet*), *ics*(*c*_{1}, *c*_{2}) = 1 (we assume they label the same node in the hierarchy).
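
As an illustration, the *ics* computation can be sketched in a few lines of Python. The fragment of the hierarchy encoded below is an assumption that mirrors Figure 2 (which is not reproduced here), while the frequencies are those listed above; the resulting scores may therefore differ marginally from the values used in the running example.

```python
import math

# Assumed fragment of the concept hierarchy of Figure 2 (maps a concept to
# its super-concepts); the frequencies are the SemCor counts listed above.
PARENTS = {
    "person": ["Top"], "student": ["person"], "worker": ["person"],
    "machine": ["Top"], "railcar": ["Top"], "id_number": ["Top"],
    "VIN": ["id_number"], "SSN": ["id_number"], "EIN": ["id_number"],
    "Top": [],
}
FREQ = {"person": 7229, "student": 68, "worker": 289, "machine": 600,
        "railcar": 24, "id_number": 45, "VIN": 1, "SSN": 1, "EIN": 1}
M = 88_312  # total number of observed noun instances in SemCor

def ancestors(c):
    """Concepts subsuming c in the hierarchy, including c itself."""
    seen, stack = set(), [c]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(PARENTS.get(n, []))
    return seen

def p(c):
    """p(c): occurrences of c and of every concept below it, divided by M."""
    if c == "Top":
        return 1.0
    return sum(f for d, f in FREQ.items() if c in ancestors(d)) / M

def ics(c1, c2):
    """Lin's information content similarity of two hierarchy concepts."""
    if c1 == c2:
        return 1.0
    lub = min(ancestors(c1) & ancestors(c2), key=p)  # most informative shared upper bound
    if p(lub) == 1.0:  # the concepts share only Top
        return 0.0
    return 2 * math.log(p(lub)) / (math.log(p(c1)) + math.log(p(c2)))

print(f"ics(SSN, EIN)         = {ics('SSN', 'EIN'):.2f}")        # close to the 0.67 above
print(f"ics(student, worker)  = {ics('student', 'worker'):.2f}")
print(f"ics(student, railcar) = {ics('student', 'railcar'):.2f}")  # 0.00
```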

### 3. Feature Vector Similarity

Concept structures are compared according to a method inspired by the *maximum weighted matching* problem in bipartite graphs, which has been revisited in Formica.^{4} In essence, given two concept feature vectors, we have to identify one or more sets of pairs of typed properties that maximize the sum of the products of the *ics* of the properties and the *ics* of the related types. Note that such a sum is evaluated over all the sets of pairs of typed properties in which no two pairs share an element. For instance, given two concepts, assume that their sets of typed properties represent a set of boys and a set of girls, respectively; then a selected set of pairs is a possible set of marriages, where polygamy is not allowed.

In our example, consider the concepts *student* and *worker* defined above. As shown in Figure 3(a), two sets of pairs can be defined such that the above-mentioned sum is maximal. These sets are obtained by pairing *name* with *name*, since they are both of type *string*, and *SSN* with *SSN*, both typed again with *string*, whereas they differ in the third pair, which, in one set, is defined by *college* and *EIN* and, in the other set, by *college* and *salary*, since in both cases their *ics* is null (in fact, according to the taxonomy of Figure 2, the least upper bound of *college* and *EIN* is *Top*, and the same holds in the case of *college* and *salary*).

Note that, as mentioned before, the maximal sum is computed by multiplying the *ics* of the properties by the *ics* of the related types. For this reason, the role of typing is fundamental. Suppose, for instance, that *worker* has the property *SSN* of type *integer*, rather than of type *string*. In this case, the two sets of pairs with maximal sum are obtained by pairing *name* with *name* as above, whereas *SSN* of *student* is paired with *EIN* of *worker* (we have seen that their *ics* is equal to 0.67 and their types are both *string*), and *college* is paired with *SSN*, in one set, and with *salary*, in the other set (since in both cases they have null *ics*).

Once a set of pairs of typed properties maximizing the sum of the *ics* has been identified, the sum is normalized. For instance, in our example, it is divided by 4 (which is the greater of the cardinalities of the sets of properties of *student* and *worker*).

Therefore, we have the notion of *feature vector similarity* (*fvs*). In particular, if we consider *student* and *worker*, the following holds:
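
*fvs*(*student*, *worker*) = (1 + 1 + 0)/4 = 0.5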

Let us now compare *student* and *railcar*. As shown in Figure 3(b), *name* is paired with *name*; *SSN* is paired with *VIN*, since according to the taxonomy of Figure 2 their *ics* is equal to 0.67 (it is computed similarly to the pair *SSN, EIN* shown before); and, as in the previous case, *college* can be paired with either *maker* or *owner* indifferently. Therefore, in this case:
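
*fvs*(*student*, *railcar*) = (1 + 0.67 + 0)/4 ≈ 0.42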

where 4 is the cardinality of the set of properties of *railcar*, which is greater than that of *student*.
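
A brute-force sketch of this computation follows; for feature vectors of this size, enumerating the possible pairings is sufficient, whereas for larger vectors a maximum weighted matching algorithm would be used. The sketch reuses the *ics* function and the *PARENTS* hierarchy of the previous fragment, and the typed properties of *worker* and *railcar* that are not stated explicitly in the text (for instance, the types of *maker* and *salary*) are assumptions.

```python
from itertools import permutations

def sim_term(a, b):
    """ics extended to property names and atomic types (string, integer, ...)."""
    if a == b:
        return 1.0
    if a in PARENTS and b in PARENTS:
        return ics(a, b)  # both terms appear in the concept hierarchy
    return 0.0            # distinct terms sharing only Top

def fvs(props1, props2):
    """Feature vector similarity: best one-to-one pairing of typed properties,
    scored by ics(property) * ics(type) and normalized by the larger vector."""
    small, large = sorted((props1, props2), key=len)
    best = max(
        sum(sim_term(p, q) * sim_term(t, u)
            for (p, t), (q, u) in zip(small, chosen))
        for chosen in permutations(large, len(small))
    )
    return best / max(len(props1), len(props2))

student = [("name", "string"), ("SSN", "string"), ("college", "string")]
worker  = [("name", "string"), ("SSN", "string"),
           ("EIN", "string"), ("salary", "integer")]
railcar = [("name", "string"), ("VIN", "string"),
           ("maker", "string"), ("owner", "person")]

print(f"fvs(student, worker)  = {fvs(student, worker):.2f}")   # 0.50
print(f"fvs(student, railcar) = {fvs(student, railcar):.2f}")  # close to the 0.42 above
```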

### 4. Concept Similarity and Related Work

The *ics* and the *fvs* are combined to obtain the notion of *concept similarity* (*Sim*). Given two concepts *c*_{1}, *c*_{2}, *Sim*(*c*_{1}, *c*_{2}) is essentially given by a weighted average of *ics* and *fvs*, as defined below:
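
*Sim*(*c*_{1}, *c*_{2}) = *w* · *ics*(*c*_{1}, *c*_{2}) + (1 − *w*) · *fvs*(*c*_{1}, *c*_{2})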

where *w*, 0 ≤ *w* ≤ 1, is a weight established by the domain expert depending on the application at hand. For instance, in our example, by assuming *w* = 1/2, we have
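
*Sim*(*student*, *worker*) = 1/2 · *ics*(*student*, *worker*) + 1/2 · 0.5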

and
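
*Sim*(*student*, *railcar*) = 1/2 · 0 + 1/2 · 0.42 ≈ 0.21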

Regarding related proposals, we have to distinguish between the approaches aimed at evaluating the semantic similarity of (i) hierarchically organized concepts (*hierarchy-based* approaches) and (ii) concept structures (*feature-based* approaches).

In the case of hierarchy-based approaches, we have already mentioned that the drawback of the traditional edge-counting approach has been overcome by the information content approach proposed by Resnik,^{9} further refined by Lin.^{6} In particular, it is important to note that, following *Resnik*'s approach, concept similarity is given by the maximum information content shared by the concepts being compared, whereas *Lin*'s approach also takes into account the information contents of the concepts themselves. Other approaches have been proposed in the literature, for instance, that of Wu and Palmer.^{11} Here *Lin*'s proposal has been selected since it is widely acknowledged in the literature that it shows a higher correlation with human judgment than other proposals, including that of Resnik.^{5}

Regarding feature-based approaches, most of the contributions in the literature adopt *Dice*'s function.^{7} Essentially, with respect to the *fvs* proposed here, *Dice*'s function allows concept similarity to be computed without explicitly considering the similarity degree of the properties (which in our case is represented by the *ics*). In particular, given two concepts *c*_{1}, *c*_{2}, each described by a feature vector, say *F*(*c*_{1}), *F*(*c*_{2}), respectively, their similarity (*Dice*(*c*_{1}, *c*_{2})) is defined as follows:
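
*Dice*(*c*_{1}, *c*_{2}) = 2 · |*A*(*c*_{1}, *c*_{2})| / (|*F*(*c*_{1})| + |*F*(*c*_{2})|)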

where
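
*A*(*c*_{1}, *c*_{2}) = {(*a*, *b*) ∈ *F*(*c*_{1}) × *F*(*c*_{2}) | (*a*, *b*) ∈ *Aff*}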

and *Aff* is the set of pairs of concepts showing affinity, such that each feature in *F*(*c*_{1}) and *F*(*c*_{2}) can participate in one affinity pair at most (in case a feature participates in more than one pair, a pair with maximal affinity value is selected and the remaining pairs are discarded), and |*A*(*c*_{1}, *c*_{2})|, |*F*(*c*_{1})|, and |*F*(*c*_{2})| are the cardinalities of the sets *A*(*c*_{1}, *c*_{2}), *F*(*c*_{1}), and *F*(*c*_{2}), respectively.

Consider, for instance, *student* and *railcar*. The number of pairs of concepts (properties) showing affinity is 2, and in particular they are (*name, name*) and (*SSN, VIN*). Therefore, we have
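
*Dice*(*student*, *railcar*) = (2 · 2) / (3 + 4) ≈ 0.57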

It is important to note that, according to *Dice*'s approach, the affinity of the pair (*name, name*) is equivalent to the affinity of the pair (*SSN, VIN*). This does not hold in the case of the *fvs* proposed in this article since, as shown before, it is defined on the basis of the *ics* of the pairs of properties, i.e., *ics*(*name, name*) = 1, and *ics*(*SSN, VIN*) = 0.67.

It is worth mentioning that, within feature-based approaches, vector-based measures have also been defined, such as the *cosine* or the *Jaccard* measures.^{1} These are mainly used in Information Systems to deal with documents, and are based on vector space models in which the semantic similarity of documents is represented by proximity in a vector space. Furthermore, it is worth recalling that Ziegler^{12} presented the *SOQA-SimPack Toolkit*, which aims at evaluating similarities within a given ontology and between concepts of different ontologies. In particular, it provides a generic and extensible library of ontological similarity measures in order to capture various notions of similarity.

### 5. Evaluation of the Method

In Table 1, the similarity scores of some of the pairs of concepts of our running example are given according to the approaches of *Lin, Dice*, and *Sim*. It is possible to see how *Sim* combines *Lin*'s approach, which has been conceived to evaluate the semantic similarity of hierarchically organized concepts without addressing their structures, and *Dice*'s function, which focuses on concept feature vectors while ignoring the taxonomic information. For instance, consider again the similarity between *student* and *railcar*. It is null according to *Lin*, since the maximum information content shared by the concepts in the taxonomy is *Top*, whereas, according to *Dice*, *student* and *railcar* show a fair degree of affinity (0.57) due to their structures. By taking into account both the taxonomy and the structure, on the basis of the *Sim* measure, we obtain an approximately intermediate value, i.e., *Sim*(*student, railcar*) = 0.21.

In order to evaluate the *Sim* measure, some considerations are in order, starting with the problem of “ideal values.” In general, ideal values are established according to human judgment and, in the literature, the similarity scores assigned by human subjects in the Miller and Charles experiments are often used as a reference (28 selected pairs of concepts were analyzed, and one score, on a scale of 0 to 4, was given for each pair).^{8} By using the Miller and Charles scores, it has already been shown by Lin,^{6} and also by other authors,^{5} that *Lin*'s method shows a higher correlation with human judgment than other hierarchy-based approaches, e.g., those of Resnik^{9} and Wu and Palmer.^{11} However, in different contexts, such as conceptual schema analysis, information sharing from multiple heterogeneous sources, intelligent information integration, and, more recently, Web service discovery, the feature-based approach, which takes into account the heterogeneity of the concept features, is preferred. In particular, *Dice*'s function recalled above is adopted, whose similarity scores, in most cases, are significantly different from the ones obtained according to *Lin*, as shown in Table 1.^{b} The *Sim* measure has been proposed in order to reduce the gap existing between the two approaches, leaving maximum flexibility to the domain expert in assigning specific weights.
