Databases publish data. This is undoubtedly the case for scientific and statistical databases, which have largely replaced traditional reference works. Database and Web technologies have led to an explosion in the number of databases that support scientific research, for obvious reasons: Databases provide faster communication of knowledge, hold larger volumes of data, are more easily searched, and are both human- and machine-readable. Moreover, they can be developed rapidly and collaboratively by a mixture of researchers and curators. For example, more than 1,500 curated databases are relevant to molecular biology alone.10 The value of these databases lies not only in the data they present but also in how they organize that data.
In the case of an author or journal, most bibliometric measures are obtained from citations to an associated set of publications. There are typically many ways of decomposing a database into publications, so we might use its organization to guide our choice of decompositions. We will show that when the database has a hierarchical structure, there is a natural extension of the h-index that works on this hierarchy.
Although the main focus of this article is on evaluating the h-indexes of some well-known databases, this was not its original motivation. One of the authors was involved in a project4 to automatically generate a set of conventional papers to give credit—as authors—to the 1,000 or so researchers who had contributed to a database.8 By creating one publication for the whole database—as might happen with data papers that periodically publish data summaries and are citation proxies for databases6—we generate a document with an unhelpfully huge number of authors. Also, these authors receive only one additional publication credit, regardless of whether they contributed to numerous sections of the database or just a single part. On the other hand, generating thousands of documents, one for each “object” in the database, is almost equally useless, for while the authors are now associated with their areas of expertise, they are unwittingly guilty of having minimal publishable units. The tension between those two extremes underlies the rationale for the h-index.9 Therefore, the obvious question is: Can we extend the h-index to work on databases and—given that there is a hierarchical structure—is there a natural extension of the h-index to hierarchies? We then have at least one non-trivial measure of the importance of a database. However, even if this measure is not of interest, the decomposition that produced it may well be, perhaps as a starting point for the curators, who might want to find a useful set of publications to associate with the database for the benefit of those who would like to have their work properly cited.
Hierarchies are used frequently in curated databases. As in the three database examples we discuss in this article,8,13,15 they are based on some hierarchical classification scheme, taxonomy, or ontology. Moreover, datasets based on a file system or data format such as JSON or XML have an intrinsic hierarchical structure (we will refer to all of these as databases). We also note that when two very similar papers are published (for example, a preprint and its final, peer-reviewed version), citation analyzers such as Google Scholar have the option to transfer citations from one paper to the other (the parent). Given that a paper can have at most one parent and that the parent relationship is acyclic, there is at least a partial hierarchy (a set of tree-like structures) already present in the data structure maintained by the software.
Given a hierarchy, how do we use it to limit the possible decompositions into sets of publications? Consider the hierarchy in the DrugBank database, shown in Figure 1. In this database, citations are only to the terminal nodes or leaves of the tree. If we want a citation count for some higher-level node, we would use the sum of the citation counts of the leaves below that node, much as we would take the citation count for an issue of a journal to be the sum of the citation counts of the papers within it. We propose using a subset of these nodes as a possible decomposition; however, we cannot use an arbitrary subset because it allows double-counting of citations if we allow a node and one of its ancestors to appear in the same decomposition. For example, we could not allow the Lactones node (node g
at the “classes” level) and Lovastatin (terminal node n
) as a candidate set for the computation of an h-index. To do so would be double-counting citations to Lovastatin. Thus, we restrict our attention to antichains of nodes, sets of nodes in which no node is an ancestor of another node in that set. Looking at the Linnean-style stratification in Figure 1, the kingdom, superclass, and class nodes each constitute antichains, as do the drug nodes. The subclass nodes can have other subclass nodes as ancestors and thus are not antichains.
The h-index, which is used almost universally to measure an author’s output and often the importance of a journal, is defined as follows:
Given a set of publications, its h-index is the largest number for which there is a subset of of size in which each publication has at least citations.
The extension to hierarchies is immediate:
Given a hierarchy of publications, its h-index is the largest number for which there is an antichain in of size in which each publication has at least citations.
We will formalize this in the following section, but before doing that we should make some important observations. First, when the hierarchy is “flat” (there is no ordering), the two definitions coincide. Second, although a set has an exponential number of subsets, its h-index can be efficiently computed by sorting the set. Similarly, a hierarchy can have an exponentially large number of antichains, and we will demonstrate that, by using the appropriate data structures, the h-index of a hierarchy can be evaluated with the same efficiency. This is crucial if we are going to apply it to large databases.
Third, using a hierarchy to constrain the possible decompositions is essential. For example, based on a reduction from the partition problem, one can show that the more general problem of finding the maximal h-index under an arbitrary partition is strongly NP-hard,7 and there are algorithms and complexity research on improving and maximizing the citation indices (for example, h-index, g-index, and i10-index) under different merging rules.7,12,14
Finally, much of this research can be seen as an attempt to “game” the h-index—finding some merging strategy that enhances one’s h-index. If, for a hierarchical organization, we equivalently define the h-index of a hierarchy as the maximal h-index of any antichain within the hierarchy, then it might seem as though we too are engaged in gaming. This is not the case; we are merely employing what appears to be the most natural extension of the h-index definition to hierarchies.
In the following sections, we first formalize the problem and develop some simple results on antichains that allow the maximal h-index to be computed efficiently. Then we provide three examples of databases from which we can extract an h-index and look at some of the details of their hierarchical organization. We conclude with some speculation on how these results might be further developed.
There are some major caveats. The first is that we are not going to comment on the justification or fairness of the h-index and its variants;2,17 we only observe that it is almost universally adopted in measuring the impact of both authors and journals, and—as we have noted above—it may be useful in helping decompose a database into citable units.
Second, the results we present here are preliminary. As we have already stated, the practice of data citation is still in its infancy; people still fail to cite databases or cite them improperly. For comparison, we have included the results of using URLs as citations.
Third, we collected the relevant data in the second half of 2022, prior to submitting this article in October of that year. The unfortunate delay in publication should be taken into account should one want to repeat these experiments. Hence, the results we give should in no way be taken as an accurate or repeatable measure of the importance of the databases we examine. They simply demonstrate appropriate techniques and the feasibility of measuring the h-index at scale.
The h-Index of a Hierarchy
Hierarchies and antichains. Our starting point is the assumption that we can model a database as a hierarchy together with a citation count for each node in the hierarchy. As described in the introduction, to avoid double counting citations, we want to compute an h-index on the antichains in that hierarchy. To illustrate this, Figure 2 shows a hierarchy in which the root constitutes an antichain, as does and also the leaves. In this diagram, the level of a node is determined by its citation count, or rank. Even though the number of antichains in a hierarchy can be exponentially large, to compute the h-index, we can use the rank to cut down the number of antichains we need to examine. To see this, we need a small amount of formal development.
A hierarchy is a partial order such that for all , and implies that or .
This definition of hierarchy ensures that a node can have at most one parent. We note that is the parent of a node if it is the smallest node , distinct from , such that .
A set is called antichain if, for any distinct pair , neither nor .
A ranked hierarchy is a hierarchy together with a rank (or level) function , such that for all , if then .
Given a subset of , we will use the term minimal elements of to refer to those nodes for which there is no , distinct from , such that . That is, the minimal elements of are the lowest nodes in the hierarchy belonging to .
Given a ranked hierarchy , an -antichain in H is an antichain such that for all :
,
if , then .
In other words, an -antichain is a set in which each node has a rank of at least and all its children have a rank of less than . In Figure 2, the set is a 7-antichain and the leaves form a 1-antichain. Note that a leaf (a node with no children) with a rank of at least could be a member of an -antichain.
We use the term rank-minimal to describe the elements of a set that are minimal with respect to the rank function , that is, those elements for which is minimum for .
The following result guarantees that, given an antichain with minimal rank , it is possible to find an -antichain with no fewer elements.
Let be an antichain in a ranked hierarchy with a rank-minimal node of rank . Then, there exists an -antichain such that .
The proof, provided in the additional material, is straightforward; one constructs the antichain by a process that is similar to the top-down algorithm, which we describe shortly.
From this, we can obtain the main result. The h-index of an antichain in a ranked hierarchy is defined as .
For any antichain in a ranked hierarchy, there is an -antichain such that .
Let be the h-index of . Then there is a subset for which for all . A rank-minimal element of S will have rank and Prop. 1 gives us an -antichain with at least as many elements as ; that is, has an h-index no less than .
The importance of this result is that we only have to search maximal -antichains to find the h-index of any antichain in . The number of such antichains is not greater than the number of nodes, which guarantees us at least a polynomial-time algorithm. In fact, by pursuing a top-down approach, we can obtain a very efficient algorithm that, once the hierarchy is constructed, will, in practice, only visit a subset of the nodes in the hierarchy.
A top-down algorithm. Consider the -antichains in Figure 2. At the root, we have a 29-antichain of length 1 and below it a 24-antichain of length 1. Proceeding downward, the first longer antichain is the 7-antichain of length 2. Further down, we find the 5-antichain of length 5. This is the -antichain of greatest h-index: It has 5 nodes, all with rank greater or equal to 5. As we proceed downward, the level is strictly decreasing while the length of the antichains is non-decreasing. When the latter overtakes the former, we have just “overshot” the antichain that yields the maximal h-index and we can stop. Note that we stopped before examining all the nodes in the hierarchy.
In the following, we use the notation for the minimal elements of a subset , that is, an antichain. is defined similarly. An -antichain can thus be defined as .
In order to compute successively lower antichains, the algorithm maintains, for decreasing values of , the following two antichains:
, that is, the -antichain of maximum cardinality among all -antichains;
, that is, the antichain of nodes that are maximal (with respect to ) in the set of nodes with rank less than .
These two antichains have the following properties:
the children of any element in are contained in ;
the next lowest rank of any node in the hierarchy is possessed by at least one node in .
Also, we observe that: (i) a node is in iff and for all children of , ; (ii) a node is in iff and the parent of is such that .
Algorithm 1 details the procedure used to find the antichain that guarantees the maximum h-index. On lines 3 and 13, the nodes with the highest rank (that is, citation count) are found in the . These nodes are removed from the DCHAIN (line 7) and added to the (line 8). The children of these nodes are then added to (lines 9 and 10). Then, in lines 11 and 12, all of the non-minimal nodes in are removed, making it a proper -antichain. The process continues until the cardinality of is bigger than the highest rank in . At this point, each node in has a rank . Thus, even if it were to be moved from to , there would not be an increase in the value of the h-index for . At this point, the algorithm can stop.
Algorithm discussion. Referring to Algorithm 1, the most obvious thing to do is to use heaps to implement and . The heap values for each node in DCHAIN should be the node rank, while the values for those in LCHAIN should be the maximal rank of the children of the node, which can be computed at line 9. With these representations, the “search” operations on lines 3, 6, 11, and 13 all incur unit cost, while the insert or delete operations on lines 7, 8, 10, and 12 are all ( in the length of the antichain. This gives a bound of in the hierarchy size for the whole algorithm.
A further observation is that at line 9, we need only consider child nodes for inclusion if their rank is greater than the current length of LCHAIN; if this is not the case, they cannot participate in the final antichain, and they together with their descendants in the hierarchy may be safely ignored. In describing the results for the various databases in the next section, we have included two figures: visited (vis.), the number of nodes whose rank was interrogated; and digested (dig.), the number of nodes that entered DCHAIN. The algorithm’s running time will be bounded by a constant in vis. and a constant times in dig. In the measurements for a whole hierarchy (not the truncated ones where we cut off the hierarchy at a certain level), the digested nodes accounted for less than of the total.
Another observation is that the algorithm requires that the set of nodes as input is a hierarchy, and thus that this hierarchy is built before this algorithm is run. Finally, we note that in the degenerate case of a hierarchy made of only leaves, the algorithm requires time to create the LCHAIN heap.
Experimental Analysis
Using this algorithm, we compute and analyze the computation of the h-index of three widely used databases: Drugbank, which we describe in the first section; the IUPHAR/BPS Guide to Pharmacology8 (GtoPdb), which is an open-access database on the biological targets of drugs and other molecules; and the taxonomic database curated by the National Center for Biotechnology Information (NCBI), which maintains one of the most comprehensive and best-organized collections of medical data.11 The additional material online details all of the information about these databases and how we collected the corresponding citation counts.
Drugbank. The results of the analysis on Drugbank are shown in Table 1, where median and maximum refer to the citation counts of the antichain that yields the given h-index (the median is sometimes given in reporting the h-index of a journal). The visited (vis.) and digested (dig.) columns refer to the discussion about Algorithm 1. Only a few nodes are leaves in the antichain for the entire hierarchy. The Leaves/drugs row shows the h-index for the “flat” hierarchy of the drugs themselves, that is, the h-index obtained considering only the drugs as a set of “publications” that receive citations, without considering the hierarchy of the database. The Class and Subclass rows show the results for the hierarchies obtained by removing all nodes below those strata. The table clearly shows the advantage to computing the h-index on a full hierarchy and that constraining the antichains to a given stratum reduces the h-index. We also see that the algorithm calculates the h-index visiting approximately one-third of the nodes in the entire hierarchy.
hierarchy | h-index | median | max | nodes | vis. | dig. |
---|---|---|---|---|---|---|
Full | 69 | 100 | 260 | 11,803 | 3,743 | 417 |
Leaves/drugs | 33 | 44 | 76 | 10,303 | – | – |
Subclass | 63 | 105 | 1,369 | 837 | 541 | 276 |
Class | 49 | 120 | 2,327 | 328 | 287 | 154 |
GtoPdb. The IUPHAR/BPS Guide to Pharmacology (GtoPdb) is an open-access relational database with an accompanying website on the biological targets of drugs and other molecules. As shown in Figure 3, GtoPdb has a hierarchical structure. It combines the expertise of approximately 1,000 researchers worldwide and was an early stimulus for data citation5 to give credit to the contributors. Here, however, we are using it to assess credit to the database as a whole.
#cit
) that could be also aggregated (#cit agg
) with the citations of the child nodes.A critical difference between Drugbank and GtoPdb is that in GtoPdb, there are citations to the intermediate nodes in the hierarchy. In fact, the data for these “family” nodes often resembles short conventional papers with descriptive material and statistics on the targets (leaf nodes) included in the family. This raises the possibility that we might want to treat a family node as independent of the nodes it describes; we introduce a lifting transformation that allows us to do this.
Table 2 presents the results for the GtoPdb database divided into the horizontal sections Citations and Links. The first section corresponds to results obtained with citations extracted from the citing papers, while the second part is obtained by accounting for Web links as citations.
Cit. type | Hierarchy | h-index | median | max | nodes | vis. | dig. |
---|---|---|---|---|---|---|---|
Citations | Full | 55 | 77 | 276 | 4,079 | 1,185 | 275 |
No | 32 | 44 | 94 | 4,079 | – | – | |
Families | 54 | 86 | 276 | 807 | 572 | 253 | |
Lifted | 55 | 77 | 276 | 4,872 | 1,185 | 275 | |
Links | Full | 167 | 251 | 880 | 4,079 | 1,584 | 419 |
No | 172 | 253 | 909 | 4,079 | – | – | |
Families | 135 | 298 | 3,013 | 807 | 588 | 211 | |
Lifted | 179 | 265 | 909 | 4,872 | 1,604 | 468 |
As with Drugbank, the first two rows of each section in the table show the analysis on the full hierarchy and on the “flat” collection of nodes, which now contains a substantial number of cited intermediate (family) nodes together with the leaf target nodes. The “families only” rows (Families) show that little is lost by removing the leaves from the hierarchy, suggesting that many citations are also given to the intermediate nodes in the hierarchy. The meaning of the “lifted hierarchy” rows (Lifted) is discussed below.
It is also interesting to note that families of nearly all heights—where by height of a node we mean the maximum length of a path to a leaf node—figure in the antichain that provides the h-index for the full hierarchy. The figures are: five at height 3, nine at height 2, four to leaves, and the remaining 37 to bottom-level families at height 1.
Lifting. For reasons mentioned in the additional material online, the Links section of Table 2 is based on highly skewed citation counts and the results are almost meaningless. We have included it in this section because it shows an interesting phenomenon. We see—perhaps surprisingly—that computing the h-index on the full hierarchy gives a lower figure than that for the set of nodes with no structure (167 vs. 172). It appears that creating a hierarchy on a set of publications, in this case the GtoPdb webpages, and summing the citation counts as we have proposed, does not necessarily increase the h-index with respect to the simple set of all the publications.
Now consider the hierarchy in Figure 4a. Each node is annotated on the left with its number of “direct” citations and on the right with the sum of the number of citations to that node and to all of its descendants. It is easy to see that, for this first hierarchy, the longest antichain is the set , which has an h-index of 2. However, had we considered the set of nodes without a hierarchy and with a number of citations as the one obtained by direct citations to them (, respectively), its h-index would be 4. Therefore, in this example, the use of the hierarchy does not give the highest h-index that could be obtained with the available publications.
On the other hand, had Figure 4a been part of a larger hierarchy, it might have been advantageous to have a node such as with a relatively high rank. The implication is that, in a bigger hierarchy, new and longer antichains could have been found, where could contribute to the generation of a greater h-index. The problem is that in the computation of the h-index, as described earlier, the inclusion of a node in an antichain precludes the inclusion of any ancestor or descendant node in the same antichain. Thus, a descendant and ancestor node cannot be considered independently in the computation of the h-index.
In a case such as GtoPdb, however, the information associated with a family may complement the information associated with an object (child) in that same family. Therefore, it is reasonable to ask that both parent and child could be counted in the computation of the h-index.
To deal with this problem, we introduce a transformation called lifting: for each internal node in the hierarchy, a surrogate node is created. becomes the parent of along with all the children of , and becomes a leaf. ’s rank is now determined only by its direct citations, and the new node enters with a number of direct citations equal to 0 and a rank equal to the sum of its children’s citations.
Figure 4b shows the result of applying this transformation to Figure 4a. Note how the set of all non-surrogate nodes now can be an antichain at rank 4.
After applying this transformation in the GtoPdb hierarchy, the results in Table 2 (Lifted row) show an h-index of 55 with data citations and 179 with links.
In the introduction, we mentioned that, for conventional publications, when two papers are similar, one may designate one of the papers as the parent, which will then receive all the citations for the child. Whether or not it is advantageous to combine the two depends on the authors. When there are two authors, the advisor with a high existing h-index might want to do this, but the student with few publications might not. A lifted representation of the publication hierarchy and computing the h-index of an author on that hierarchy would keep both advisor and student happy.
NCBI taxonomy. Among the many databases maintained by NCBI is the taxonomy database,13 which contains “a list of names that are determined to be nomenclaturally correct or valid (as defined according to the different codes of nomenclature), classified in an approximately phylogenetic hierarchy.”a The hierarchy contains some 2.3 million nodes, or taxons. For each node, the Web interface displays incoming links from other NCBI databases and Pubmed, which can be considered as citations from conventional literature. The structure of NCBI is similar to that of IUPHAR/BPS, as all nodes in the hierarchy are citable. There are approximately 3.2 billion incoming links for all nodes, of which 7.8 million are from Pubmed. For reasons described in the additional material, we confined our attention to the vertebrate and animal sub-hierarchies. Vertebrates are a subset of animals, and animals account for approximately half the nodes in the complete hierarchy.
The results on the NCBI taxonomy hierarchy are shown in Table 3, where they are divided between the results from only the incoming links from Pubmed and the results obtained using all the available links. Unsurprisingly, the h-index for animals is substantially greater than that of its sub-hierarchy of vertebrates. Moreover, measuring an h-index using incoming links gives a substantially higher result and indicates the magnitudes we might see if we measure h-indexes using links rather than conventional citations. Lifted hierarchies noticeably raise the results.
Hierarchy | Refs from | Subtree | h-index | Nodes | vis | dig |
---|---|---|---|---|---|---|
Given | Pubmed | animals | 1,181 | 1,147,717 | 47,880 | 5,724 |
Pubmed | vertebrates | 681 | 112,271 | 17,503 | 3,133 | |
All | animals | 3,794 | 1,147,717 | 489,692 | 16,712 | |
All | vertebrates | 2,065 | 112,271 | 43,282 | 9,254 | |
Lifted | Pubmed | animals | 1,301 | 1,272,867 | 47,665 | 6,560 |
Pubmed | vertebrates | 758 | 131,464 | 17,700 | 3,638 | |
All | animals | 3,887 | 1,272,867 | 494,769 | 18,284 | |
All | vertebrates | 2,148 | 131,464 | 45,996 | 10,371 |
Conclusion
This article demonstrates that databases can be treated similarly to authors and journals for the purpose of measuring scholarly impact. We have shown a natural and efficient way to compute the h-index of a hierarchy of citable units and to apply this to computing the h-index of databases. Preliminary results show that, for some well-known curated databases, the h-index, when measured by citations from conventional literature, is comparable with that of journals. When measured by links from other databases, it can be substantially higher. The results are preliminary because data citation is not as widely practiced as it should be.
Throughout this article, we have deliberately avoided any discussion of the rights and wrongs of using the h-index as a measure of the impact or quality of research output, which is a matter of continuing discussion.3,17 One empirical and theoretical observation is that the h-index is correlated with the square root of the total number of citations. In fact, the ratio is in the neighborhood of 0.5. This holds for our results when considering citations from the literature. However, the results are very different when we measure—as for the NCBI taxonomy—an h-index for incoming links, whose total is in the billions. Here, the ratio is an order of magnitude lower, indicating a skewed distribution that requires further investigation.
There are many more open problems. The most obvious is what one can do without an apparent hierarchy. If no classification scheme is available, some form of cluster analysis can nearly always find one. The more likely problem is that there are several classification schemes,1 and one may want to choose or combine them. Also, a classification scheme may not be hierarchical because a node may have more than one parent. Apart from coming up against the complexity issues described in the first section, one has the prior problem of how the citations for a given node should be distributed among its parents to achieve a sensible ranking. It may be that the notion of disjunctive citations16 could help with this. Finally, unlike conventional publications, databases evolve over time, yet for the purpose of citation, we want to treat them as conventional, immutable publications. This is a general problem with data citation and will also affect how we measure the impact of databases.
Acknowledgments
The authors would like to thank the reviewers for constructive criticisms and Simon Harding, Jonathan Bard, and Aris Filos-Ratsikas for their help. Huawei UK partially supported Peter Buneman. Dennis Dosso’s work was carried out while he was with the University of Padua. Most of the research was conducted while Matteo Lissandrini was at Aalborg University. Gianmaria Silvello was partially supported by the HEREDITARY Project as part of the European Union’s Horizon Europe research and innovation programme under grant agreement No GA 101137074. He Sun was supported by an EPSRC Early Career Fellowship (EP/T00729X/1).
More Online
To view the additional material referenced in the article, please visit https://dl.acm.org/doi/10.1145/3704723 and click on Supplemental Material.
Join the Discussion (0)
Become a Member or Sign In to Post a Comment