Research and Advances
Computing Profession

Can We Measure the Impact of a Database?

If we want to measure the impact of a database, can we use its organization to treat it the same way we treat any other publishing agent, such as a journal or an author?

Posted
documents and checklist

Databases publish data. This is undoubtedly the case for scientific and statistical databases, which have largely replaced traditional reference works. Database and Web technologies have led to an explosion in the number of databases that support scientific research, for obvious reasons: Databases provide faster communication of knowledge, hold larger volumes of data, are more easily searched, and are both human- and machine-readable. Moreover, they can be developed rapidly and collaboratively by a mixture of researchers and curators. For example, more than 1,500 curated databases are relevant to molecular biology alone.10 The value of these databases lies not only in the data they present but also in how they organize that data.

In the case of an author or journal, most bibliometric measures are obtained from citations to an associated set of publications. There are typically many ways of decomposing a database into publications, so we might use its organization to guide our choice of decompositions. We will show that when the database has a hierarchical structure, there is a natural extension of the h-index that works on this hierarchy.

Although the main focus of this article is on evaluating the h-indexes of some well-known databases, this was not its original motivation. One of the authors was involved in a project4 to automatically generate a set of conventional papers to give credit—as authors—to the 1,000 or so researchers who had contributed to a database.8 By creating one publication for the whole database—as might happen with data papers that periodically publish data summaries and are citation proxies for databases6—we generate a document with an unhelpfully huge number of authors. Also, these authors receive only one additional publication credit, regardless of whether they contributed to numerous sections of the database or just a single part. On the other hand, generating thousands of documents, one for each “object” in the database, is almost equally useless, for while the authors are now associated with their areas of expertise, they are unwittingly guilty of having minimal publishable units. The tension between those two extremes underlies the rationale for the h-index.9 Therefore, the obvious question is: Can we extend the h-index to work on databases and—given that there is a hierarchical structure—is there a natural extension of the h-index to hierarchies? We then have at least one non-trivial measure of the importance of a database. However, even if this measure is not of interest, the decomposition that produced it may well be, perhaps as a starting point for the curators, who might want to find a useful set of publications to associate with the database for the benefit of those who would like to have their work properly cited.

Hierarchies are used frequently in curated databases. As in the three database examples we discuss in this article,8,13,15 they are based on some hierarchical classification scheme, taxonomy, or ontology. Moreover, datasets based on a file system or data format such as JSON or XML have an intrinsic hierarchical structure (we will refer to all of these as databases). We also note that when two very similar papers are published (for example, a preprint and its final, peer-reviewed version), citation analyzers such as Google Scholar have the option to transfer citations from one paper to the other (the parent). Given that a paper can have at most one parent and that the parent relationship is acyclic, there is at least a partial hierarchy (a set of tree-like structures) already present in the data structure maintained by the software.

Given a hierarchy, how do we use it to limit the possible decompositions into sets of publications? Consider the hierarchy in the DrugBank database, shown in Figure 1. In this database, citations are only to the terminal nodes or leaves of the tree. If we want a citation count for some higher-level node, we would use the sum of the citation counts of the leaves below that node, much as we would take the citation count for an issue of a journal to be the sum of the citation counts of the papers within it. We propose using a subset of these nodes as a possible decomposition; however, we cannot use an arbitrary subset because it allows double-counting of citations if we allow a node and one of its ancestors to appear in the same decomposition. For example, we could not allow the Lactones node (node g at the “classes” level) and Lovastatin (terminal node n) as a candidate set for the computation of an h-index. To do so would be double-counting citations to Lovastatin. Thus, we restrict our attention to antichains of nodes, sets of nodes in which no node is an ancestor of another node in that set. Looking at the Linnean-style stratification in Figure 1, the kingdom, superclass, and class nodes each constitute antichains, as do the drug nodes. The subclass nodes can have other subclass nodes as ancestors and thus are not antichains.

Figure 1.  Partial representation of the Drugbank hierarchical structure. In Drugbank, only the leaves of the hierarchy are directly cited; for example, we can see that the citations of Lactones are the aggregation of the citations to the drugs (we report only Lovastatin and Erythromycin) belonging to the class.

The h-index, which is used almost universally to measure an author’s output and often the importance of a journal, is defined as follows:

Given a set S of publications, its h-index is the largest number n for which there is a subset of S of size n in which each publication has at least n citations.

The extension to hierarchies is immediate:

Given a hierarchy H of publications, its h-index is the largest number n for which there is an antichain in H of size n in which each publication has at least n citations.

We will formalize this in the following section, but before doing that we should make some important observations. First, when the hierarchy is “flat” (there is no ordering), the two definitions coincide. Second, although a set has an exponential number of subsets, its h-index can be efficiently computed by sorting the set. Similarly, a hierarchy can have an exponentially large number of antichains, and we will demonstrate that, by using the appropriate data structures, the h-index of a hierarchy can be evaluated with the same efficiency. This is crucial if we are going to apply it to large databases.

Third, using a hierarchy to constrain the possible decompositions is essential. For example, based on a reduction from the partition problem, one can show that the more general problem of finding the maximal h-index under an arbitrary partition is strongly NP-hard,7 and there are algorithms and complexity research on improving and maximizing the citation indices (for example, h-index, g-index, and i10-index) under different merging rules.7,12,14

Finally, much of this research can be seen as an attempt to “game” the h-index—finding some merging strategy that enhances one’s h-index. If, for a hierarchical organization, we equivalently define the h-index of a hierarchy as the maximal h-index of any antichain within the hierarchy, then it might seem as though we too are engaged in gaming. This is not the case; we are merely employing what appears to be the most natural extension of the h-index definition to hierarchies.

In the following sections, we first formalize the problem and develop some simple results on antichains that allow the maximal h-index to be computed efficiently. Then we provide three examples of databases from which we can extract an h-index and look at some of the details of their hierarchical organization. We conclude with some speculation on how these results might be further developed.

There are some major caveats. The first is that we are not going to comment on the justification or fairness of the h-index and its variants;2,17 we only observe that it is almost universally adopted in measuring the impact of both authors and journals, and—as we have noted above—it may be useful in helping decompose a database into citable units.

Second, the results we present here are preliminary. As we have already stated, the practice of data citation is still in its infancy; people still fail to cite databases or cite them improperly. For comparison, we have included the results of using URLs as citations.

Third, we collected the relevant data in the second half of 2022, prior to submitting this article in October of that year. The unfortunate delay in publication should be taken into account should one want to repeat these experiments. Hence, the results we give should in no way be taken as an accurate or repeatable measure of the importance of the databases we examine. They simply demonstrate appropriate techniques and the feasibility of measuring the h-index at scale.

The h-Index of a Hierarchy

Hierarchies and antichains.  Our starting point is the assumption that we can model a database as a hierarchy together with a citation count for each node in the hierarchy. As described in the introduction, to avoid double counting citations, we want to compute an h-index on the antichains in that hierarchy. To illustrate this, Figure 2 shows a hierarchy in which the root {A} constitutes an antichain, as does {B,G,N} and also the leaves. In this diagram, the level of a node is determined by its citation count, or rank. Even though the number of antichains in a hierarchy can be exponentially large, to compute the h-index, we can use the rank to cut down the number of antichains we need to examine. To see this, we need a small amount of formal development.

Figure 2.  A hierarchical citation structure in which the level of each node is determined by its rank—the number of citations to it. Columns: R, the level; N, the length of the maximal l-antichain at that level; and H, the h-index for that antichain.
Definition 1.

A hierarchy is a partial order (H,) such that for all a,b,cH, ab and ac implies that bc or cb.

This definition of hierarchy ensures that a node can have at most one parent. We note that n is the parent of a node n if it is the smallest node nH, distinct from n, such that nn.

Definition 2.

A set AH is called antichain if, for any distinct pair n1,n2A, neither n1n2 nor n2n1.

Definition 3.

A ranked hierarchy (H,,r) is a hierarchy (H,) together with a rank (or level) function r:HN, such that for all n1,n2H, if n1n2 then r(n1)r(n2).

Given a subset S of H, we will use the term minimal elements of S to refer to those nodes nS for which there is no nS, distinct from n, such that nn. That is, the minimal elements of S are the lowest nodes in the hierarchy belonging to S.

Definition 4.

Given a ranked hierarchy (H,,r), an l-antichain in H is an antichain A such that for all nA:

  1. r(n)l,

  2. nH if nn, then r(n)<l.

In other words, an l-antichain is a set in which each node has a rank of at least l and all its children have a rank of less than l. In Figure 2, the set {H,K} is a 7-antichain and the leaves form a 1-antichain. Note that a leaf (a node with no children) with a rank of at least l could be a member of an l-antichain.

We use the term rank-minimal to describe the elements of a set that are minimal with respect to the rank function r, that is, those elements nS for which r(n) is minimum for S.

The following result guarantees that, given an antichain with minimal rank l, it is possible to find an l-antichain with no fewer elements.

Proposition 1.

Let A be an antichain in a ranked hierarchy (H,,r) with a rank-minimal node of rank l. Then, there exists an l-antichain AH such that |A||A|.

The proof, provided in the additional material, is straightforward; one constructs the antichain by a process that is similar to the top-down algorithm, which we describe shortly.

From this, we can obtain the main result. The h-index of an antichain AH in a ranked hierarchy (H,,r) is defined as h(A)=max{|S||SAnS.r(n)|S|}.

Proposition 2.

For any antichain A in a ranked hierarchy, there is an l-antichain A such that h(A)h(A).

Proof.

Let l=h(A) be the h-index of A. Then there is a subset SA for which r(A)|S|=l for all nS. A rank-minimal element of S will have rank ll and Prop. 1 gives us an l-antichain S with at least as many elements as S; that is, S has an h-index no less than l.

The importance of this result is that we only have to search maximal l-antichains to find the h-index of any antichain in H. The number of such antichains is not greater than the number of nodes, which guarantees us at least a polynomial-time algorithm. In fact, by pursuing a top-down approach, we can obtain a very efficient algorithm that, once the hierarchy is constructed, will, in practice, only visit a subset of the nodes in the hierarchy.

A top-down algorithm.  Consider the l-antichains in Figure 2. At the root, we have a 29-antichain A of length 1 and below it a 24-antichain F of length 1. Proceeding downward, the first longer antichain is the 7-antichain {H,K} of length 2. Further down, we find the 5-antichain {B,J,L,M,N} of length 5. This is the l-antichain of greatest h-index: It has 5 nodes, all with rank greater or equal to 5. As we proceed downward, the level is strictly decreasing while the length of the antichains is non-decreasing. When the latter overtakes the former, we have just “overshot” the antichain that yields the maximal h-index and we can stop. Note that we stopped before examining all the nodes in the hierarchy.

In the following, we use the notation MIN(S) for the minimal elements of a subset SH, that is, an antichain. MAX is defined similarly. An l-antichain can thus be defined as MIN{nH|r(n)l}.

In order to compute successively lower antichains, the algorithm maintains, for decreasing values of l, the following two antichains:

  1. LCHAIN=min{nH|r(n)l}, that is, the l-antichain of maximum cardinality among all l-antichains;

  2. DCHAIN=max{nH|r(n)<l}, that is, the antichain of nodes that are maximal (with respect to ) in the set of nodes with rank less than l.

These two antichains have the following properties:

  1. the children of any element in LCHAIN are contained in DCHAIN;

  2. the next lowest rank l<l of any node in the hierarchy is possessed by at least one node in DCHAIN.

Also, we observe that: (i) a node n is in LCHAIN iff r(n)l and for all children m of n, r(m)<l; (ii) a node n is in DCHAIN iff r(n)<l and the parent p of n is such that r(p)l.

Algorithm 1 details the procedure used to find the antichain that guarantees the maximum h-index. On lines 3 and 13, the nodes with the highest rank (that is, citation count) l are found in the DCHAIN. These nodes are removed from the DCHAIN (line 7) and added to the LCHAIN (line 8). The children of these nodes are then added to DCHAIN (lines 9 and 10). Then, in lines 11 and 12, all of the non-minimal nodes in LCHAIN are removed, making it a proper l-antichain. The process continues until the cardinality of LCHAIN is bigger than the highest rank l in DCHAIN. At this point, each node in DCHAIN has a rank l. Thus, even if it were to be moved from DCHAIN to LCHAIN, there would not be an increase in the value of the h-index for LCHAIN. At this point, the algorithm can stop.

Algorithm discussion.  Referring to Algorithm 1, the most obvious thing to do is to use heaps to implement DCHAIN and LCHAIN. The heap values for each node in DCHAIN should be the node rank, while the values for those in LCHAIN should be the maximal rank of the children of the node, which can be computed at line 9. With these representations, the “search” operations on lines 3, 6, 11, and 13 all incur unit cost, while the insert or delete operations on lines 7, 8, 10, and 12 are all (Ologn) in the length of the antichain. This gives a bound of O(nlogn) in the hierarchy size for the whole algorithm.

A further observation is that at line 9, we need only consider child nodes for inclusion if their rank is greater than the current length of LCHAIN; if this is not the case, they cannot participate in the final antichain, and they together with their descendants in the hierarchy may be safely ignored. In describing the results for the various databases in the next section, we have included two figures: visited (vis.), the number of nodes whose rank was interrogated; and digested (dig.), the number of nodes that entered DCHAIN. The algorithm’s running time will be bounded by a constant in vis. and a constant times nlogn in dig. In the measurements for a whole hierarchy (not the truncated ones where we cut off the hierarchy at a certain level), the digested nodes accounted for less than 10% of the total.

Another observation is that the algorithm requires that the set of nodes as input is a hierarchy, and thus that this hierarchy is built before this algorithm is run. Finally, we note that in the degenerate case of a hierarchy made of only leaves, the algorithm requires O(nlogn) time to create the LCHAIN heap.

Experimental Analysis

Using this algorithm, we compute and analyze the computation of the h-index of three widely used databases: Drugbank, which we describe in the first section; the IUPHAR/BPS Guide to Pharmacology8 (GtoPdb), which is an open-access database on the biological targets of drugs and other molecules; and the taxonomic database curated by the National Center for Biotechnology Information (NCBI), which maintains one of the most comprehensive and best-organized collections of medical data.11 The additional material online details all of the information about these databases and how we collected the corresponding citation counts.

Drugbank.  The results of the analysis on Drugbank are shown in Table 1, where median and maximum refer to the citation counts of the antichain that yields the given h-index (the median is sometimes given in reporting the h-index of a journal). The visited (vis.) and digested (dig.) columns refer to the discussion about Algorithm 1. Only a few nodes are leaves in the antichain for the entire hierarchy. The Leaves/drugs row shows the h-index for the “flat” hierarchy of the drugs themselves, that is, the h-index obtained considering only the drugs as a set of “publications” that receive citations, without considering the hierarchy of the database. The Class and Subclass rows show the results for the hierarchies obtained by removing all nodes below those strata. The table clearly shows the advantage to computing the h-index on a full hierarchy and that constraining the antichains to a given stratum reduces the h-index. We also see that the algorithm calculates the h-index visiting approximately one-third of the nodes in the entire hierarchy.

Table 1.  H-index and statistics on Drugbank. Full (top row): the calculations on the full hierarchy; Leaves/drugs (second row): those on the flat hierarchy (only leaves).
hierarchyh-indexmedianmaxnodesvis.dig.
Full6910026011,8033,743417
Leaves/drugs33447610,303
Subclass631051,369837541276
Class491202,327328287154

GtoPdb.  The IUPHAR/BPS Guide to Pharmacology (GtoPdb) is an open-access relational database with an accompanying website on the biological targets of drugs and other molecules. As shown in Figure 3, GtoPdb has a hierarchical structure. It combines the expertise of approximately 1,000 researchers worldwide and was an early stimulus for data citation5 to give credit to the contributors. Here, however, we are using it to assess credit to the database as a whole.

Figure 3.  Part of the IUPHAR/BPS hierarchical structure. The root node represents the whole database, and the family structure is not stratified, as families may have subfamilies. All the nodes in the hierarchy can be independently cited; for example, we show some sample citation numbers for the GPCR branch of the tree, where the internal nodes can receive direct citations (#cit) that could be also aggregated (#cit agg) with the citations of the child nodes.

A critical difference between Drugbank and GtoPdb is that in GtoPdb, there are citations to the intermediate nodes in the hierarchy. In fact, the data for these “family” nodes often resembles short conventional papers with descriptive material and statistics on the targets (leaf nodes) included in the family. This raises the possibility that we might want to treat a family node as independent of the nodes it describes; we introduce a lifting transformation that allows us to do this.

Table 2 presents the results for the GtoPdb database divided into the horizontal sections Citations and Links. The first section corresponds to results obtained with citations extracted from the citing papers, while the second part is obtained by accounting for Web links as citations.

Table 2.  h-index calculations for GtoPdb. Full (top rows) are the calculations on the full hierarchy; No (second rows) are those on the flat hierarchy (only leaves); and Lifted (bottom rows) are those on the lifted hierarchy.
Cit. typeHierarchyh-indexmedianmaxnodesvis.dig.
CitationsFull55772764,0791,185275
No3244944,079
Families5486276807572253
Lifted55772764,8721,185275
LinksFull1672518804,0791,584419
No1722539094,079
Families1352983,013807588211
Lifted1792659094,8721,604468

As with Drugbank, the first two rows of each section in the table show the analysis on the full hierarchy and on the “flat” collection of nodes, which now contains a substantial number of cited intermediate (family) nodes together with the leaf target nodes. The “families only” rows (Families) show that little is lost by removing the leaves from the hierarchy, suggesting that many citations are also given to the intermediate nodes in the hierarchy. The meaning of the “lifted hierarchy” rows (Lifted) is discussed below.

It is also interesting to note that families of nearly all heights—where by height of a node we mean the maximum length of a path to a leaf node—figure in the antichain that provides the h-index for the full hierarchy. The figures are: five at height 3, nine at height 2, four to leaves, and the remaining 37 to bottom-level families at height 1.

Lifting.  For reasons mentioned in the additional material online, the Links section of Table 2 is based on highly skewed citation counts and the results are almost meaningless. We have included it in this section because it shows an interesting phenomenon. We see—perhaps surprisingly—that computing the h-index on the full hierarchy gives a lower figure than that for the set of nodes with no structure (167 vs. 172). It appears that creating a hierarchy on a set of publications, in this case the GtoPdb webpages, and summing the citation counts as we have proposed, does not necessarily increase the h-index with respect to the simple set of all the publications.

Now consider the hierarchy in Figure 4a. Each node is annotated on the left with its number of “direct” citations and on the right with the sum of the number of citations to that node and to all of its descendants. It is easy to see that, for this first hierarchy, the longest antichain is the set {N3,N4}, which has an h-index of 2. However, had we considered the set of nodes {N1,N2,N3,N4} without a hierarchy and with a number of citations as the one obtained by direct citations to them (4,5,4,5, respectively), its h-index would be 4. Therefore, in this example, the use of the hierarchy does not give the highest h-index that could be obtained with the available publications.

Figure 4.  Example of one hierarchy (a) and its corresponding “lifted” version (b). Once again, R is the value of the rank at each level, N is the cardinality of the maximum antichain at each level, and H is the h-index of that antichain. Below each hierarchy, its h-index is highlighted.

On the other hand, had Figure 4a been part of a larger hierarchy, it might have been advantageous to have a node such as N1 with a relatively high rank. The implication is that, in a bigger hierarchy, new and longer antichains could have been found, where N1 could contribute to the generation of a greater h-index. The problem is that in the computation of the h-index, as described earlier, the inclusion of a node in an antichain precludes the inclusion of any ancestor or descendant node in the same antichain. Thus, a descendant and ancestor node cannot be considered independently in the computation of the h-index.

In a case such as GtoPdb, however, the information associated with a family may complement the information associated with an object (child) in that same family. Therefore, it is reasonable to ask that both parent and child could be counted in the computation of the h-index.

To deal with this problem, we introduce a transformation called lifting: for each internal node Ni in the hierarchy, a surrogate node Vi is created. Vi becomes the parent of Ni along with all the children of Ni, and Ni becomes a leaf. Ni’s rank is now determined only by its direct citations, and the new node Vi enters with a number of direct citations equal to 0 and a rank equal to the sum of its children’s citations.

Figure 4b shows the result of applying this transformation to Figure 4a. Note how the set of all non-surrogate nodes {N1,N2,N3,N4} now can be an antichain at rank 4.

After applying this transformation in the GtoPdb hierarchy, the results in Table 2 (Lifted row) show an h-index of 55 with data citations and 179 with links.

In the introduction, we mentioned that, for conventional publications, when two papers are similar, one may designate one of the papers as the parent, which will then receive all the citations for the child. Whether or not it is advantageous to combine the two depends on the authors. When there are two authors, the advisor with a high existing h-index might want to do this, but the student with few publications might not. A lifted representation of the publication hierarchy and computing the h-index of an author on that hierarchy would keep both advisor and student happy.

NCBI taxonomy.  Among the many databases maintained by NCBI is the taxonomy database,13 which contains “a list of names that are determined to be nomenclaturally correct or valid (as defined according to the different codes of nomenclature), classified in an approximately phylogenetic hierarchy.”a The hierarchy contains some 2.3 million nodes, or taxons. For each node, the Web interface displays incoming links from other NCBI databases and Pubmed, which can be considered as citations from conventional literature. The structure of NCBI is similar to that of IUPHAR/BPS, as all nodes in the hierarchy are citable. There are approximately 3.2 billion incoming links for all nodes, of which 7.8 million are from Pubmed. For reasons described in the additional material, we confined our attention to the vertebrate and animal sub-hierarchies. Vertebrates are a subset of animals, and animals account for approximately half the nodes in the complete hierarchy.

The results on the NCBI taxonomy hierarchy are shown in Table 3, where they are divided between the results from only the incoming links from Pubmed and the results obtained using all the available links. Unsurprisingly, the h-index for animals is substantially greater than that of its sub-hierarchy of vertebrates. Moreover, measuring an h-index using incoming links gives a substantially higher result and indicates the magnitudes we might see if we measure h-indexes using links rather than conventional citations. Lifted hierarchies noticeably raise the results.

Table 3.  h-index calculations for the NCBI taxonomy. Given (first rows) refers to the NCBI taxonomy without modifications, and Lifted (bottom rows) are those on the lifted hierarchy. Pubmed (top rows) are the calculations counting only the citations from Pubmed, and All are the calculations counting Pubmed and incoming Web links as citations.
HierarchyRefs fromSubtreeh-indexNodesvisdig
GivenPubmedanimals1,1811,147,71747,8805,724
Pubmedvertebrates681112,27117,5033,133
Allanimals3,7941,147,717489,69216,712
Allvertebrates2,065112,27143,2829,254
LiftedPubmedanimals1,3011,272,86747,6656,560
Pubmedvertebrates758131,46417,7003,638
Allanimals3,8871,272,867494,76918,284
Allvertebrates2,148131,46445,99610,371

Conclusion

This article demonstrates that databases can be treated similarly to authors and journals for the purpose of measuring scholarly impact. We have shown a natural and efficient way to compute the h-index of a hierarchy of citable units and to apply this to computing the h-index of databases. Preliminary results show that, for some well-known curated databases, the h-index, when measured by citations from conventional literature, is comparable with that of journals. When measured by links from other databases, it can be substantially higher. The results are preliminary because data citation is not as widely practiced as it should be.

Throughout this article, we have deliberately avoided any discussion of the rights and wrongs of using the h-index as a measure of the impact or quality of research output, which is a matter of continuing discussion.3,17 One empirical and theoretical observation is that the h-index is correlated with the square root of the total number of citations. In fact, the ratio is in the neighborhood of 0.5. This holds for our results when considering citations from the literature. However, the results are very different when we measure—as for the NCBI taxonomy—an h-index for incoming links, whose total is in the billions. Here, the ratio is an order of magnitude lower, indicating a skewed distribution that requires further investigation.

There are many more open problems. The most obvious is what one can do without an apparent hierarchy. If no classification scheme is available, some form of cluster analysis can nearly always find one. The more likely problem is that there are several classification schemes,1 and one may want to choose or combine them. Also, a classification scheme may not be hierarchical because a node may have more than one parent. Apart from coming up against the complexity issues described in the first section, one has the prior problem of how the citations for a given node should be distributed among its parents to achieve a sensible ranking. It may be that the notion of disjunctive citations16 could help with this. Finally, unlike conventional publications, databases evolve over time, yet for the purpose of citation, we want to treat them as conventional, immutable publications. This is a general problem with data citation and will also affect how we measure the impact of databases.

Acknowledgments

The authors would like to thank the reviewers for constructive criticisms and Simon Harding, Jonathan Bard, and Aris Filos-Ratsikas for their help. Huawei UK partially supported Peter Buneman. Dennis Dosso’s work was carried out while he was with the University of Padua. Most of the research was conducted while Matteo Lissandrini was at Aalborg University. Gianmaria Silvello was partially supported by the HEREDITARY Project as part of the European Union’s Horizon Europe research and innovation programme under grant agreement No GA 101137074. He Sun was supported by an EPSRC Early Career Fellowship (EP/T00729X/1).

More Online

To view the additional material referenced in the article, please visit https://dl.acm.org/doi/10.1145/3704723 and click on Supplemental Material.

    References

    • 1. Ashburner, M. et al. Gene ontology: Tool for the unification of biology. Nature Genetics 25, 1 (2000), 2529.
    • 2. Bornmann, L. and Daniel, H.-D. What do we know about the h index? J. of the American Society for Information Science and Technology 58, 9 (2007), 13811385.
    • 3. Brito, R. and Navarro, A.R. The inconsistency of h-index: A mathematical analysis. J. of Informetrics 15, 1 (2021), 101106.
    • 4. Buneman, P. et al. Why data citation isn’t working, and what to do about it. Database 2020 (2020), baaa022.
    • 5. Buneman, P. and Silvello, G. A rule-based citation system for structured and evolving datasets. IEEE Data Eng. Bull. 33, 3 (2010), 3341.
    • 6. Candela, L., Castelli, D., Manghi, P., and Tani, A. Data journals: A survey. J. of the Association for Information Science and Technology 66, 9 (2015), 17471762.
    • 7. de Keijzer, B. and Apt, K.R. The h-index can be easily manipulated. Bulletin of the EATCS 110 (June 2013), 79–85.
    • 8. Harding, S.D. et al. The IUPHAR/BPS Guide to Pharmacology in 2018: Updates and expansion to encompass the new guide to immunopharmacology. Nucleic Acids Research 46, D1 (2018), D1091D1106.
    • 9. Hirsch, J.E. An index to quantify an individual’s scientific research output. Proceedings of the National Academy of Sciences 102, 46 (2005), 1656916572.
    • 10. Imker, H.J. 25 Years of molecular biology databases: A study of proliferation, impact, and maintenance. Frontiers in Research Metrics and Analytics 3 (2018).
    • 11. McEntyre, J. and Ostell, J. The NCBI Handbook. National Center for Biotechnology Information (2002).
    • 12. Pavlou, C. and Elkind, E. Manipulating citation indices in a social context. In Proceedings of the 2016 Intern. Conf. on Autonomous Agents & Multiagent Systems. International Foundation for Autonomous Agents and Multiagent Systems (2016).
    • 13. Schoch, C. L. et al. NCBI Taxonomy: A comprehensive update on curation, resources and tools. Database 2020 (2020).
    • 14. Van Bevern, R. et al. H-index manipulation by merging articles: Models, theory, and experiments. Artificial Intelligence 240 (2016), 1935.
    • 15. Wishart, D. S. et al. Drugbank 5.0: A major update to the drugbank database for 2018. Nucleic Acids Research 46 (2018), D1074D1082.
    • 16. Wu, Y. et al. Data citation: Giving credit where credit is due. In Proceedings of the 2018 Intern. Conf. on Management of Data. ACM (2018), 99114.
    • 17. Yong, A. et al. Critique of Hirsch’s citation index: A combinatorial Fermi problem. Notices of the AMS 619 (2014), 10401050.

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

The Latest from CACM

Shape the Future of Computing

ACM encourages its members to take a direct hand in shaping the future of the association. There are more ways than ever to get involved.

Get Involved

Communications of the ACM (CACM) is now a fully Open Access publication.

By opening CACM to the world, we hope to increase engagement among the broader computer science community and encourage non-members to discover the rich resources ACM has to offer.

Learn More