Scientific Web intelligence encompasses all techniques designed to extract information about aspects of scientific research on the Web. The focus in this article is on the most developed area: academic hyperlink analysis. Most computer scientists will know that the power of the search engine Google is partly due to its exploitation of the link structure of the Web. Google regards a link to a page as a kind of endorsement of its importance, especially if the link itself originates from an important page. Google’s founders, Sergey Brin and Lawrence Page, credit the idea to another tool of use to scientists: citation analysis. Citations in journal and conference articles are like hyperlinks in that they connect pairs of documents; a basic belief in citation analysis is that an article cited many times is more likely to have scientific value than an uncited one. This drives various measures used by governments, academic departments, and individual scholars to help evaluate the impact of scientific research. The Institute for Scientific Information’s (ISI) Web of Knowledge contains citation-based impact factors to help evaluate journal quality, in addition to its searchable citation database for finding articles citing and cited by given authors and publications. Computer scientists are likely to be equally familiar with CiteSeer.com, a free site that finds and indexes online scientific documents in a variety of formats, including from large numbers of computing conferences, also reporting citation statistics.
Over the past few years there has been considerable interest in whether the techniques of citation analysis could be directly applied to the Web to extract useful information about online research by counting or individually examining hyperlinks. This question was originally posed by Danish information retrieval experts Peter Ingwersen and Tomas Almind [1], and information scientists Ronald Rousseau [4] and Josep Manuel Rodríguez Gairín [5]. Six years later some answers have emerged. Hyperlinks are a useful source of information about connections to Web pages, whether from the perspective of an individual set of researchers’ pages (see the sidebar) or from a larger-scale Web mining perspective, but an overall picture of why links are created and the patterns that are present is important in order to set results in an appropriate context.
This piece focuses on links between university Web sites only, leaving aside the rest of the Web. Although researchers studying their own university Web site may be able to get an exhaustive list of its pages by scanning its directory structure, others, including commercial search engines, must rely primarily on link crawling to find pages and cannot guarantee complete site coverage. The crawler used to collect the data reported in this article typically operated by starting at a university’s home page and following links within the same site iteratively until all known pages had been visited. This covers the "publicly indexable set," adapting the terminology of Lawrence and Giles [3]. Crawling is not as simple as this, however, and every mature crawler needs code to cope with hundreds of special cases, exceptions, and pathological behaviors. The most important exceptions are as follows:
- Duplicate pages within the same university were rejected;
- The robots.txt convention was followed;
- Mirror sites were rejected, even if copied from uncrawled sites;
- Subsites with derivative domain names were included (for example, www.scit.wlv.ac.uk was included as part of www.wlv.ac.uk);
- Error correction was applied to the HTML (such as for missing quotes or tag ends); and
- Sites without HTML links on their home page were crawled from an alternative starting point, such as a list of departmental home pages.
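The crawling procedure described above can be sketched as a breadth-first traversal over an in-memory stand-in for fetched pages. All URLs and the `PAGES` table below are hypothetical illustrations, not data from the study; robots.txt handling, mirror rejection, and HTML repair are noted in comments but omitted for brevity.

```python
from collections import deque
from urllib.parse import urlparse

# Stand-in for fetching and parsing pages over HTTP: URL -> outlinks.
# (Illustrative URLs only. A real crawler would also honor robots.txt,
# reject mirror sites, and apply HTML error correction.)
PAGES = {
    "http://www.wlv.ac.uk/": ["http://www.wlv.ac.uk/research",
                              "http://www.scit.wlv.ac.uk/"],
    "http://www.wlv.ac.uk/research": ["http://www.wlv.ac.uk/"],
    "http://www.scit.wlv.ac.uk/": ["http://www.example.com/"],
}

def crawl(start, root):
    """Breadth-first crawl of the publicly indexable set: begin at the
    home page and follow links, keeping any host that matches the root
    domain or a derivative subdomain (e.g. www.scit.wlv.ac.uk counts
    as part of wlv.ac.uk)."""
    seen, queue = {start}, deque([start])
    while queue:
        url = queue.popleft()
        for link in PAGES.get(url, []):  # real code: fetch + extract links
            host = urlparse(link).netloc
            if (host == root or host.endswith("." + root)) and link not in seen:
                seen.add(link)           # 'seen' also rejects duplicate pages
                queue.append(link)
    return seen

print(sorted(crawl("http://www.wlv.ac.uk/", "wlv.ac.uk")))
```

Note how the derivative-domain rule is a simple suffix match on the host name, while the external link to example.com is discarded, keeping the crawl within one university's site.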
Given the range of problematic cases, in conjunction with the necessary restriction to link crawling, it is not possible to claim complete coverage of a large academic Web site, and almost certainly all crawls were incomplete in various ways. The very strong correlations with research activity found in various studies indicate (but do not prove) that these problems have not had a material effect on the data set.
Web pages are not the best counting unit for links between university Web sites, since academics frequently replicate links over many pages. For example, multi-institution research project Web sites often carry a link to the home page of each participating institution on every page. This is a case of one link motivation being multiplied by a site design decision to give many links. As a result, experiments have shown it is better to count links between directories or domains rather than pages, a technique known as the Alternative Document Model (ADM) [10]. For example, using the domain model, the count of links from domain A to domain B would be 1 if one or more pages in domain A linked to one or more pages in B; otherwise it would be 0. The domain count of links between two universities is obtained by totaling the domain counts over all pairs of domains. A similar domain aggregation model has been used by Google researchers to report link data [2].
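The domain variant of the ADM can be sketched in a few lines: page-level links are collapsed to distinct (source domain, target domain) pairs before counting. The link list below is invented for illustration; it shows two page-level links from the same source domain to the same target domain counting only once.

```python
from urllib.parse import urlparse

# Hypothetical page-level links from one university's site to another's.
page_links = [
    ("http://www.scit.wlv.ac.uk/p1.html", "http://www.bham.ac.uk/x.html"),
    ("http://www.scit.wlv.ac.uk/p2.html", "http://www.bham.ac.uk/y.html"),
    ("http://www.wlv.ac.uk/home.html",    "http://www.cs.bham.ac.uk/z.html"),
]

def domain_link_count(links):
    """Domain ADM: a source/target domain pair counts 1 if one or more
    pages in the source domain link to the target domain, else 0; the
    inter-university count is the total over all domain pairs."""
    pairs = {(urlparse(s).netloc, urlparse(t).netloc) for s, t in links}
    return len(pairs)

print(domain_link_count(page_links))  # replicated links collapse: 2, not 3
```

The replicated project-site links described above would inflate a page-level count but contribute only one unit per domain pair here.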
Why Are Academic Links Created?
Citation analysis was originally driven by the theories of the sociologist Robert Merton, who postulated that citations typically represented a cognitive connection between the content of the cited and citing documents. Subsequent work revealed a wide range of additional motivations, and similar investigations were needed for hyperlinks.
We now know that hyperlinks between university Web sites are totally different from citations. It seems that less than 1% have creation motivations equivalent to those for citations between two journal articles. But about 90% do have some connection to scholarly activity, whether teaching, general descriptions of research or researchers, or free online software or databases [12]. If the inlinks (or links from other university Web sites) to a university Web site were totaled, then this figure would represent a wide range of types of informal scholarly communication, but would certainly not be an estimate of its contribution to science, in the way that a citation count could claim to be.
With this in mind, there is no reason to believe that counts of links to a university Web site should be good predictors of research performance, but in fact they are. For the U.K., plotting inlink counts against research productivity for its 108 university institutions gives a near straight line (Spearman correlation coefficient 0.923, p < 0.001) [10], if individual Web links are appropriately aggregated using the ADMs. So inlink counts could be used with a reasonable degree of accuracy to predict institutional research productivity, but why is this the case, given that links rarely directly target research? Given the highly skewed nature of linking for individual pages, as discovered by topological investigations of Web properties, the linear relationship is doubly surprising. Size partly explains it: bigger universities tend both to attract more inlinks and to conduct more research. But even after accounting for university size, universities conducting more and better research attract more inlinks. Early theories for this were that "better" universities produce better Web pages, or that a "halo effect" led scholars to associate themselves with top institutions by linking to them. Both turned out to be false. The average number of inlinks per page (or per ADM domain) seems to be approximately the same across all universities within a country. Universities that conduct more and better research attract more inlinks simply because they produce more Web content, not because that content is, on average, more attractive to links [7]. This is again in contrast to citations: universities conducting better research have higher average citation rates.
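The Spearman coefficient quoted above is just the Pearson correlation of the two variables' ranks, which makes it robust to the skewed link distributions mentioned. A minimal stdlib implementation is sketched below; the inlink and productivity figures are invented monotone data for illustration, not the U.K. study's values.

```python
def ranks(xs):
    """Average ranks, 1-based; ties share the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                      # extend over a run of tied values
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman rho = Pearson correlation applied to the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Invented, perfectly monotone data: rho should be exactly 1.0.
inlinks = [1200, 950, 400, 310, 80]
research = [9.1, 8.7, 4.2, 3.9, 1.0]
print(round(spearman(inlinks, research), 3))
```

Because ranks discard magnitudes, one university with a vastly inflated raw inlink count affects the coefficient no more than any other rank swap would.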
Do different subjects publish different amounts on the Web, and therefore attract differing amounts of links? Unsurprisingly, this is true: computer science is the biggest Web user, and some disciplines, such as philosophy and ethics, are almost invisible [8]. But the relative visibility of disciplines varies enormously between countries, perhaps reflecting national differing research specializations. For instance, numerous high link impact academic sites in Taiwan are based upon technology and engineering (excluding computer science), but very few in Australia are [8].
Is geography also a factor? Since it takes no more effort to link to a page on the other side of the world than to one in the same country, has the Web overcome distance? In fact, geography is important even within a single country, with remote institutions being much less likely to be linked [11]. Given the whole world to choose from at no extra technical effort, an academic is still most likely to link to the institution next door. This applies to links across whole universities, but within individual subjects geography seems to be less important; perhaps sub-specialism matters more in that case. On an international scale geography is a factor too: inter-university links seem to be most common between countries sharing a common language. Moreover, in the European Union, English accounts for about 50% of link pages in most member states [9]. This is a logical extension of the dominance of the English language in top conferences and journals, and a reason for non-English-speaking scholars to consider Web publishing in English. The pattern may change radically in the future, however, as more Spanish-speaking countries become extensive Web publishers, in addition to sleeping giants like China.
Scientific Web Intelligence: Applications and Future Directions
One of the scholarly applications of hyperlinks is discussed in the sidebar: individuals can easily use commercial search engines to explore the context of their research area on the Web, a crude form of Web intelligence. The potential exists, however, for data mining approaches along the lines of those based upon citation analysis, such as discovering previously unidentified connections between different lines of research [6]. Another initiative is the ISI’s attempt to map the whole of science through citations, with fields and their interconnectivities identified by citation mining. Articles in the same specialty tend to cite each other and to have similar citing and cited-ness patterns. This allows fields to be clustered and their interrelationships mapped, but can the same be achieved on the Web? The advantages of the Web over large collections of journals are that it contains much more information, some of which will be more current. But there is an important reason why hyperlink-based science mapping will not work in the same way: the relative scarcity of hyperlinks between online articles in university Web sites means there is no compelling reason why pages on similar topics should interlink. An article that does not cite others in the same field risks rejection by referees, but there is rarely such a powerful incentive to link to similar pages. There are exceptions, however, such as copyright concerns necessitating link inclusion when using HTML creation software such as LaTeX2HTML. Because of the relative scarcity of links on the Web (not counting navigational links within the same Web site), the logical alternative is to cluster and map using the full text of pages. But this is a significantly more complex undertaking, with orders of magnitude more data, presenting a real challenge for computer science.
There are two competing candidates for text-based Web mining: invocation mining and text mining. Invocation mining is very similar to citation analysis and hyperlink analysis. The starting point is a list of text strings that will be sought in Web pages, such as journal names, article titles, or academics’ names. The analysis would then be based upon counts of invocations of these strings, and may be pattern-based or a comparative evaluation of summary statistics. In contrast, text mining does not start with a pre-set list of strings but analyzes the full text of all documents. This is a well-plowed furrow, with a range of standard approaches available, including clustering and classification. Nevertheless, the task of applying these techniques to the academic Web with the purpose of extracting academic Web intelligence is relatively new and unexplored, a fertile area for future computer science research.
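The invocation-mining idea above can be sketched directly: fix a list of target strings and count how many pages mention each one. The page texts and journal names below are illustrative stand-ins for crawled full text, and counting once per page (rather than once per occurrence) is one design choice among several a real study would compare.

```python
import re
from collections import Counter

# Hypothetical page texts; a real study would use crawled full text.
pages = [
    "The Journal of Documentation published a study of link analysis.",
    "Scientometrics and the Journal of Documentation cover citation analysis.",
    "This page mentions Scientometrics twice: Scientometrics.",
]

# Pre-set list of strings to seek, e.g. journal names.
targets = ["Journal of Documentation", "Scientometrics"]

def invocation_counts(pages, targets):
    """Count how many pages invoke (mention) each target string,
    case-insensitively, at most once per page."""
    counts = Counter()
    for text in pages:
        for t in targets:
            if re.search(re.escape(t), text, re.IGNORECASE):
                counts[t] += 1
    return counts

print(invocation_counts(pages, targets))
```

The resulting counts play the role that citation or inlink counts play in the analyses described earlier, and the same aggregation questions (per page, per domain, per site) apply.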