Just as explorers need maps and compasses to aid their travel, so too do computing professionals need effective aids for exploring the computing literature. A modest number of efforts have aimed to address these needs of our community. For example, the Networked Computer Technical Reference Library (www.ncstrl.org) grew out of efforts in the early 1990s to collect technical reports from academic and research departments. In the late 1990s, the Computing Research Repository (arxiv.org/corr/) was developed as a supplement, allowing (co)authors of papers on computing to self-archive, that is, submit to the arXiv.org e-print archive. Yet while these services play useful roles, their focus is on collecting and archiving, rather than on supporting exploration.
Today, large content collections in the computing field can be searched and accessed through sites like the ACM Digital Library (portal.acm.org/dl.cfm), the IEEE Computer Society Digital Library (www.computer.org/portal/site/csdl/), and the Digital Bibliography & Library Project (dblp.uni-trier.de/). Their services are typical of modern digital library systems. Additional services are provided by CiteSeer (www.citeseer.ist.psu.edu), which leverages powerful analysis methods to extract citation data from online documents, and then allows citation counts and links to be considered, to broaden browsing and searching.
In this article we provide illustrations of how richer support for exploratory search can be afforded to the computing community, building upon a long history of investigations at Virginia Tech (for example, [5]), most recently related to our work with the Computing and Information Technology Interactive Digital Educational Library (CITIDEL; www.citidel.org). We begin with a brief explanation of how clustering can aid exploration (also see Hearst’s article in this section). Then, we describe how a combination of methods facilitates exploration that can involve rapid switching between searching, browsing in the ACM Computing Classification System, following links between works, and selecting points in a scatter-plot (2D grid) according to any of a number of pairs of facets. Finally, we explain how Stepping Stones & Pathways (SSP) integrates visualization, clustering, and Bayesian inference to support exploration and the resolution of complex information needs that can be met by sets of related documents. Preliminary results suggest SSP also may facilitate searching when faced with difficult queries, for which no current information retrieval system yields high effectiveness.
Clustering Results
CITIDEL has approximately a half-million metadata records, with more to be added. It was developed to support teaching and learning in the computing field, as a part of the National Science Digital Library (www.nsdl.org). Because education-oriented systems should support exploration, we turned to clustering integrated with visualization to help in that regard [6]. A basic problem is that for people unfamiliar with a large collection of literature, such as occurs in the computing field, the results of a search are too voluminous to manage. When the results are organized well, and a smaller number of groups can be examined instead of a very long list of disparate works, users can save time when they explore. Our usability studies have shown that even such a simple approach can be beneficial [6]. Figure 1 illustrates the clusters identified by applying the Carrot2 system [7] to the results of searching CITIDEL for “digital library.”
Selecting an interesting cluster, for example, “Library of Congress,” in the expandable list on the left of Figure 1 causes all of the four documents belonging to that cluster to be displayed on the right; one can explore among clusters, and then among the documents in a cluster.
Integrated Visualization
Digital libraries often have a variety of types of information that can be leveraged in an integrated manner to provide even richer support. We developed CitiViz as a front-end to CITIDEL, combining searching, text mining, and information visualization [8]. As can be seen in Figure 2, it applies two major visualization techniques—a hyperbolic tree of a hierarchical classification system and a 2D scatter-plot graph.
The visual interface provides overviews of retrieved results from CITIDEL, or a subcollection thereof, in this case from metadata provided from the ACM Digital Library. Figure 2, in the upper left, shows the results for the query “object-oriented programming.”
The retrieved results are classified according to the ACM Computing Classification System (1998 Version; www.acm.org/class/1998). A hierarchical view of that subset of the classification system which covers the retrieved results is shown using a hyperbolic tree in the upper right of Figure 2. Users can explore through “focus + context” navigation, that balances the amount of local detail and global context presented. Each node represents a category in the system; nodes have attached a bubble whose size reflects the number of documents that have that category assigned.
If a user selects a bubble in the hyperbolic tree, the documents associated are plotted in the middle right, using a 2D graph, whose axes are user selectable, in similar fashion to the visualization built into our Envision system [5]. Here the facets of interest are date and rank, which is based on estimated relevance. However, instead of having dots for each retrieved document, or simple icons as was allowed with Envision, a tower is used to represent a document. If a tower is selected, that tower is shown, somewhat larger, in the lower left, where it is clear which category is assigned to each level of the tower. Also, if that document is connected with another document through a citation link, that link is illustrated, so both documents in the pair are identified. Metadata details about the selected document are shown at the bottom right.
Categories selected while browsing the hyperbolic tree also are shown in the left middle. If one of these is selected, a list appears below it, with a title for each of the works to which that category is assigned. We have multiple views of the search results, with varying levels of detail, and with several different types of contextualization. Those exploring the collection can combine searching, following citation links, navigating through category systems, organizing results according to a number of different facets of interest, and choosing varying levels of detail. Our experiments have shown this approach to be powerful and helpful [8].
Stepping Stones & Pathways
Both of the approaches to aid exploration described here begin with the results of a search. However, people exploring a collection may have trouble with searching, and may need some help to get started in that activity. Our SSP system aims to address such situations [2, 3]. It provides special support for those with complex information needs, and combines searching with visualization of, and navigation through, paths that connect sequences of related topics.
SSP is more powerful than conventional search systems since it loosens a key assumption built into other search engines. Those systems assume that users want to find one or more individual documents and that each should satisfy the user’s information need. In real life, however, a complex information need may be satisfied only by a set of documents:
- When preparing a meal from scratch, one might need separate recipes for different parts, for example, the salad and the salad dressing;
- When preparing a syllabus for an advanced course, one might want a series of related supplemental readings, that use common terminology, and that flow well together;
- When identifying a treatment for an unusual set of symptoms, one might need papers that relate symptoms to illnesses, as well as results of clinical trials of treatments for illnesses; and
- When exploring a collection to help with the research for a report or paper, one might need a sequence of papers that each cover part of the path from problem to hypothesis to approach to design to implementation to solution.
Solving such complex needs is best accomplished by humans and computers working together. As users are aided in breaking down complex information needs into parts, as they see what terminology is used for each of the subtopics that emerge during the work on partial solutions, and as they see how subtopics are related, they learn more about the collection and area of interest, as well as gain facility in problem solving. Thus, SSP provides rich support for exploration and discovery.
Our development of SSP was inspired by the early work of Swanson [9]. Originally, Literature-based Discovery [4] was used to find connections between “cures” for “problems” in the medical domain and to discover novel connections between seemingly unrelated topics. New techniques may help us assess how informative a particular connection is in diverse settings [11]. Within the framework of Literature-based Discovery, SSP can support a closed discovery model, where the two topics to connect are given by the user, or an open discovery model, where the two topics are inferred from the query and the collection content.
Thus, SSP can work in two modes. In the first mode, the user provides two subqueries instead of a single, complex, information need statement. In the second mode, the user provides a single statement, and the computer helps with “query splitting,” finding two subqueries. In either case, SSP can proceed with two subqueries, and then work to find the pathways that connect those subqueries. Figure 3 illustrates a simple case of SSP working with the two subqueries parallel and distributed, to support exploration in a focused, educator/learner-oriented, collection of papers on operating systems.
We use the analogy of crossing a shallow stream, where there are stones on either bank, and stepping stones in the stream, so that people can follow different pathways to cross. The two original subqueries are like the stones on the banks, that is, the end stones. SSP begins with those, retrieves documents for each subquery, and then begins its analysis. In Figure 3b, the small circle on the left indicates the documents retrieved for end stone 1, while the large circle on the right indicates the documents retrieved for end stone 2.
Sometimes there are documents retrieved by the system for both original subqueries. For example, Document 1 in Figure 3b, would also be found by a more conventional search engine. Each one of the found documents fully satisfies the user’s information need. Since our collection is small and focused, in keeping with the trend toward focused crawlers, avoiding the distractions resulting from searching over the Web or in collections like INSPEC, we find few documents satisfying both subqueries.
Fortunately, there often are documents retrieved for end stone 1 that are linked to, or can be connected through clustering or other means to, some document retrieved for end stone 2. In Figure 3b, we see that documents 2 and 3 are retrieved, respectively, for end stone 1 and end stone 2, and are related; thus, the pair of documents 23 may satisfy the user’s information need. The same applies for document pair 45.
Looking again at Figure 3a, we see in the middle what we call stepping stones. A stepping stone has a label, and represents a group of documents, each of which can be described in part by that label. That group is related, by citation, or by a high similarity value, to other groups. There are documents about parallel that are related to documents about message passing. There also are documents about message passing that are related to distributed. Hence, message passing is a topic that is a stepping stone connecting the two end stones; we have learned through exploration, and the help of SSP, that this is a bridging topic for the subqueries. The pathway 456 in Figure 3b illustrates how that stepping stone effects a connection between the end stones.
SSP allows exploration at the level of browsing through such networks, seeing what topics connect other topics. Additional insights can be developed about the collection by requesting SSP to expand a pathway. For example, one could request an expansion to find topics that link end stone 1, parallel, to stepping stone 1, message passing. Such exploration can continue as long as there are enough documents in the group and sufficient connectivity among the documents.
Originally, we tested SSP using a collection of documents about operating systems. To ensure its generality, we also tested it with education-oriented collections about two other topics: data mining and information retrieval. Table 1 gives examples of the operation of SSP with each of these three collections—each prepared in connection with our work on CITIDEL, and using CiteSeer to provide citation data. For the sake of brevity, we show only the end stones and a single stepping stone. In the third column we give titles of documents that provide support. For the operating system collection, we see a pair of documents that allow us to address an information need that involves both message passing and remote procedure calls. In particular, Implementing Object-based Distributed Shared Memory relates to two topics: message passing (the first end stone) and area networks (our stepping stone). Also, A Causally Consistent Protocol for Distributed Shared Memory relates to end stone 2, but also is related to documents about area networks. The two documents together reflect a pathway between the end stones, and together provide an answer to the user’s information need.
Eventually, users go beyond exploring topics and want concrete information, that is, they want to see the documents that support a particular stone, and sequences of documents that correspond to a pathway. SSP provides a variety of presentations of the stepping stones, pathways, and documents that relate. Often one can find a single document (see the bottom part of Table 1), pairs of documents (see the top and middle of Table 1), or even triples of documents that can support an information need.
Our usability tests have shown that SSP finds connections between query topics not found by users, as well as saves them time during exploration. Further testing is under way. Readers are invited to use the subqueries shown in Table 1 for the three collections so far prepared, working with our online demo [2], and to explore with SSP.
Query Splitting for SSP
Some users of SSP may not know enough, since they are just beginning to explore, to give the system two initial subqueries. Similarly, when a difficult query is given, for which all current search systems fail [1], it would be very helpful to split that query into two parts with which SSP can operate and yield good results.
A number of difficult queries have been identified in connection with analysis of failures when searching against large collections, as called for in an annual competition for information retrieval techniques and systems—TREC (Text REtrieval Conference) [10]. We tested our query splitting methods using queries and judgments of which documents are relevant to those queries [10].
We picked queries identified as difficult to answer by a majority of retrieval systems [1, 10]. Extensive experimentation with those queries has allowed us to pick the best method among the several we considered, and to tune its parameters to further improve its performance [12].
Table 2 illustrates our query splitting results for some of those difficult queries [12]. The first column is a short summary of the query; column 2 gives a more detailed description. Our algorithm returns pairs of subqueries, shown in the last two columns. These are found through a query expansion process that starts with the original query, along with documents retrieved for those queries, and then tries to separate the many terms associated, into groups about different topics. We have explored a variety of approaches as well as indicators to assess if the subqueries are different but related [12]; the sample of results shown indicates that plausible splits are possible.
Conclusion
Exploratory search takes many forms and can be aided by a variety of methods. Our work with digital libraries and information retrieval has included tests of such methods in their use with collections in the computing field, as well as large standard collections connected with TREC [10]. Our experiments have shown that visualization is a key ingredient in many of the recipes for exploration. Browsing through results clusters, when Carrot2 [7] is combined with our CITIDEL system, can help. More support is provided by integrated searching, browsing, and visualizing, as with CitiViz [8]. For more complex information needs, visualization connected with our Stepping Stones & Pathways [2, 3] approach, along with query splitting for complex and/or difficult queries [12], can leverage even more sophisticated algorithms to find sets of documents that in combination support more advanced types of exploration.
Figures
Figure 1. Carrot2 clustering results for “digital library.”
Figure 2. Visualization of “object-oriented programming” results through CitiViz.
Figure 3. An illustration of Stepping Stones & Pathways software: (a) the top portion is a simplification of a screen dump, with labels added, when searching the operating system collection with the pair of topics parallel and distributed. It shows three stepping stones: message passing, communication overhead, and shared memory. (b) Our simplified summary of another part of the results, for the first stepping stone. In the Venn diagram the left circle represents end stone 1, the right circle represents end stone 2, and the yellow circle at the top represents the first stepping stone, message passing. The table gives titles for the documents shown in the Venn diagram.
Join the Discussion (0)
Become a Member or Sign In to Post a Comment