Digital libraries can be viewed as infrastructures for supporting the creation of information sources, facilitating the movement of information across global networks, and allowing effective and efficient interaction among knowledge producers, librarians, and information and knowledge seekers. Typically, a digital library is a vast collection of objects stored and maintained by multiple information sources, including databases, image banks, file systems, email systems, the Web, and other methods and formats. Often, these information sources are heterogeneous in terms of how the objects are stored, organized, and managed, and the platforms on which they reside. Moreover, the information sources are dynamic in the sense that they may be either included in or removed from the digital library system. Furthermore, digital library objects are composite multimedia objects comprising different media components, including text, video, images, and audio. Therefore, the challenge is to provide users with the ability to seamlessly and transparently access digital library objects in spite of the heterogeneity and dynamism among the information sources, and the composite multimedia nature of the objects. To accomplish this, the problems of heterogeneity and system integration (SI) must first be resolved.
Although several integration techniques have been proposed by the research community, especially the database and agent communities, these techniques cannot be easily adapted to digital library environments. This is because, while database integration primarily deals with structured textual data, SI in digital libraries requires the ability to deal with massive numbers of multimedia objects. In this article, we identify research challenges involved with SI that are specific to digital libraries, discuss the available solutions that meet the digital library requirements and the digital library prototypes, and describe the environmental digital library system under development at the Center for Information Management, Integration and Connectivity (CIMIC) at Rutgers University.
Issues in SI
In general, system integration includes pre-integration, identification of schema matching, schema integration, and source-data integration sub-processes [6, 10].
The pre-integration process deals with transforming the data models used within the underlying information sources into a “common” model. A common model that can represent all the models at the underlying sources is essential to resolving integration problems due to the heterogeneity in data models such as relational, object-oriented, and hierarchical formats. Therefore, the first task in the pre-integration process is to identify the different data models and extract their schema. Since this is normally a one-time task, manual efforts are often used. The next task is to select a data model as the common model. One important consideration in this selection process is that it must be at least as expressive as any of the underlying models. The final task is to transform every data model of the underlying sources into the common model.
The semantics of the information may differ from one source data model to the other, and moreover, there may exist differences in schema as a result of designers’ views of the world. This level of heterogeneity is addressed in the schema-matching process. Here correspondence of similar concepts among the different schema is identified. A dictionary of names, synonyms, and homonyms is used to help identify such correspondence. An ontology system is typically used to store and manage the dictionary. Semantic description of concepts associated with the source schema can be manually evaluated. In some cases, the schema has very little semantic description associated with it. Sometimes, the underlying source schema need to be modified to better serve the integration process. One of the major hurdles in identifying correspondence of similar concepts is that it normally requires involvement of multiple domain experts.
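The dictionary-based matching step described above can be illustrated with a minimal Python sketch. The synonym table, the attribute names, and the helper functions are all hypothetical, chosen only for illustration; a real system would draw them from an ontology and would still need domain experts to review homonyms such as "name".

```python
# Hypothetical synonym dictionary: each canonical concept maps to the
# names that different sources might use for it (illustrative only).
SYNONYMS = {
    "author": {"author", "creator", "writer"},
    "title": {"title", "name", "heading"},
    "date": {"date", "published", "pub_date"},
}

def canonical(term):
    """Return the canonical concept for a schema term, or None if unknown."""
    t = term.lower()
    for concept, names in SYNONYMS.items():
        if t in names:
            return concept
    return None

def match_schemas(schema_a, schema_b):
    """Pair attributes of two schemas that map to the same concept."""
    by_concept = {}
    for attr in schema_b:
        concept = canonical(attr)
        if concept is not None:
            by_concept.setdefault(concept, []).append(attr)
    matches = []
    for attr in schema_a:
        concept = canonical(attr)
        for other in by_concept.get(concept, []):
            matches.append((attr, other, concept))
    return matches

books = ["Title", "Creator", "ISBN"]
films = ["name", "writer", "runtime"]
pairs = match_schemas(books, films)
```

Attributes with no dictionary entry, such as ISBN here, are simply left unmatched, which is where manual evaluation would take over.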
The schema integration process starts with identifying the portion of the source schema to be included in the integrated schema, and then defining mappings between the integrated schema and the corresponding portion of the source schema. Issues such as conflict resolution can arise during the mapping process. Conflicts can be of several types: classification, structural, descriptive, model-heterogeneity, data-metadata, and data conflicts [10]. Classification conflicts emerge when the data objects describe different real-world elements. For example, a relational table in one source may describe both elements A and B, while the same relational table in a different source may describe only A. Structural conflicts arise when two objects use different levels of constructs. For example, a schema element may be described as a class in one source schema, while the same element is described as an attribute in a different source schema. Descriptive conflicts are due to differences in properties such as names and keys. With data-metadata conflicts, data in one source may be described as metadata in another source. Finally, data conflicts are associated with the instances of data: data in one source may be recorded incorrectly in another source. Normally, this arises from errors in data entry or from the existence of multiple versions. To ensure the integrated schema is at least as expressive as the individual source schema, one may simply choose the integrated schema to be the union of all the source schema. One benefit of this union is that, when a source schema changes, modification efforts can be isolated to a specific portion of the integrated schema without affecting the schema of other sources. At the other extreme, the integrated schema can be made tightly integrated [6], where as many overlapping portions of the source schema as possible are matched and integrated.
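The union approach can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the source names and attributes are hypothetical, and each integrated attribute is qualified by its source so that the mapping back to the source schema is trivial.

```python
def union_schema(sources):
    """Build an integrated schema as the union of all source schemas.

    Each integrated attribute is qualified by its source, so a change in
    one source only touches that source's portion of the result.
    """
    integrated = {}
    for source, attrs in sources.items():
        for attr in attrs:
            # Mapping: integrated attribute name -> (source, source attribute)
            integrated[f"{source}.{attr}"] = (source, attr)
    return integrated

sources = {"novels": ["title", "author"], "movies": ["title", "director"]}
integrated = union_schema(sources)
```

A tightly integrated schema would instead merge the two title attributes into one, at the cost of having to revisit that merged portion whenever either source changes.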
SI Issues Specific to Digital Libraries
Integration issues specific to digital libraries are a direct result of their characteristics. These include:
- Massive amount of data;
- Structured, semi-structured, and unstructured data;
- Multimedia nature of data;
- Frequent modification of information sources; and
- The need to cater to different user characteristics, preferences, and capabilities.
Massive amount of data. As a result of the vast amount of data in digital libraries, integration must be scalable.
Structured, semi-structured, and unstructured data. The type of data in digital libraries ranges from structured to semi-structured to unstructured data. While structured data has been well understood and many integration systems are based on such data, semi-structured and unstructured data have not been dealt with in great detail and pose additional challenges. Semi-structured data, such as HTML documents, include lists, sets, variants, and nested constructs that are nonexistent in structured data. In choosing the common model for integration, such constructs would naturally lead to the object-oriented data model. However, object-oriented models fail to meet other requirements—they cannot cope with rapid changes to source schema. Other characteristics of semi-structured data include strict ordering and mandatory and optional existence of data (for example, the use of OL and BODY markups in an HTML document). Perhaps unstructured data is the most challenging type to deal with—normally, it is stored as a collection of flat files.
To resolve the problem of modeling the different types of data, one may use a semi-structured model such as the Stanford Object Exchange Model (OEM) used in the TSIMMIS project. In OEM, a semi-structured model has been used to represent queries and responses among integration components. Structured data can be easily represented in a semi-structured model, for example, a row can be represented as a sub-item of the corresponding table object. Unstructured data can be represented in the model as-is, where the entire content of unstructured data is represented as a single item in a semi-structured model. Alternatively, to better represent unstructured data, any potential structures within the data can be exploited. For example, in the body of email messages, the first two lines may be considered an author’s greeting and the last few lines the author’s signature. However, this structuring is specific to an application.
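As a rough illustration of these representations, the following Python sketch uses nested labeled objects loosely in the spirit of a semi-structured model. It is not the actual OEM syntax; the label/value layout and the function names are assumptions made for this example.

```python
def row_to_item(row):
    """Represent one relational row as an object made of labeled atoms."""
    return {"label": "row",
            "value": [{"label": k, "value": v} for k, v in row.items()]}

def table_to_object(name, rows):
    """Represent a table as an object whose sub-items are its rows."""
    return {"label": name, "value": [row_to_item(r) for r in rows]}

def unstructured_to_object(name, text):
    """Represent unstructured content as-is, as a single atomic item."""
    return {"label": name, "value": text}

table = table_to_object("books", [{"title": "Dune", "year": 1965}])
memo = unstructured_to_object("memo", "Meeting moved to Friday.")
```

Exploiting structure inside the unstructured item, such as splitting a greeting from a signature, would replace the single text value with further labeled sub-items, but only for applications where that structure is known.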
Multimedia data. Integrating multimedia objects can be achieved by integrating the metadata, with pointers to the actual multimedia object, and by mapping the semantics of multimedia metadata with the schema of other information sources. Unfortunately, extracting semantics from multimedia objects is not trivial. Although some image databases have been well cataloged by creating extensive metadata, a vast amount of multimedia objects have no semantic metadata, such as multimedia electronic catalogs, images used for Internet advertising, and personal collections. In the absence of metadata, establishing correspondence among multimedia objects from multiple sources has to be accomplished either manually or by means of specialized metadata extraction tools.
In some cases, access to multimedia objects is provided through special-purpose interfaces to textually or graphically interpret the binary files. In such a case, SI must be mediated via these interfaces. In other words, integration involves the capability to automatically communicate with the specialized interface and interpret the generated results.
Many times, only a portion of the multimedia object is of interest. We illustrate this case with the following example. Consider the availability of two information sources, one maintaining a collection of all novels and the other a collection of popular movies. A user may wish to access the climax portion of a selected novel and the corresponding scene in the movie (assuming a movie version of the novel exists). In such a case, integration requires extraction of part of the novel and the corresponding movie portion. Such extraction can be performed by the source, where metadata for sections of multimedia objects is maintained. Alternatively, the system can supply the users with the entire multimedia object. Because multimedia objects are typically large in size, extracting and delivering the entire object across the communication network potentially degrades the response time to users’ requests. Frequently requested multimedia objects can be cached at the integration layer. Since the cache size is limited, one needs to develop optimization strategies to determine which objects can reside in the cache and which ones should be removed.
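One simple replacement strategy for such a cache is least-recently-used (LRU) eviction under a size budget. The Python sketch below is an assumption-laden illustration rather than a recommended policy: the class name, the byte-based capacity, and the object identifiers are hypothetical, and a production cache would likely also weigh request frequency and retrieval cost.

```python
from collections import OrderedDict

class ObjectCache:
    """Size-bounded cache evicting the least recently used object first."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.items = OrderedDict()  # object id -> bytes payload

    def get(self, object_id):
        if object_id not in self.items:
            return None
        self.items.move_to_end(object_id)  # mark as most recently used
        return self.items[object_id]

    def put(self, object_id, payload):
        if len(payload) > self.capacity:
            return False  # too large to cache at all
        if object_id in self.items:
            self.used -= len(self.items.pop(object_id))
        while self.used + len(payload) > self.capacity:
            _, evicted = self.items.popitem(last=False)  # drop oldest entry
            self.used -= len(evicted)
        self.items[object_id] = payload
        self.used += len(payload)
        return True

cache = ObjectCache(capacity_bytes=10)
cache.put("clip-a", b"aaaa")
cache.put("clip-b", b"bbbb")
cache.get("clip-a")            # clip-a becomes the most recently used
cache.put("clip-c", b"cccc")   # evicts clip-b, the least recently used
```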
Several approaches to extracting semantics from multimedia objects, especially images and videos, exist today. However, integrating multimedia objects still relies heavily on pre-extracted metadata that is stored in textual format. This means newly created multimedia objects cannot be made immediately accessible to the integration systems.
Modification of sources. Changes to the information sources are not limited to data; the schema may change as well. Schema change is a relatively frequent occurrence in many environments. For example, in the case of scientific databases, as new advances and experimental procedures are discovered, the corresponding database schema must be modified to better support the new procedures. It appears that, on average, there are two to three changes to the schema of information sources per year [6]. Changes to schema at the underlying information sources can cause the integration components to fail, because these components rely on knowledge of the schema of the underlying sources. Although integration components can be modified manually, doing so becomes slow and expensive when a relatively large number of sources is involved. Thus, semi-automatic, if not automatic, modification of integration components is desirable. This problem is more challenging when the sources are non-cooperative, because such a source may not announce schema changes, may not make the new schema publicly available, or may provide it in a form not directly understandable by the integrating system. In such cases, one has to rely on a human to extract the changes to schema.
Change detection has been studied extensively by both the commercial and academic sectors. For structured data, change detection can be accomplished either by deploying database triggers or by evaluating database logs. Zhang et al. and Chawathe et al. use a tree structure to represent semi-structured documents and compare two trees to detect differences between two successive versions. Ball et al. view semi-structured documents as containing sequences of sentences and sentence-breaking markups, and detect differences between two successive versions by comparing sentences and sentence-breaking markups. There is, however, little work related to detecting changes to schema. Traditionally, changes to schema were considered to be infrequent, so manual approaches to detecting schema changes and modifying the integration components were acceptable. Many of the sources were also considered to be cooperative in the sense that they propagate schema changes to the integration components in a format that can be easily understood. Adam et al. [1] use a graph structure to represent schema and view the content of a semi-structured document as an instance of its schema. In their approach, change detection to data is performed during parsing of the documents, and change detection to schema is accomplished by collecting changed semi-structured documents and inferring the partial schema reflected in them.
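To give a feel for tree-based change detection, the Python sketch below compares two versions of a labeled tree child by child. It is a deliberately naive positional comparison, not the algorithm of any of the cited works; the tuple-based node layout and the change labels are assumptions for this example.

```python
def diff_trees(old, new, path="/"):
    """Report differences between two versions of a labeled tree.

    A node is a (label, children) tuple. Children are compared by
    position, which is a simplification of real tree-matching methods.
    """
    changes = []
    old_label, old_kids = old
    new_label, new_kids = new
    here = path + old_label
    if old_label != new_label:
        changes.append(("relabel", here, new_label))
    for i in range(max(len(old_kids), len(new_kids))):
        child_path = here + "/"
        if i >= len(old_kids):
            changes.append(("insert", child_path + new_kids[i][0], None))
        elif i >= len(new_kids):
            changes.append(("delete", child_path + old_kids[i][0], None))
        else:
            changes.extend(diff_trees(old_kids[i], new_kids[i], child_path))
    return changes

v1 = ("html", [("body", [("h1", []), ("p", [])])])
v2 = ("html", [("body", [("h2", []), ("p", []), ("ul", [])])])
changes = diff_trees(v1, v2)
```

Because children are matched by position, a single insertion at the front of a list would be reported as a series of relabels; the tree-matching methods mentioned above exist precisely to avoid that kind of noise.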
Catering to different users’ characteristics, preferences, and capabilities. SI needs to cater to users with varying characteristics, preferences, and capabilities. Users wishing to access digital library objects may differ in physical, technical, linguistic, and domain-expertise capabilities; may have varying characteristics such as mobility, interests, and preferences; and may use different information appliances such as a PC, PDA, mobile phone, or Web TV. For example, for a user equipped with a text-only display but requesting a multimedia object consisting of text, image, audio, and video, the system should extract and deliver only the textual data, since extracting and delivering other modalities would only lead to network congestion and unnecessary processing. Additionally, prior to extracting data from the sources, the integration system can evaluate the user characteristics, preferences, and capabilities and request only the suitable parts of the object.
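The text-only example can be expressed as a small filter applied before anything is requested from the sources. In this Python sketch, the object layout (a mapping from media type to a component reference) and the capability set are hypothetical.

```python
def select_components(obj, capabilities):
    """Keep only the media components the client appliance can render."""
    return {media: ref for media, ref in obj.items() if media in capabilities}

lecture = {"text": "notes.txt", "image": "slide.png",
           "audio": "talk.wav", "video": "talk.mpg"}
text_only = select_components(lecture, {"text"})
```

Only the components that survive the filter would then be extracted and delivered, which avoids moving image, audio, and video data that the appliance cannot present.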
Currently, work in this area has focused primarily on identifying end-user preferences and the capabilities of their information appliances. For identifying end-user preferences, several commercial tools are available that capture user profiles and interests based on users’ past query history. Many research prototypes exist for identifying client appliance capabilities. Mohan et al. provide different object modalities and fidelities to accommodate diverse client appliance capabilities. For example, for a client with no audio capability, the textual version of the audio object is delivered. The necessary conversion from one modality to another is done either in advance or upon request. Adam et al. propose an approach that enables multimedia digital library objects to manifest themselves based on client appliance capabilities. (Details are presented later in this article.)
In contrast to traditional libraries, many digital libraries maintain multiple versions of a document [4] and allow multiple authors to edit them. This poses additional challenges to SI. Moreover, multiple users accessing different digital library objects give rise to the issue of security. Therefore, to ensure that only authorized users gain access to revise, annotate, or simply view objects, digital libraries must have scalable security systems in place. In addition, digital libraries may need to provide anonymity to protect the identity of users while they access digital library objects. With continuous advances in technology, another issue librarians encounter is that they have to deal with new interfaces to access, create, and organize digital library materials.
Approaches to SI in Digital Libraries
One of the main goals in SI is to provide the capability to communicate with heterogeneous sources. In this section, we discuss three popular approaches to resolving the problems, namely, CORBA (a commercially available tool), mediators, and agents. It is important to note that these three approaches are not orthogonal in the sense that a mediator may employ CORBA and an agent may use mediators.
CORBA approach. The Object Management Group (OMG) was formed in 1989 to develop standards for application development within heterogeneous environments. The Common Object Request Broker Architecture (CORBA), one of the components of Object Management Architecture (OMA), came into existence because of lack of programming interfaces and packages that can deal with heterogeneous platforms [11]. The main components of OMA include object services, common facilities, domain interfaces, application interfaces, and the object request broker, as depicted by Figure 1.
The part of OMA directly related to integration is the CORBA component. CORBA consists of numerous features, including the ORB Core, the Interface Definition Language (IDL), stubs, skeletons, and others. The ORB Core is responsible for delivering requests to the object implementation and responses from objects to the client requesting the service. The main feature of the ORB Core is its abstraction of the object implementation. When requesting services, the client does not need to know where the object is located, how the object is implemented (such as which programming language was used), the state of the object, or how to communicate with the object (such as via TCP/IP, RPC, shared memory, or other methods). All the client needs to worry about is its own application and how to specify the objects of interest, which is done through object references. Compiling an IDL interface definition generates two components: a stub and a skeleton. The stub is responsible for creating and issuing client requests, while the skeleton is responsible for delivering requests to the object implementation. The stub and skeleton are specific to the object implementation.
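The division of labor between stub and skeleton can be mimicked in plain Python. The sketch below is not CORBA and uses no CORBA API: JSON stands in for request marshaling, a direct function call stands in for the ORB, and the CatalogImpl class and its lookup method are hypothetical.

```python
import json

class CatalogImpl:
    """Object implementation living on the server side."""
    def lookup(self, title):
        return {"title": title, "available": True}

class Skeleton:
    """Server side: unmarshals a request and invokes the implementation."""
    def __init__(self, impl):
        self.impl = impl
    def dispatch(self, message):
        request = json.loads(message)
        method = getattr(self.impl, request["op"])
        return json.dumps(method(*request["args"]))

class Stub:
    """Client side: marshals a call so the client never sees the transport."""
    def __init__(self, transport):
        self.transport = transport  # any callable taking and returning str
    def lookup(self, title):
        message = json.dumps({"op": "lookup", "args": [title]})
        return json.loads(self.transport(message))

skeleton = Skeleton(CatalogImpl())
catalog = Stub(skeleton.dispatch)   # in-process stand-in for the ORB
reply = catalog.lookup("Dune")
```

The client code calls catalog.lookup as if the object were local; swapping the transport callable for a network connection would not change the client, which is the abstraction the ORB Core is described as providing.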
In regard to integration, object implementation can be used to define interfaces for interacting with the data sources. Even though CORBA provides abstraction of the implementation of services at the object implementation (services provided by the data sources), the task of integrating multiple data as responses from multiple object implementations must be performed by the client application. Thus, the client application needs to know, to some degree, the metadata of the responses of each object implementation. Furthermore, since IDL is specific to object implementation, changes to the services provided by the data source require changes to the object application and propagation of the updated stub and skeleton. This leads to a complex and customized client application. The mediated approach, discussed next, attempts to address these issues.
Mediated approach. The mediated approach utilizes a component called mediator to perform integration. The general architecture of this approach, comprising the information sources, the wrappers, the mediators, and the user interface is depicted in Figure 2. A common query language, such as KQML, is typically used as a means of communication among the components.
The function of the wrapper is to interact with its corresponding information source, converting mediator queries represented in the common language into queries native to the source, and converting native answers back into the common language. To perform its task, a wrapper must have knowledge of the underlying source. The complexity of a wrapper depends on the amount of cooperation from the source itself. For example, a source can be cooperative by performing many of the processing tasks involved in answering the wrapper query. At the other extreme, a source may be non-cooperative, in which case, upon receiving an answer to a query from the source, the wrapper has to perform additional processing before sending it to the mediator. To help deal with the heterogeneous sources, the CORBA approach can be used. If CORBA is employed, the wrappers do not need to deal with the different interfaces of the sources, but need to focus only on formatting the response to a query into the common format used within the integration components.
The function of the mediator is to accept users’ queries and transform them into the common model. Each query is later broken up into smaller sub-queries. Subsequently, each sub-query is sent to the appropriate source via the wrapper. Upon receiving answers to sub-queries, the mediator combines and integrates these answers to form the complete answer and presents it to the users. To perform its task, the mediator must have the knowledge of the sources and their schema to determine which sources provide what information. Whenever more than one mediator must be stored and maintained on heterogeneous systems, CORBA can also be used to hide the complexity of the different systems.
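The interplay of mediator and wrappers can be sketched in Python as follows. This is a toy model under stated assumptions: the sources are in-memory lists, a query is a single field and value pair, and the class names, field maps, and sample records are all hypothetical.

```python
class Wrapper:
    """Translates a common-format sub-query into a lookup on one source."""

    def __init__(self, name, records, field_map):
        self.name = name
        self.records = records        # stand-in for the native source
        self.field_map = field_map    # common field -> native field

    def provides(self, field):
        return field in self.field_map

    def query(self, field, value):
        native = self.field_map[field]
        hits = [r for r in self.records if r.get(native) == value]
        # Convert native records back into the common format.
        inverse = {n: c for c, n in self.field_map.items()}
        return [{inverse[k]: v for k, v in r.items() if k in inverse}
                for r in hits]

class Mediator:
    """Routes a query to every wrapper that can answer it and merges results."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def ask(self, field, value):
        answer = []
        for w in self.wrappers:
            if w.provides(field):
                for rec in w.query(field, value):
                    answer.append({"source": w.name, **rec})
        return answer

novels = Wrapper("novels", [{"ttl": "Dune", "auth": "Herbert"}],
                 {"title": "ttl", "author": "auth"})
movies = Wrapper("movies", [{"name": "Dune", "dir": "Lynch"}],
                 {"title": "name", "director": "dir"})
result = Mediator([novels, movies]).ask("title", "Dune")
```

The mediator's knowledge of which source provides which information is reduced here to the provides check; in practice it would come from the integrated schema and its mappings.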
One drawback with the mediated approach is that the mediator component does not have the capability to search for new sources or discover potential sources that should be included in the integration. For this reason, agent technology has been introduced into the mediator. However, current work on mediator agents focuses only on information search and response, with little work on integration.
Agent-based approach. This approach concerns the vast amount of not only data but also services offered by digital libraries. Services can be dynamically added to and removed from a digital library, so a digital library system must be adaptable and scalable. An agent is a software or hardware component that has local decision-making capabilities and performs specialized tasks on behalf of the end user. An agent has knowledge of how to perform its specialized task. Agents can interact with the end users, with other related agents, and with the information sources. There are many types of agents, including collaborative, learning, interface, information, and mediator agents. Agent-based integration systems generally comprise interface agents, mediator agents, and source agents. Figure 3 depicts the general architecture of agent systems.
Mediator agents [12] interact with interface agents, source agents, and other mediator agents. There are many types of mediator agents in which each type performs specific intermediate tasks, including accepting user queries, evaluating user profiles if any, locating the appropriate source agents based on user queries, sending queries to appropriate source agents, monitoring the query progress, formatting and integrating responses from source agents, and communicating and working together with other mediator agents to accomplish a task. Though there is little work in integration being performed, mediator agents can be extended to also include the responsibility for resolving conflicts and avoiding redundant tasks. Ontology can be used to resolve heterogeneity in terms and definitions used among the agents.
In order to send a query to the appropriate agents, a repository of agent descriptions and services is maintained. To locate desired services, an agent can consult the repository. Alternatively, each agent may have the capability to describe its services and to send that description to other agents in a form they can understand. To send a query for processing, a mediator agent does not necessarily have to send it to the appropriate agent: it can send it to a neighboring agent. If this neighboring agent cannot fulfill the query, it forwards the query to the next agent, and so on. Upon receiving a response to the query, the original agent updates its knowledge base. This way, when it submits the same type of query the next time, it can direct the query to the appropriate agent [5].
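The forward-and-learn behavior can be sketched as follows in Python. The sketch assumes each agent has a single neighbor and that a query is identified only by its type; the Agent class, its attributes, and the service names are hypothetical and do not correspond to any particular agent framework.

```python
class Agent:
    """Agent that answers queries it can serve and forwards the rest."""

    def __init__(self, name, services):
        self.name = name
        self.services = set(services)
        self.neighbor = None
        self.known = {}  # query type -> agent learned to handle it

    def handle(self, query_type, visited=None):
        """Return the agent that serves query_type, or None if none is found."""
        if visited is None:
            visited = set()
        if query_type in self.services:
            return self
        visited.add(self.name)
        # Prefer an agent learned from earlier queries over blind forwarding.
        target = self.known.get(query_type) or self.neighbor
        if target is None or target.name in visited:
            return None
        found = target.handle(query_type, visited)
        if found is not None:
            self.known[query_type] = found  # update the knowledge base
        return found

a, b, c = Agent("a", []), Agent("b", ["text"]), Agent("c", ["maps"])
a.neighbor, b.neighbor = b, c
first = a.handle("maps")    # forwarded a -> b -> c
second = a.handle("maps")   # sent straight to c using the learned entry
```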
Interface agents interact with the end users. Their function is to accept a user query, transform it into the language used within the system, and send the transformed query to the appropriate mediator agents. When sending a user query to mediator agents, interface agents may submit a user profile as well, so that mediator agents can search for information that corresponds to user preferences. The function of source agents is similar to that of the wrappers mentioned earlier.
In the remainder of this section, we briefly discuss major digital library initiatives (DLI) that deal with issues in integration to some degree. DLI testbeds under the Stanford Digital Library project and the University of Illinois Digital Library project make use of CORBA technology to provide access to heterogeneous sources. The Stanford Digital Library Project is aimed at resolving the issues of heterogeneity of information and services [9]. Based on CORBA technology, the Information Bus (InfoBus) is the core system of the project and provides uniform access to heterogeneous information sources and services. An interoperability protocol, called InterOp [9], allows flexible control of information flow, so that common information exchange protocols such as Z39.50 and HTTP can work with both stateful and stateless servers. Similarly, the University of Illinois Digital Library project is aimed at developing a large-scale testbed [8]. Access to heterogeneous information sources is achieved through the use of CORBA architecture with a stateful gateway so that client complete-search sessions can be implemented. Users searching for desired information may have the concept in mind, but not the correct terms to supply to the system. In that case, the system suggests comparable terms the user may recognize.
An example of a mediated digital library system is the Alexandria Digital Library (ADL) system. The ADL architecture [7] comprises three layers: the client, the middleware, and the servers and ingest facilities. The middleware layer consists of several sub-components, including access control to support client host-based and user-based access policies, query and retrieval mappings to transform user queries into SQL and vice versa, and database access that sends queries in SQL to the underlying servers.
The University of Illinois Digital Library project and the University of Michigan Digital Library project use agent-based systems. The University of Michigan Digital Library (UMDL) [5] utilizes highly specialized information agents to perform information retrieval across heterogeneous sources. Each agent has two properties: autonomy (its own reasoning to fulfill the task) and negotiation (the capability to negotiate with other agents). The UMDL architecture consists of the user interface agents (UIAs), the mediator agents, the collection interface agents (CIAs), and the collections or information sources. UIAs maintain user profiles so that they can cater to user preferences. The mediator agents are enhanced with sub-classes of agents: registry agents to capture the address and content description of every CIA, query-planning agents to route queries to the appropriate CIAs, and facilitator agents to facilitate negotiation among mediator agents.
DigiTerra: An Environmental Digital Library
An example of a digital library integration problem we are currently trying to resolve is presented here. We first describe the nature of the problem and the heterogeneity of the data and sources, followed by a brief discussion of our architecture and specific approaches to resolving some of the issues mentioned earlier.
Rutgers CIMIC is in the process of developing an Environmental Digital Library called DigiTerra under the sponsorship of NASA and the Hackensack Meadowlands Development Commission (HMDC), a New Jersey State government agency. DigiTerra is a space- and land-based system with the goal of supporting continuous land monitoring, fire detection, water and air quality testing, urban planning, and the development of outreach educational materials, as well as research and instructional activities in related areas of science.
DigiTerra will facilitate the collection, assimilation, cataloging, dissemination, and retrieval of a vast array of environmental data that includes images from a variety of space-borne satellites, ground data from continuous monitoring weather stations, water and air quality, maps, reports and data sets from federal, state, and local government agencies, and serve diverse user communities ranging from scientists to K12 school children to environmental policymakers.
Our proposed DigiTerra is envisioned as a multi-layered system as depicted in Figure 4. Each of these layers engenders a unique set of research issues being addressed in this project. We limit our discussion here to only those layers relevant to this article.
The integration and interoperability layer is concerned with the collection and assimilation of a vast array of environmental data. The challenge is to develop reliable, inexpensive, and non-intrusive mechanisms that neither require extensive changes to the underlying data sources nor impose rigid standards on them. Thus, we have adopted the mediated approach and have chosen XML as the common language used by the mediators and wrappers to represent queries and responses. We chose XML because (1) it lends itself to automatic generation of wrappers through its transportability between heterogeneous information systems in a neutral and system-amenable manner, and (2) many XML parsers, tools, and libraries are readily available. Application programs require the capability to parse and extract the content of the queries, answers, and specifications. By using the readily available XML parsers and tools, the implementation effort can be reduced significantly.
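As an illustration of how readily available XML tooling reduces implementation effort, the Python sketch below builds a query and parses a response with the standard library's xml.etree.ElementTree module. The element and attribute names (query, condition, response, record) and the sample data are invented for this example and are not the actual DigiTerra message format.

```python
import xml.etree.ElementTree as ET

def build_query(source, field, value):
    """Serialize a mediator query as an XML string (hypothetical format)."""
    query = ET.Element("query", {"source": source})
    cond = ET.SubElement(query, "condition", {"field": field})
    cond.text = value
    return ET.tostring(query, encoding="unicode")

def parse_response(xml_text):
    """Extract records from a wrapper response as a list of dicts."""
    root = ET.fromstring(xml_text)
    return [{child.tag: child.text for child in rec}
            for rec in root.findall("record")]

query_xml = build_query("weather-station", "site", "Meadowlands")
response_xml = (
    "<response>"
    "<record><site>Meadowlands</site><temp_c>21.5</temp_c></record>"
    "</response>"
)
records = parse_response(response_xml)
```

Neither function contains any hand-written parsing logic, which is the practical benefit of point (2) above.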
The ontology layer enables users with diverse backgrounds to query across multiple domains. Such a layer must be able to cater to users with diverse backgrounds by offering broader, more general, ontologies that are interlinked to cover many domains.
The data warehousing/data-mining layer provides fast and efficient access to the integrated data, efficient data analysis, and historical, temporal and chronological views without interrupting the operational processing at the underlying information sources. The concept indexing and content-based retrieval layer provides efficient retrieval by suitably organizing the multimedia data based on the concepts associated with the objects.
The universal access [2] layer provides methodologies to cater to diverse users’ characteristics, preferences, and capabilities, as described earlier. Our approaches transparently adapt to changes in current and future information appliances and networking environments, in which objects have built-in intelligence so they can automatically manifest themselves to cater to different users’ capabilities and characteristics. Specific features include the ability to author objects suitable for automatic manifestation, to detect, identify, and manage user capabilities and preferences/profiles, to accommodate the user mobility, and to automatically detect the information appliance capabilities. The security and privacy layer provides support for security of objects through suitable access control, watermarking, and secure networking technologies for protecting the privacy of users. This layer allows specification of authorizations based on users’ qualifications and characteristics rather than the users themselves and based on the contents within an object rather than the object identifier, much like those in the traditional libraries [3].
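Authorization based on qualifications and content, rather than on user and object identifiers, can be sketched as a rule check in Python. The rule format, the qualification names, and the concept tags below are assumptions made for illustration and do not describe the actual security layer.

```python
def is_authorized(user, obj, rules):
    """Return True if any rule links a user qualification to the object's content."""
    return any(
        rule["qualification"] in user["qualifications"]
        and rule["concept"] in obj["concepts"]
        for rule in rules
    )

rules = [{"qualification": "hydrologist", "concept": "water-quality"}]
scientist = {"id": "u1", "qualifications": {"hydrologist"}}
student = {"id": "u2", "qualifications": {"k12-student"}}
report = {"id": "doc-7", "concepts": {"water-quality", "wetlands"}}
```

Note that neither the user id nor the object id appears in the rule, so new users and new objects are covered without writing new authorizations.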
Conclusion
SI is a central concern if digital libraries are to scale to an international level. This is because they are typically constructed as collections of independently evolved components, operated by autonomous organizations, yet must permit efficient and convenient interoperation of all components.
While problems in SI of heterogeneous databases have been well investigated within the database communities, SI in digital libraries presents a new set of challenges, including heterogeneous and autonomous information sources storing collections of objects that are of multimedia nature as well as servicing end users with diverse backgrounds, preferences, and characteristics.
New techniques and the extension of current ones are being explored to address the problem of SI in digital libraries. Several large-scale digital library system prototypes are being developed and tested with major funding initiatives by agencies such as NSF, DARPA, NIH and others, as well as by universities and commercial and government organizations. This article is an attempt to provide insight into the various challenges posed by SI and the efforts made to meet these challenges.