In recent years, the number of digital information storage and retrieval systems has increased immensely. These information sources are generally interconnected via some network, and hence the task of integrating data from different sources to serve it up to users is an increasingly important one [10]. Applications that could benefit from this wealth of digital information are thus experiencing a pressing need for suitable integration tools that allow them to make effective use of such distributed and diverse data sets.
In contrast to the on-demand approach to information integration, the approach of tailored information repository construction, commonly referred to as data warehousing, is characterized by the following properties (a minimal sketch of the three phases appears after the list):
- At setup time, relevant information is extracted from different information sources on the network, transformed and cleaned as necessary, merged with information from other sources, and then loaded into a centralized data store—the data warehouse.
- During query processing time, queries posed against the system are directly evaluated against the data warehouse without further interaction with the original sources.
- During operation time, modifications at the sources are filtered for relevance and are then propagated in some manner to update the data warehouse.
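To make these three phases concrete, the following is a minimal Python sketch; the source interface (`source.extract()`), the cleaning rule, and the single-table warehouse schema are illustrative assumptions, not features of any particular product.

```python
import sqlite3

# Setup time: extract from each source, clean/transform, merge, and load.
def initial_load(warehouse, sources):
    warehouse.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
    for source in sources:
        for raw in source.extract():          # hypothetical source interface
            warehouse.execute("INSERT INTO sales VALUES (?, ?)",
                              clean_and_transform(raw))
    warehouse.commit()

def clean_and_transform(raw):
    # Illustrative cleaning step: normalize region names, cast amounts.
    return (raw["region"].strip().upper(), float(raw["amount"]))

# Query time: queries are evaluated against the warehouse only, never the sources.
def total_by_region(warehouse):
    return warehouse.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall()

# Operation time: relevant source modifications are propagated to the store.
def propagate(warehouse, changes):
    relevant = [c for c in changes if c["table"] == "sales"]  # relevance filter
    warehouse.executemany("INSERT INTO sales VALUES (?, ?)",
                          [(c["region"], c["amount"]) for c in relevant])
    warehouse.commit()
```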
Data warehousing (materialized views) offers higher availability and better query performance because all information can be retrieved from a single location, making it a suitable choice when high-performance query processing and data analysis are critical. The approach is likewise advantageous when the information sources are expensive to access or occasionally unavailable, when network delays impose high costs, or when middle-layer tasks such as translation or merging are too complex to perform efficiently on demand, possibly requiring human input.
Data warehousing has been found to be an extremely useful technology for a large number of modern applications. Such applications range over diverse domains such as business (for example, trade-market analysis), leisure (travel, weather, or tourist information), science (integration of diagnosis and results from specialists, nurses, and doctors), libraries (museums, art collections, and other multimedia online resources), and education (lecture notes, syllabi, exams, and transparencies from different Web sites).
The data warehousing paradigm may be applied to the following environments:
- Monolithic systems in which a single underlying source providing the data feed is controlled by the same organization as the back-end data warehouse store. For example, in online stores such as Amazon.com, the Web-based front end must handle high-performance transactions by customers, whereas the underlying data warehouse serves as a container for all transactions logged over time for off-line analysis.
- Distributed but closed environments in which a small or moderate number of distributed sources are independent units controlled by trusted owners often with a joint cooperative goal. For example, enterprises of even moderate size with multiple sources of data such as personnel information, customer databases, products, suppliers, and accounting systems may attempt to integrate some of them electronically for analysis purposes.
- Large-scale open environments such as the Web in which unrelated sources appear at unpredictable times and where content providers aim to extract and integrate sources for possibly non-anticipated purposes. For example, the Web currently contains all aspects of travel information including fares and special bargains by different airlines, maps from different cities, and hotel availability in numerous different sources. A travel consolidator service may collect such travel-related information from different online sources and organize it into a tailored travel service repository.
Data Warehousing Architecture
Technology for information integration often features a multi-tier architecture as depicted in Figure 1. There is generally a large variety and number of sources in the underlying environment, each modeled by diverse data models and each supporting different query interfaces. These may include legacy systems, proprietary application programming interfaces, traditional database technology such as relational or object-oriented database systems, as well as newer technology such as HTML Web sites, SGML or XML documents, news wires, and even multimedia sites.
The data warehouse store is (at least logically) a centralized repository that must support complex decision-support queries at high performance; it typically utilizes relational database technology due to that technology's maturity. To bridge the gap between the diverse native models of the sources and this common model, a wrapper is associated with each information source, connecting it to the system. The wrapper is in charge of translating between the native model of the source and the common model of the data warehousing system, including exposing constraints, schema, and data chunks from the specific source to the warehousing system, mapping query requests, and notifying the system of source changes.
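A wrapper's responsibilities can be summarized as an interface. The sketch below is a hypothetical Python rendering of these duties; the class and method names are our own and do not come from any specific wrapper toolkit.

```python
from abc import ABC, abstractmethod

class SourceWrapper(ABC):
    """Bridges one native source to the warehouse's common (relational) model."""

    @abstractmethod
    def export_schema(self):
        """Expose the source's schema and constraints in relational terms."""

    @abstractmethod
    def translate_query(self, relational_query):
        """Map a query in the common model to the source's native interface."""

    @abstractmethod
    def poll_changes(self):
        """Report source changes (data, schema, constraints) since the last poll."""
```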
The data warehouse middle layer consists of a collection of tools in charge of the exploitation of services provided by the individual sources and the proper management of the integrated data warehouse store. This includes, for example, tools for filtering and then merging information from several sources, for managing a metadescription of the information space, and for maintaining the data warehouse store under any change notifications from sources. In recent years, numerous software companies have successfully started to produce the “middleware” software necessary for data integration, often in Web-based e-commerce applications. This represents a growing market of opportunities where more and more database vendors as well as other software companies will be producing related tools and services.
Dynamicity of Information Sources
Individual information sources are often autonomous and generally have an existence and purpose beyond that of supporting the data warehouse itself. In fact, they may not even be aware that one of the clients accessing them is a wrapper connecting them to an integration system. An important consequence of the autonomy of sources is that those sources may change without being controlled from a higher data integration layer. Many sources, particularly Web-based data sources, may change not only their data but also their capabilities without cooperating with users of their data.
Assuming the relational model as a common integration paradigm, the types of changes visible at the middle layer can be categorized as follows (one possible encoding of these change kinds is sketched after the list):
- Data updates such as add-tuple and change-value.
- Schema changes such as add-column and delete-table.
- Constraint modifications such as remove-key-constraint and add-containment-constraint.
- Statistics and metadata adjustments such as selectivity of attributes and size of relations.
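Under these assumptions, the four categories might be encoded as notification records along the following lines; the field layout is an illustrative assumption.

```python
from dataclasses import dataclass

@dataclass
class DataUpdate:           # e.g., add-tuple, change-value
    table: str
    operation: str          # "add-tuple" | "delete-tuple" | "change-value"
    payload: dict

@dataclass
class SchemaChange:         # e.g., add-column, delete-table
    operation: str
    table: str
    detail: dict

@dataclass
class ConstraintChange:     # e.g., remove-key-constraint
    operation: str
    constraint: str

@dataclass
class StatisticsChange:     # e.g., new attribute selectivity, relation size
    table: str
    statistic: str
    value: float
```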
The possibility of these other types of changes, and hence their impact on data warehousing, has thus far largely been overlooked. Most commercial data warehousing systems deal only with the propagation of data updates, if at all, and then often only in batch mode: they collect all updates (such as transactions over a source database) over, say, the course of a day and then load them all at once into the data warehouse store. Recent research in data warehousing has begun to go one step further and has studied the propagation of, and data warehouse maintenance under, (concurrent) data updates [1].
Reasons for Schema Changes
Schema and interface changes are quite common because not only is it difficult to predetermine the database schema for many complex applications on the first pass, but application requirements also typically change over time. For example, [9] documents the extent of schema evolution during the development and the initial use of a health management system at several hospitals: the number of relations increased by 139% and the number of attributes by 274%, and every relation in the schema was changed at least once during the 19-month period of the study. In another study [6], significant changes (about 59% of attributes on average) were reported for seven applications, ranging from project tracking, real estate inventory, and accounting to government administration of the skilled trades and apprenticeship programs. These studies suggest that handling schema changes is an inevitable task not only during the development of a project but also once it has become operational.
Implicit Causes for Schema Changes
Schematically diverging information sources. Even though both the information sources and the data warehouse may abide by the relational data model, it has long been recognized that the same information content may be stored in schematically incompatible ways. For example, a payroll system could store the salaries of all employees in one table, recording the department as a value in each tuple; or it could maintain a separate table for each department. Both schemas would be capable of holding the same information (semantics), but they cannot be directly integrated into one conventional view using currently available variants of SQL. Due to their schema-restructuring capability, second-order languages such as SchemaSQL [3] could be utilized to integrate such relational sources with schematic discrepancies by constructing SQL wrappers (see Figure 2).
In such a schema-transforming SQL wrapper, update propagation raises an interesting new issue. Namely, it turns out that data updates in a SchemaSQL-wrapped source may be transformed into updates of the schema in the associated SchemaSQL view (such as in the data warehouse middle layer), and vice versa. For example, as seen in Figure 2, adding a tuple with a new dept value (in this case, Physics) to the upper table would require the addition of a new attribute in the lower relation with its attribute name equal to that value (the Physics column). In fact, a closer examination reveals that any change of update type (single/multiple, data/schema) is possible, causing more frequent schema changes than in a simple relational database.
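This data-update-turned-schema-change effect can be reproduced with an ordinary pivot. The following Python sketch uses plain dictionaries and illustrative table contents (not the exact data of Figure 2): adding a tuple with a new dept value forces a new column to appear in the pivoted relation.

```python
# Source relation: one tuple per (dept, name, salary).
rows = [
    {"dept": "CS",   "name": "Ann", "salary": 90},
    {"dept": "Math", "name": "Bob", "salary": 80},
]

def pivot_by_dept(rows):
    """Restructure so each distinct dept value becomes an attribute (column)."""
    pivoted = {}
    for r in rows:
        pivoted.setdefault(r["name"], {})[r["dept"]] = r["salary"]
    columns = sorted({r["dept"] for r in rows})
    return columns, pivoted

print(pivot_by_dept(rows)[0])   # ['CS', 'Math']

# A pure *data* update on the source ...
rows.append({"dept": "Physics", "name": "Eve", "salary": 85})

# ... surfaces as a *schema* change in the pivoted view: a new column appears.
print(pivot_by_dept(rows)[0])   # ['CS', 'Math', 'Physics']
```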
Heterogeneous model information sources. The integration of information modeled in other data models into a data warehousing system exhibits a similar behavior. For example, XML, an increasingly popular format for information encoding and exchange, could be integrated by building relational wrappers over XML datasets by mapping XML Document Type Definitions (DTD) into relational metadata and the corresponding XML data files into relational data. However, since XML data files carry their schematic descriptors in the form of tagged fields, updates to XML data files in the DTD-less case can lead to schematic modifications of the relational wrapper view exposed to the data warehouse (see Figure 3).
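A minimal sketch of such a relational wrapper view over DTD-less XML, using Python's standard xml.etree.ElementTree; the XML snippet and the rule that columns are derived from the tags actually present are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

def relational_view(xml_text):
    """Derive relational metadata (columns) from the tags actually present."""
    root = ET.fromstring(xml_text)
    columns, tuples = set(), []
    for record in root:
        row = {child.tag: child.text for child in record}
        columns.update(row)
        tuples.append(row)
    return sorted(columns), tuples

doc = "<books><book><title>DW</title><year>1999</year></book></books>"
print(relational_view(doc)[0])   # ['title', 'year']

# In the DTD-less case, a data update that introduces a new tag ...
doc = ("<books><book><title>DW</title><year>1999</year>"
       "<isbn>0-000</isbn></book></books>")

# ... changes the schema of the exposed relational view.
print(relational_view(doc)[0])   # ['isbn', 'title', 'year']
```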
Data Warehousing Issues for Dynamic Sources
Integrating data from dynamic information sources raises new challenges for the maintenance and evolution of data warehouse systems over such data. These challenges can be classified into three groups:
Adapting to information source changes at the wrapper level. Many solutions for wrappers currently assume some source cooperation in the sense that sources notify the higher layers of the data warehouse of their data updates. For non-cooperating sources, different strategies for the discovery of such changes are being investigated in the research community [2]. Given that changes other than data updates are possible, there is now also a need for the development of algorithms for detecting data and schema updates in diverse sources in order to include an even larger class of possible data providers in an information integration system.
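One simple detection strategy for non-cooperating sources is snapshot differencing. The sketch below is a hedged illustration, assuming each poll of a source yields a column list plus a keyed set of tuples; it reports both data updates and schema changes between two polls.

```python
def diff_snapshots(old, new):
    """Compare two polls of a source; old/new = (columns, {key: tuple})."""
    old_cols, old_rows = old
    new_cols, new_rows = new
    changes = []
    # Schema changes: columns that appeared or disappeared between polls.
    for col in set(new_cols) - set(old_cols):
        changes.append(("add-column", col))
    for col in set(old_cols) - set(new_cols):
        changes.append(("delete-column", col))
    # Data updates: inserted, deleted, and modified tuples.
    for key in new_rows.keys() - old_rows.keys():
        changes.append(("add-tuple", new_rows[key]))
    for key in old_rows.keys() - new_rows.keys():
        changes.append(("delete-tuple", old_rows[key]))
    for key in new_rows.keys() & old_rows.keys():
        if new_rows[key] != old_rows[key]:
            changes.append(("change-value", key, old_rows[key], new_rows[key]))
    return changes
```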
Given that sources may be integrated into the data warehouse using a custom-made wrapper, the evolution of such wrappers under schema changes (ideally without requiring manual input) becomes an extremely important and difficult problem. If this issue remains unaddressed, dynamic sources will tend to become inaccessible to integration systems over time. Clearly, the utilization of wrapper generator technology [5], possibly with built-in facilities for handling evolution of the wrapper itself, would be a step toward solving this problem. A solution must include the capability of translating all update types through wrappers.
In the case of non-cooperative sources, multiple data warehouses or other integration services may need to inquire at different times about the changes an information source has undergone, depending on their respective refresh strategies. Also, the data warehouse managers may not be ready to receive update notifications at the moment the wrappers happen to broadcast them. Spawning many change detection processes, each possibly computationally expensive, would be prohibitively intrusive and thus infeasible in practice. Hence, one important component of dynamic data warehousing architectures is a change-history server (or query subscription service [2]). This manager would be responsible for collecting all changes on a source and for servicing queries about source changes from the middle layer. Such a manager may even support registrations for standing notification requests from the middle-layer tools. An example may be “Give me all updates concerning the relation Product from the regional-sales source if you have collected either at least 10 updates or one hour has passed since the last notification.”
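The standing-notification example just quoted could be served by a component along these lines. This is a hedged sketch; the registration interface, thresholds, and in-memory log are our own assumptions.

```python
import time

class ChangeHistoryServer:
    """Collects changes of a source once and serves many warehouse clients."""

    def __init__(self):
        self.log = []             # (timestamp, relation, change) entries
        self.subscriptions = []

    def record(self, relation, change):
        """Called by the wrapper whenever it observes a source change."""
        self.log.append((time.time(), relation, change))
        self._notify()

    def subscribe(self, relation, callback, min_updates=10, max_wait=3600.0):
        """Standing request: notify after min_updates changes or max_wait seconds."""
        self.subscriptions.append({"relation": relation, "callback": callback,
                                   "min_updates": min_updates,
                                   "max_wait": max_wait,
                                   "last_sent": time.time()})

    def _notify(self):
        # A production server would also fire on a timer; here the time-based
        # condition is only evaluated whenever a new change is recorded.
        now = time.time()
        for sub in self.subscriptions:
            pending = [c for (t, rel, c) in self.log
                       if t > sub["last_sent"] and rel == sub["relation"]]
            if pending and (len(pending) >= sub["min_updates"]
                            or now - sub["last_sent"] >= sub["max_wait"]):
                sub["callback"](pending)   # push the batch to the subscriber
                sub["last_sent"] = now
```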
Adapting the definition of the data warehouse. Given that a data warehouse schema is often defined by a derivation specified over the exported schemas of the underlying information sources, dynamicity of the sources could now render the derivation invalid and possibly leave the data warehouse undefined. Hence, there is a need for technology to either correctly propagate such changes to the data warehouse definition itself or to appropriately isolate the data warehouse from them.
Many data warehousing systems, especially over medium- to large-scale information environments, use some form of metaknowledge to facilitate the discovery and integration of particular information sources into the data warehouse. While not necessarily containing the static global schema constructed from all source schemas that is common in federated database systems, such metaknowledge bases (MKBs) typically contain some information about the schemas and capabilities of all registered information sources [8]. Given that sources are dynamic, algorithms must be developed for evolving the metaknowledge itself so it continues to correctly reflect the underlying information space.
Adapting the data content in the data warehouse. A data warehouse may have special-purpose data feeds set up that “pump” data sets into the store. Once the definition and schema structure of a data warehouse evolve, it becomes necessary to correspondingly adjust the content of the data warehouse store. Due to performance demands on the system and the size of the data warehouse store, such content adjustments are ideally done incrementally rather than by a complete data reload [1, 12]. Hence, incremental techniques for the adaptation of the data warehouse extent become a critical technology to support.
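For plain data updates, incremental maintenance of a simple select-project view might look as follows. This is a textbook-style sketch rather than any algorithm from [1] or [12]; the view predicate and projection are assumptions.

```python
def maintain_view(view_rows, source_delta, predicate, project):
    """Incrementally refresh a select-project view from a source delta
    instead of recomputing it by a complete reload. Assumes set semantics
    (no duplicate tuples) for deletions."""
    for op, tuple_ in source_delta:
        if not predicate(tuple_):
            continue                      # change is irrelevant to this view
        image = project(tuple_)
        if op == "insert":
            view_rows.append(image)
        elif op == "delete" and image in view_rows:
            view_rows.remove(image)
    return view_rows

# Usage: view = SELECT region, amount FROM sales WHERE amount > 100
view = []
delta = [("insert", {"region": "EU", "amount": 150}),
         ("insert", {"region": "US", "amount": 50})]
maintain_view(view, delta, lambda t: t["amount"] > 100,
              lambda t: (t["region"], t["amount"]))
print(view)   # [('EU', 150)]
```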
While data warehouse maintenance under data updates of information sources is a fairly well-established field, maintenance under a mixture of data and schema updates remains largely unexplored [10, 11]. Furthermore, with dynamicity in terms of schema information, such incremental data warehouse maintenance algorithms would now also need to deal with the concurrency of source updates (and their notifications to the warehouse) and maintenance queries (and their result submission to the warehouse). Also, maintenance queries submitted to an information source may encounter an altered schema, due to a source schema change, that no longer matches the submitted query format, resulting in an aborted query request [11].
A Tour of EVE: Toward Flexible Data Warehousing
The Evolvable View Environment (EVE) project [4, 8, 11] is one of the first attempts to address the issue of dynamic sources in its entirety. An architectural overview of EVE is given in Figure 4.
The EVE system follows our earlier assumptions and addresses several of the problems presented. When a source joins the system, it notifies EVE of its content, its capabilities, and possibly its interrelationships with other sources. One example of such information is an overlap in information content offered by two source databases. This source description knowledge is then kept in an MKB.
One core feature of EVE is Evolvable-SQL (E-SQL), an extension of SQL that allows the view definer to express preferences for view evolution. Using E-SQL, a user defining a view can specify what information is indispensable, what information is replaceable by similar information from other ISs, and whether a changing view extent is acceptable. This is done by attaching preference parameters to elements of the view query, and it is the key to evolving a view by rewriting its view definition into a possibly non-equivalent one that still preserves the user’s intended semantics.
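E-SQL's concrete grammar is defined in [7] and is not reproduced here; the following Python sketch merely illustrates the idea of per-component evolution preferences. The flag and policy names are paraphrases of the concept, not E-SQL syntax.

```python
from dataclasses import dataclass

@dataclass
class ViewComponent:
    expr: str                  # an attribute, relation, or condition
    dispensable: bool = False  # may be dropped if its source disappears
    replaceable: bool = True   # may be substituted from a similar source

@dataclass
class EvolvableView:
    select: list
    from_: list
    where: list
    extent_policy: str = "subset"  # acceptable relationship of the evolved
                                   # extent to the original one: "subset",
                                   # "superset", or "equivalent"

view = EvolvableView(
    select=[ViewComponent("Customer.Name"),                    # indispensable
            ViewComponent("Customer.Fax", dispensable=True)],  # may be dropped
    from_=[ViewComponent("Customer")],
    where=[ViewComponent("Customer.Country = 'US'", dispensable=True)])
```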
Once the system is notified about a source schema change that affects a view, the view synchronizer module explores alternative techniques for view query rewritings with the goal of adapting the definition of the data warehouse in a way acceptable to the user. View synchronization algorithms evolve a view definition by finding appropriate replacements for affected view components based on available metaknowledge, and by dropping nonessential view components if flagged as such by E-SQL [7]. Correspondingly, the MKB evolver will evolve the metaknowledge to match the modified state of the information space.
Since view synchronization algorithms may generate many possible query rewritings, one needs to be selected as the new view definition. For this purpose the quality-and-cost (QC) model has been developed to estimate the quality and cost of the rewritings [4]. Each possible query rewriting will in general preserve a different amount (extent) and different types (interface) of information, referred to as the quality of the view. Also, each new view query may cause different view maintenance costs. With these two dimensions, the QC-Model can compare different view queries with each other, even if they are not equivalent.
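A hedged sketch of how such a two-dimensional comparison might be realized in code; the scoring formula, weights, and numbers below are illustrative assumptions, with the actual QC model defined in [4].

```python
def qc_score(rewriting, alpha=0.5):
    """Rank a candidate rewriting by quality vs. maintenance cost.
    rewriting["extent_preserved"]    in 0..1: fraction of original tuples kept
    rewriting["interface_preserved"] in 0..1: fraction of attributes kept
    rewriting["maintenance_cost"]    normalized estimated refresh cost"""
    quality = (rewriting["extent_preserved"]
               + rewriting["interface_preserved"]) / 2
    return alpha * quality - (1 - alpha) * rewriting["maintenance_cost"]

candidates = [
    {"name": "R1", "extent_preserved": 1.0, "interface_preserved": 0.8,
     "maintenance_cost": 0.7},
    {"name": "R2", "extent_preserved": 0.9, "interface_preserved": 1.0,
     "maintenance_cost": 0.2},
]
best = max(candidates, key=qc_score)
print(best["name"])   # "R2" under these illustrative numbers
```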
The view maintainer applies different strategies to update the data content of the warehouse, both after modification of its view definition by the view synchronizer [7], as well as under source data updates [11].
Lastly, the concurrency control module incorporates a solution for evolving the view definition and maintaining the view extent under concurrent data and schema updates. For this, the existing algorithms for view synchronization and view maintenance are integrated into one system by providing protocols that enable them to correctly coexist and collaborate [11].
A demonstration of the EVE system supporting several commercial database systems as information sources has been implemented in Java. The demo can be viewed, and its source code downloaded, at davis.wpi.edu/dsrg/EVE. In its current state, our software addresses the issues of adapting the data warehouse definition as well as some issues of adapting the data content. We have not yet addressed the discovery of source changes at a practical level.
Conclusions
Data warehouses have emerged as one key technology for the integration of distributed information sources. However, networked environments like the Web have produced a need for information integration solutions that take the autonomy, heterogeneity, and dynamicity of sources into account. Challenging issues for data warehouse solutions arise due to the dynamicity of the information sources to be integrated. This article characterizes the types of dynamicity (such as data updates, schema changes, and constraint modifications) as well as their explicit and implicit causes. It also identifies issues for data warehousing systems that arise once the possibility of information source changes is taken into consideration, including adapting wrappers to information source changes, adapting view definitions to such changes, and adapting the data content (extent) of the data warehouse. Lastly, candidate solutions for some of these problems, especially as explored in the context of the Evolvable View Environment (EVE) system, are outlined. Such flexible data warehousing technology will allow more users to make use of distributed information over networks, and will increase the productivity of users and system administrators by providing customized interfaces to information that can be maintained automatically even as the underlying systems change.
Figures
Figure 1. Architecture of a data warehouse.
Figure 2. Relational schema transformation with SchemaSQL: Data change turned schema change.
Figure 3. XML relation mapping and effect of XML data file updates in the relational model context.
Figure 4. The framework of the evolvable view environment (EVE).