Data warehousing has become a standard practice for most large companies worldwide. The data stored in the data warehouse can capture many different aspects of the business process, such as manufacturing, distribution, sales, and marketing. This data reflects explicitly and implicitly customer patterns and trends, the effectiveness of business strategies and resultant practices, and other characteristics. Such data is of vital importance to the success of the business whose state it captures. This is why companies decide to engage in the relatively expensive undertaking of creating and maintaining a data warehouse, where the costs routinely reach millions of dollars [11].
For a data warehousing project to succeed, it is essential to choose a suitable data modeling approach. Organizations considering a data warehousing project should examine the real differences and trade-offs between available methodologies and determine for themselves which approach is best suited for their environments. Despite the growing pervasiveness of data warehouses, there is hardly a consensus among researchers and practitioners about the most appropriate data modeling strategies for data warehousing projects. In order to help readers recognize what the choices are and the implications of making a particular selection, this article provides an impartial and concise view of the competing methodologies and the issues that drive the ongoing debate about them.
Data Warehouses and Data Marts
A typical organization maintains and utilizes a number of operational data sources. These operational data sources include the databases and other data repositories that are used to support the organization’s day-to-day operations. A data warehouse is created within an organization as a separate data store whose primary purpose is data analysis for the support of management’s decision-making process [7]. Often, the same fact can have both operational and analytical purposes. For example, data describing that customer X bought product Y in store Z can be stored in an operational data store for business-process support purposes such as inventory monitoring or financial transaction record keeping. That same fact can also be stored in a data warehouse where, combined with vast numbers of similar facts accumulated over a time period, it serves to reveal important trends such as sales patterns or customer behavior.
There are two main reasons that necessitate the creation of a data warehouse as a separate analytical data store. The first reason is that the performance of operational queries can be severely diminished if they must compete for computing resources with analytical queries. The second reason lies in the fact that, even if the performance is not an issue, it is often impossible to structure a database that can be used (queried) in a straightforward manner for both operational and analytical purposes. Therefore, a data warehouse is created as a separate data store, designed for accommodating analytical queries. A typical data warehouse periodically retrieves selected analytically useful data from the operational data sources. In so-called active data warehouses [2], the retrieval of data from operational data sources is continuous. For any data warehouse, the infrastructure that facilitates the retrieval of data from operational databases into the data warehouses is known as ETL, which stands for Extraction, Transformation, and Load.
A data mart is a data store based on the same principles as a data warehouse, but with a more limited scope. Whereas a data warehouse combines data from operational databases across an entire enterprise, a data mart is usually smaller and focuses on a particular department or subject. Dimensional modeling [8] is a principal data mart modeling technique. It uses two types of tables: facts and dimensions. A fact table contains one or more measures (usually numerical) of a subject that is being modeled for analysis. Dimension tables contain various descriptive attributes (usually textual) that are related to the subject depicted by the fact table. The intent of the dimensional model is to represent relevant questions whose answers enable appropriate decision making in a specific business area [4].
The two figures here illustrate an example where dimensional modeling is used to design a data mart that retrieves data from two operational relational databases. Figure 1 shows two separate operational databases for a retail business. Table 1 shows the sample values of data stored in databases A and B. The operational database A stores information about sales transactions. In addition to transaction identifier and date, each sale transaction records which products were sold to which customer and at which store. Operational database B stores information about customers’ demographic and credit rating data.
In order to enable analysis of sales-related data, a dimensionally modeled data mart C, shown in Figure 2, is created. This data mart contains information from relational databases A and B. The purpose of data mart C is to enable analysis of sales across all dimensions that are relevant for the decision-making process and are based on existing and available operational data. The fact table SALES contains a numeric measure (Units Sold) of the subject sales and foreign keys pointing to the relevant dimensions. Dimension Customer integrates the data from the Customer Table in database A and all three tables in database B. Dimension Store merges data from tables Store and District in database A. Dimension Product merges data from tables Product and Category, also from the database A. Dimension Calendar contains details for each date that fits the range between the date of first and last transaction recorded in Sales Transaction Table in database A.
The dimensions Customer, Store, and Product are not normalized. Since the data in data marts is almost never updated, given that most data marts are “read and append only” data stores, the main motivation for normalization (the elimination of the possibility of update anomalies) does not exist in this case. On the other hand, de-normalized (pre-joined) dimensions can greatly improve the performance and convenience of analysis, by reducing the need for joining tables in queries.
Each dimension has a new key, specially designed for the dimension itself. As shown in Table 2, the value of the key is not imported from the operational database. Instead, the value of the key is a unique system-generated semantic-free identifier. This feature insulates dimensions from possible changes in the way operational keys are defined (and possibly redefined) in operational databases over time. The system-generated key also has a role in tracking the history of changes in dimensions’ records.
Once data mart C is modeled using dimensional modeling techniques, and then populated with the data from A and B, finding answers to questions such as “Find the top 10 products sold to customers of demographic type `Hispanic’ and credit rating `Good’ during the month of August for the past three years” is achieved in a quick fashion by issuing one simple query. If the data mart was not developed, the process of finding an answer to this question would be much more complicated and would involve rummaging through the operational databases A and B, and issuing multiple queries.
Contemporary methodologies offer several data modeling options for designing a data warehouse.
When a dimensionally modeled data mart is in place, performing data analysis is fairly straightforward. It usually involves using online analytical processing (OLAP) [3] tools, which allow users to query facts and dimensions by using simple point-and-click interfaces.
Data Warehouse Modeling Options
Contemporary methodologies offer several data modeling options for designing a data warehouse, which are described and compared here with regard to how they approach and utilize the data mart modeling process.
One option for modeling data warehouses, first proposed by Inmon [7], envisions a data warehouse as an integrated database modeled by using the traditional database modeling technique (ER modeling). After such a data warehouse is created, it then serves as a source of data for dimensionally modeled data marts and for any other (non-dimensional) analytically useful data sets. Figure 3a illustrates this option.
The idea behind this method is to have a physically stored central data warehouse modeled as an Entity-Relationship (ER) model. All integration of the underlying operational data sources occurs within a central data warehouse ER model. As mentioned previously, the process of integrating and consolidating data from the operational databases into the data warehouse is known as the ETL process. Developing the infrastructure needed for the ETL process is usually the most time- and resource-consuming part of the entire data warehousing effort [9]. It is not uncommon for a data warehousing project team to spend as much as 50%70% of development time on ETL functions [10]. However, this entire process is driven by its target: the data warehouse model. When a proper data warehouse data model is developed, the requirements for the ETL development process are clearly defined. The success of the ETL development phase is then a matter of accurate and efficient execution. Once a data warehouse is completed and populated with the data, various analytically useful extracts are possible, based on this powerful fully integrated relational database. One of the primary types of analytical extracts is indeed a dimensionally modeled data mart, which is then queried using OLAP tools. The Inmon method envisions the need for other, non-dimensional, data set extracts as also potentially useful for analysis and decision support. Such non-dimensional extracts may include single tables, data sets intended for data mining, flat files, and so forth. Inmon and his collaborators view these extracts, the data warehouse itself, as well as additional analytical data stores, as part of a larger concept they call a Corporate Information Factory (CIF) [5].
Another method, championed by Kimball [8], views a data warehouse as a collection of dimensionally modeled data marts. Figure 3b illustrates this option. As Figure 3b illustrates, this approach is analogous to the previous approach when it comes to the utilization of operational data sources and the ETL process. The difference is the modeling technique used for modeling the data warehouse. In this approach, a set of commonly used dimensions (such as Calendar) known as conformed dimensions is designed first. Fact tables corresponding to the subjects of analysis are then added. A set of dimensional models is created where each fact table is connected to multiple dimensions, and some of the dimensions are shared by more than one fact table. In addition to the originally created set of conformed dimensions, additional dimensions are included as needed. The result is a data warehouse that is a collection of intertwined dimensionally modeled data marts.
Corporations and organizations should choose between these modeling methodologies based on the nature of their current and future analytical needs.
The trade-off between the two methodologies can be described as a trade-off between extensiveness and power versus quickness and simplicity. The Inmon approach requires the creation of a data warehouse ER model as a first step. The result of this process can then be used in subsequent steps as a basis for modeling dimensional and non-dimensional extracts. In the Kimball approach, dimensionally modeled structures are created without creating an underlying ER model for them. If dimensional structures are all that an organization will ever require to fulfill its data analysis needs, then the Kimball approach is a quicker and simpler way to create a data warehouse. However, if other types of analytical data stores will be needed in addition to the dimensional structures, then the Inmon approach provides a more powerful method. An ER modeled data repository can be used as a basis for extracts into data stores structured in a variety of ways. On the other hand, the dimensional model is developed strictly for end user OLAP-style analysis.
Corporations and organizations should choose between these modeling methodologies based on the nature of their current and future analytical needs. The analysis of those needs should indicate to them which method is more appropriate and suitable for their environments.
Comparing these two approaches has been a topic of lively debate in the data warehousing field for the past few years [1, 5, 6]. Often, attempts are made to describe one method as superior to another. While doing so, certain supposed comparative disadvantages are frequently mentioned for one (or the other) of the approaches. The most common such comment about the Inmon approach is that its data modeling phase requires high levels of expertise and a considerable upfront time commitment. On the other hand, the most common criticism of the Kimball approach is that it lacks enterprisewide focus and concentrates primarily on the individual business units or groups of users. However, a closer inspection of these arguments reveals they do not withstand impartial scrutiny.
Looking first at the Inmon approach, it is obvious that it requires more time to be spent on modeling, but this is due to the fact that the created model is usable in a number of different ways. Also, recall that the majority of data warehousing projects are serious multimillion-dollar endeavors, and that the majority of resources must be spent on the ETL process (whose success is highly dependent on the data modeling phase). Having this in mind, it is difficult to see how ensuring that the data modeling aspect is done in a non-hurried way using a high level of expertise is a shortcoming or inconvenience.
As mentioned, the common criticism of the Kimball approach is it lacks enterprisewide focus. The Kimball approach does allow the fact tables to represent subjects of interest to the individual business units or group of users (for example, a fact table showing employee sick days, used by the HR department). However, the same approach also allows for the creation of enterprisewide fact tables used simultaneously by many (if not a majority) of the departments and units within the organization (such as a fact table showing complete global and domestic sales, and profits for all products.) Therefore, the assertion that the Kimball approach is incapable of providing enterprisewide focus is simply not accurate.
The reality is that both approaches offer a viable alternative for modeling and creating data warehouses. When choosing a data modeling approach for a data warehouse, the decision should be based on which approach is a better fit, rather than on trying to decide which methodology is “better.” An organization may even choose a mixture of these approaches (for example, using dimensional modeling for conceptual design of subject areas and then using this conceptual design as a basis for creating the ER model of an actual physical data warehouse).
Even though the majority of the discussions about data modeling approaches for data warehouses involve the Inmon and Kimball approaches (and their variations), there is a third approach that should also be acknowledged and discussed. This approach involves the creation of independent data marts, as illustrated by Figure 3c.
In this method, standalone data marts are created independently of other data marts in the organization. Consequently, multiple ETL systems are created and maintained.
Whereas there is ongoing discourse among data warehousing practitioners and researchers on the merits of the Inmon vs. the Kimball approaches, there is a consensus among virtually all members of the data warehousing community about the inappropriateness of using the independent data marts approach as a strategy for designing a data warehouse. There are obvious reasons why independent data marts are considered an inferior strategy. Two major shortcomings are the unnecessary repetition of the ETL effort and the inability for cross-department analysis and communication. In spite of these obvious disadvantages, a significant number of corporate data warehousing projects end up being developed as a collection of independent data marts. The reason for this seeming paradox lies in the lack of initial enterprisewide focus when data analysis is concerned. Simply, a number of departments within an organization take a “go-at-it-alone” approach in developing the data marts needed for their analytical needs. This is commonly due to the “turf” culture in organizations, where individual departments put more value on their independence than on cross-department collaboration. In those cases, the existence of independent data marts is more of a symptom of the organization’s structural culture than a result of deliberately adopting an inferior data warehousing approach. Moreover, in some cases the budgeting structure of an organization forces parts of the organization to undertake isolated initiatives. When contemplating the development of data analysis systems, departments and other constituent groups within organizations are often presented with the choice of creating independent data marts or doing nothing. Given those two choices, independent data marts certainly represent a better option.
Summary
The issue of data warehouse modeling choice is often misunderstood or simply avoided by accepting a perceived benchmark practice by default. However, the choice of a proper modeling approach is a crucial decision, which often determines whether the data warehouse implementation will succeed or end up as a costly failure.
Creating a data warehouse is an effort that requires participation from across an assortment of organizations’ functional areas and the integration of data from multiple sources. Consequently, people associated with data warehousing projects come with different backgrounds and various technical levels of expertise. The description of central issues determining a data modeling approach for the data warehouse given in this article can help all parties involved with the data warehousing project to understand the available alternatives and make a contribution toward making an appropriate choice for their organizations.
Figures
Figure 1. ER-modeled operational database.
Figure 2. Dimensionally modeled data mart C.
Figure 3a. ER-modeled data warehouse.
Join the Discussion (0)
Become a Member or Sign In to Post a Comment