Technical Opinion: Which Data Warehouse Architecture Is Best?

Over the past 15 years, companies have spent billions of dollars on data marts and warehouses. Despite this experience, there is an important design decision that still causes heated discussion: Which data warehouse architecture is best?

While there are many consultants and vendors that claim that a particular architecture is best, there has been surprisingly little rigorous, empirical research on the topic. The literature tends to describe the architectures, provide case study examples, or present survey data about the popularity of the various architectures. The lack of empirical research on the topic motivated our study.

For the research, in addition to reviewing the data warehousing literature, we formed a group of 20 experts to help identify the architectures to study and the success metrics to use. Bill Inmon and Ralph Kimball, leading authorities in the field and advocates of the two major competing architectures (i.e., hub and spoke and bus, respectively), were among the experts who participated.

We ultimately investigated five architectures: independent data marts, bus, hub and spoke, centralized (no dependent data marts), and federated. While other architectures (e.g., hybrid) are mentioned in the literature, they tend to be variations on these five.

For many organizations, independent data marts are the initial efforts to provide a repository of decision support data. These marts are typically independent of other data stores, and serve specific, localized needs, such as providing data for a particular application or business unit. The data is stored in a data model that best supports how the data is used (e.g., an OLAP cube).

The bus architecture has data marts that support various business processes, such as orders, deliveries, or customer calls. The first mart is built for a single business process using dimensions and measures that are used with other marts (i.e., conformed dimensions). Additional marts are developed using these conformed dimensions, which results in logically integrated marts and an enterprise view of the data. There is no normalized, relational data in this architecture; it is entirely dimensional.
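To make the idea of conformed dimensions concrete, the following is a minimal sketch, not taken from the study. It assumes two hypothetical marts (orders and deliveries) that share one product dimension; all table names, keys, and sample rows are invented for illustration. Because both marts use the same dimension keys, their measures can be combined by category.

```python
from collections import defaultdict

# Conformed dimension shared by every mart (hypothetical sample rows).
product_dim = {
    1: {"name": "Widget", "category": "Hardware"},
    2: {"name": "Gadget", "category": "Hardware"},
    3: {"name": "Manual", "category": "Print"},
}

# Fact rows for two business-process marts; both reference product_dim keys.
order_facts = [
    {"product_key": 1, "qty_ordered": 10},
    {"product_key": 2, "qty_ordered": 5},
    {"product_key": 3, "qty_ordered": 7},
]
delivery_facts = [
    {"product_key": 1, "qty_delivered": 8},
    {"product_key": 2, "qty_delivered": 5},
    {"product_key": 3, "qty_delivered": 4},
]


def total_by_category(facts, measure):
    """Sum one measure per product category via the conformed dimension."""
    totals = defaultdict(int)
    for row in facts:
        category = product_dim[row["product_key"]]["category"]
        totals[category] += row[measure]
    return dict(totals)


def drill_across():
    """Combine measures from both marts on the shared category attribute."""
    ordered = total_by_category(order_facts, "qty_ordered")
    delivered = total_by_category(delivery_facts, "qty_delivered")
    return {
        cat: {"ordered": ordered.get(cat, 0), "delivered": delivered.get(cat, 0)}
        for cat in sorted(set(ordered) | set(delivered))
    }


if __name__ == "__main__":
    print(drill_across())
```

If each mart defined its own product keys or category names, this kind of cross-mart comparison would require a mapping step first, which is the situation independent data marts tend to create.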

The hub and spoke architecture begins with an enterprise-level analysis of data requirements. Attention is focused on building a scalable and maintainable infrastructure. Using the enterprise view of the data, the architecture is developed in an iterative manner, subject area by subject area. In this architecture, atomic level data is maintained in the warehouse in third normal form. Dependent data marts are created that source data from the warehouse, thus maintaining a "single version of the truth." The dependent data marts may be developed for departmental, functional area, or specialized purposes (e.g., data mining) and may have normalized, denormalized, or summarized dimensional data structures depending on user needs.
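As a rough illustration of how a dependent data mart sources its data from the warehouse, here is a minimal sketch under stated assumptions. The normalized tables (customers, regions, sales), their contents, and the function name build_regional_sales_mart are all hypothetical; the point is only that the mart is derived from the atomic data in the hub rather than loaded separately.

```python
# Atomic data held in the hub in normalized form (hypothetical sample rows).
regions = {1: "East", 2: "West"}
customers = {
    101: {"name": "Acme", "region_id": 1},
    102: {"name": "Birch", "region_id": 2},
    103: {"name": "Cedar", "region_id": 1},
}
# Each sale is (customer_id, amount).
sales = [(101, 200), (102, 150), (103, 50), (101, 100)]


def build_regional_sales_mart():
    """Derive a summarized, denormalized mart from the hub's atomic data."""
    mart = {}
    for customer_id, amount in sales:
        # Resolve the region through the normalized customer -> region link.
        region = regions[customers[customer_id]["region_id"]]
        mart[region] = mart.get(region, 0) + amount
    return mart


if __name__ == "__main__":
    print(build_regional_sales_mart())
```

Because every dependent mart would be rebuilt from the same atomic rows, two marts that summarize the data differently still agree on the underlying totals.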

The centralized architecture is similar to the hub and spoke except that there are no dependent data marts. The warehouse contains atomic level data, some summarized data, and logical dimensional views of the data. This architecture is a logical rather than a physical implementation of the hub and spoke architecture.
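One way to picture a "logical dimensional view" is a database view defined over the warehouse tables instead of a physically separate mart. The sketch below uses Python's built-in sqlite3 module with an in-memory database purely for illustration; the table, view, column names, and sample rows are assumptions, not details from the study.

```python
import sqlite3


def regional_totals():
    """Query a summarized view that lives inside the warehouse itself."""
    conn = sqlite3.connect(":memory:")
    # Atomic level data kept in the central warehouse (hypothetical schema).
    conn.execute(
        "CREATE TABLE sales ("
        "sale_id INTEGER PRIMARY KEY, product TEXT, region TEXT, amount INTEGER)"
    )
    conn.executemany(
        "INSERT INTO sales (product, region, amount) VALUES (?, ?, ?)",
        [("Widget", "East", 100), ("Gadget", "East", 50), ("Widget", "West", 75)],
    )
    # A logical view plays the role a dependent mart would play elsewhere.
    conn.execute(
        "CREATE VIEW sales_by_region AS "
        "SELECT region, SUM(amount) AS total_amount FROM sales GROUP BY region"
    )
    rows = conn.execute(
        "SELECT region, total_amount FROM sales_by_region ORDER BY region"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(regional_totals())
```

No data is copied out of the warehouse here; the view is recomputed from the atomic rows whenever it is queried.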

The federated architecture is advocated when there is a fragmented decision support data environment and there is a need to integrate at least some of the data. This is often the case when there are mergers, acquisitions, and company reorganizations. The federated architecture leaves existing decision support structures (e.g., operational systems, data marts) in place. The data is either logically or physically integrated using shared keys, global metadata, distributed queries, or other methods.
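Integration through shared keys can be sketched as follows. This is a simplified illustration, not a description of any product: the two source marts, the customer_id key, and the function federated_view are hypothetical. The sources stay in place, and a combined view is assembled at query time by matching rows on the shared key.

```python
# Two existing decision support stores, left in place (hypothetical rows).
sales_mart = [
    {"customer_id": "C1", "revenue": 1200},
    {"customer_id": "C2", "revenue": 300},
]
support_mart = [
    {"customer_id": "C1", "open_tickets": 2},
    {"customer_id": "C3", "open_tickets": 1},
]


def federated_view(*sources, key="customer_id"):
    """Merge rows from several sources on a shared key (a full outer join)."""
    merged = {}
    for source in sources:
        for row in source:
            record = merged.setdefault(row[key], {})
            # Copy every attribute except the key itself into the merged record.
            record.update({k: v for k, v in row.items() if k != key})
    return merged


if __name__ == "__main__":
    for customer_id, attrs in sorted(federated_view(sales_mart, support_mart).items()):
        print(customer_id, attrs)
```

Customers that appear in only one source come through with partial records, which reflects the point that a federated approach inherits the gaps and inconsistencies of the stores it integrates.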

The literature and expert interviews identified two major categories of success metrics. Product measures are associated with information and system quality, impacts on individual users, and impacts on the organization. Project measures relate to the time and cost of implementing the architecture.

The Survey and Findings

We developed a Web-based survey that asked about the data warehouse in the respondent’s company, the architecture that was implemented, the success of the architecture, the respondent’s company, and the respondent; 454 respondents provided completed questionnaires.^a

The respondents were relatively evenly distributed over data warehouse managers, data warehouse staff members, IS managers, and independent consultants/system integrators. The latter were asked to complete the survey with a particular client in mind.

The companies participating in the survey ranged from small (i.e., less than $10M in revenues) to large (i.e., in excess of $10B). Most of the companies are located in the U.S. (60%) and represent a variety of industries, with financial services (15%) providing the most responses.

The hub and spoke is the most prevalent architecture (39%), followed by the bus architecture (26%), centralized (17%), independent data marts (12%), and federated (4%). The most common platform for hosting the data warehouses is Oracle (41%), followed by Microsoft (19%) and IBM (18%).

Most of the data warehouses support either several business units (38%) or the entire company (36%). Fewer than 12% of the warehouses support a single functional area or subunit. However, the domain or scope of the warehouse varies with the architecture. The hub and spoke and centralized architectures have the broadest domain and are company wide in over 40% of the organizations. The bus architecture is enterprise wide in about 30% of the companies, followed by the federated (26%) and independent data marts (18%) architectures.

We computed mean product success measures for the various architectures. The independent data mart architecture scored lowest on all measures. Next lowest on all measures was the federated architecture. What was most interesting was the similarity of the success scores for the bus, hub and spoke, and centralized architectures. No statistically significant differences (MANOVA was used) were found among these three architectures on the product success metrics. All three provide similar, consistently high scores on all of the product success metrics (generally in the mid 5s on the 1-7 scale).

The survey instrument asked respondents to indicate the average amount of time required to implement the first subject area or business process in the warehouse. It took just under nine months for the independent data marts, bus, and centralized architectures. The next longest time was required by the federated architecture, with the hub and spoke architecture taking the most time, at 11.5 months.

The hub and spoke had the highest average initial rollout cost of all the architectures, at close to $2.5M. It was also the most costly architecture to maintain, at an average cost of $1.24M.

Conclusion

The findings help explain why there is both agreement and disagreement over which architecture is best. The study findings show conclusively that independent data marts are the weakest solution in terms of information quality, system quality, individual impacts, and organizational impacts. This is consistent with conventional wisdom. Though not as weak, the federated architecture tended to score relatively low on the success metrics. This is also not surprising. A federated architecture must "make do" with an existing decision support infrastructure and to some extent has to live with its weaknesses.

The most important finding is how similarly the bus, hub and spoke, and centralized architectures scored on the product success metrics. This similarity also helps explain why these competing architectures have survived over time: they can be equally successful.

The similarity of the product success of the bus, hub and spoke, and centralized architectures may not be too surprising. Over time, each approach has incorporated strengths from the others. For example, the hub and spoke architecture typically includes dimensional data marts, which are fundamental to the bus architecture. Advocates of all architectures now recognize the importance of rolling out an initial version quickly in order to realize early "wins" or financial "lift" and maintain management support.

There were differences in terms of development time and cost. Because of the up-front planning, large organizational domain, and additional components (e.g., dependent data marts), the hub and spoke architecture takes the longest time and is the most costly to develop initially. The other architectures tend to be similar in terms of development time and cost.

Overall, we found that the major data warehouse architectures can deliver good information quality, system quality, individual impacts, and organizational impacts. The study did not find a clear "winner" in the "data warehouse architecture wars" because there is not one. The product success metrics are very similar for the bus, hub and spoke, and centralized architectures. Companies can select an architecture based on other relevant factors, such as the availability of resources, the urgency of the need for the warehouse, management’s strategic view of the warehouse, the organizational domain served, compatibility with existing systems and technologies, the recommendations of consultants, and others.

Footnotes

a. The full research report is available at http://www.terry.uga.edu/~hwatson/DW_Architecture_Report.pdf

DOI: http://doi.acm.org/10.1145/1400181.1400213

October 2008 Issue

Communications of the ACM (CACM) is now a fully Open Access publication.
