Enhancing Data Quality in Data Warehouse Environments

Nothing is more likely to undermine the performance and business value of a data warehouse than inappropriate, misunderstood, or ignored data quality.

Decisions by senior management lay the groundwork for lower corporate levels to develop policies and procedures for various corporate activities. However, the potential business contribution of these activities depends on the quality of the decisions and in turn on the quality of the data used to make them. Some inputs are judgmental, others are from transactional systems, and still others are from external sources, but all must have a level of quality appropriate for the decisions they will be part of.

Although concern about the quality of one’s data is not new, what is fairly recent is using the same data for multiple purposes, which can be quite different from their original purposes. Users working with a particular data set come to know and internalize its deficiencies and idiosyncrasies. This knowledge is lost when data is made available to other parties, like when data needed for decision making is collected in repositories called data warehouses.

Here, we offer a conceptual framework for enhancing data quality in data warehouse environments. We explore the factors that should be considered, such as the current level of data quality, the levels of quality needed by the relevant decision processes, and the potential benefits of projects designed to enhance data quality. Those responsible for data quality have to understand the importance of such factors, as well as the interaction among them. This understanding is mandatory in data warehousing environments characterized by multiple users with differing needs for data quality.

For warehouses supporting a limited number of decision processes, awareness of these issues coupled with good judgment should suffice. For more complex situations, however, the number and diversity of trade-offs make reliance on judgment alone problematic. For such situations, we offer a methodology that systematically incorporates various factors, including trade-offs, and guides the data warehousing manager in allocating resources by identifying the data quality enhancement projects that maximize value to the users of the data.


Data warehousing efforts may not succeed for various reasons, but nothing is more certain to yield failure than lack of concern for the quality of the data. “Ignore or trivialize problems with the existing data at the start of the project, and that oversight will brutally assert itself . . .” [4]. Another reason warehousing efforts may not succeed is failure to store the appropriate data. Although this fact seems obvious, availability is still sometimes the sole criterion for storage. Data from outside sources and soft data may be ignored completely, even though such data is critical for many decision purposes. Although it might seem that data availability is not a data quality issue, users who regularly work with data include data availability as an attribute of data quality [12].

Data supporting organizational activities in a meaningful way should be warehoused. However, a particular data set may support several low-level organizational activities, whereas another supports only one activity but with higher priority. If a choice has to be made as to which data should be warehoused, how should a warehouse manager decide? Moreover, it may be relatively inexpensive to clean up a data set that is seldom used, but expensive to improve the quality of a frequently used data set. Again, if a choice has to be made, which data set should be worked on? There are other kinds of trade-offs. For example, it may be possible to improve the timeliness of certain data at the expense of completeness. Or it may be inexpensive to include some relatively unimportant data but costly to obtain unavailable but important data. All these trade-offs are in the context of limited resources available for improving the data’s quality.

A distinguishing characteristic of warehoused data is that it is used for decision making, rather than for operations. A particular data set may well support several decision processes. Support for several processes complicates data management, because these uses are likely to require different degrees of data quality. Research on enhancing data quality has for the most part been in the context of data supporting individual activities, usually of an operational nature [3].

Data warehousing efforts have to address several potential problems. For example, data from different sources may exhibit serious semantic differences. A classic case is the varying definitions of “sales” employed by different stakeholders in a corporation. Furthermore, data from various sources is likely to contain syntactic inconsistencies, which also have to be addressed. For example, there may well be discrepancies in the time periods for activity reports (such as bimonthly vs. every two weeks). Moreover, the desired data may simply not have been gathered.

A significant characteristic of data warehouses is the prominent roles of soft data and historical data. Operational systems exclude soft data, yet such data is often critical for decision making. By “soft data,” we mean data whose quality is inherently uncertain; an example is human resources evaluations related to future task assignments involving subjective rankings. For a data warehouse to support decision making, soft data, though imperfect, has to be made available. Whereas operational systems focus on current data, decision making often involves temporal comparisons. Thus, historical data often has to be included in data warehouses.

It is not sufficient to state that the data is wrong or not useful; such evaluations offer no guidance as to how warehouse managers should go about improving the data. To report that the data is inconsistent indicates a problem rather different from saying that the data is out of date. Part of the enhancement effort is to elicit from data users as precisely as possible what it is about the data they consider unsatisfactory. Determining what’s wrong with the data is facilitated by being aware of the dimensions of data quality.

It has long been recognized that data is best described or characterized via multiple attributes, or dimensions. For example, in 1985, Ballou and Pazer identified four dimensions of data quality: accuracy, completeness, consistency, and timeliness [1]. More recently, Wang and Strong [12] analyzed the various attributes of data quality from the perspective of the people using the data. They identified a more comprehensive set of data quality dimensions, adding believability, value added, interpretability, accessibility, and others to the earlier four. These dimensions were then grouped into four broad categories: intrinsic, contextual, representational, and accessibility. For example, accuracy belongs to intrinsic; completeness and timeliness to contextual; consistency to representational; and availability to accessibility.

The earliest work in data quality was carried out by the accounting profession (such as [5]). Several studies have looked at the social and economic effects of inadequate data quality [6, 9]. The consequences of poor data quality have been explored using modeling approaches [1]. Various studies have sought to determine how to improve the quality of data [7, 8]. And other work has examined the nature of data quality [10–12].

Think Systematically

For enhancement efforts to be worthwhile, users and data warehouse managers alike have to think systematically about what is required. And organizations have to distinguish between what is vital and what is merely desirable. Data warehouses are developed to support some subset of the organization’s activities. For example, in the marketing realm, success depends on such organizational activities as market segmentation, brand and sales management, and marketing strategy. These activities are fundamental to business profitability and success. To make decisions, managers need access to both historical and soft data. Historical data can include sales by product, region, and time period; soft data can include customer preference surveys, fashion trends, and sales forecasts.

For data quality enhancement efforts to succeed, the data warehouse manager should first ascertain from decision makers what organizational activities the data warehouse has to support. Some mechanism has to be used to determine the priority of these activities. All other factors being equal, data quality enhancement efforts should be directed first to the activities with the highest priority.

Identifying data sets required to support the targeted organizational activities is also needed. By “data set,” we mean some clearly distinguishable collection of data, such as a traditional file or some aggregation of external data. Some data sets may not currently exist but can be created at reasonable cost and effort. It is important to note that each data set potentially supports several organizational activities simultaneously.

Finally, existing and potential problems with the data set have to be identified. The dimensions of data quality guide this determination. For example, if the required data set does not exist, then its data quality is deficient on the availability dimension. If the data exists but for whatever reason cannot be obtained, then accessibility is a problem. If there are semantic discrepancies within the data set, then interpretability is problematic. A data set can be deficient on more than one dimension. For example, a data set may not be accessible and may also be incomplete.

It is assumed that various projects can be undertaken to improve the quality of data. One might be to resolve syntactic differences among customer data records. Another might be to obtain regional sales data on a more timely basis. Another could be to identify and enforce a common definition of “sales” (at least as far as the warehouse is concerned). Another could involve obtaining external data regarding competitors’ activities—a form of soft data not currently available. And yet another might be to extract a relevant subset of some transaction file. Each project influences the quality of one or more data sets, which in turn influence one or more of the organization’s activities. To make these ideas more precise, we introduce the following index notation:

  • I. Index for organizational activities supported by a data warehouse
  • J. Index for data sets
  • K. Index for data quality attributes or dimensions
  • L. Index for possible data quality projects

Consider the following scenario. The data warehouse supports three organizational activities: production planning (I=1), sales tracking (I=2), and promotion effectiveness (I=3). These activities are supported by three data sets: an inventory file (J=1), an historical sales file (J=2), and a promotional activity file (J=3). In terms of the quality dimensions, the inventory file is inaccurate (K=1, accuracy) and somewhat out of date (K=2, timeliness). However, the sales file is accurate (K=1) but has a monthly lag (K=2). The promotional activity file is incomplete (K=3, completeness), containing only aggregate data. To enhance the quality of these files, several projects might be undertaken: eliminate about 50% of the errors in the inventory file (L=1); make the sales file more timely, even if doing so sacrifices some accuracy (L=2); and tie the promotional file more closely to actual sales (L=3). In practice, limited resources preclude undertaking all proposed projects.
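To make this notation concrete, the scenario can be recorded as a few small lookup tables. The following minimal Python sketch merely names the index sets; the variable names are illustrative and not part of the original formulation:

    # Index sets for the illustrative scenario
    activities = {1: "production planning",            # I
                  2: "sales tracking",
                  3: "promotion effectiveness"}
    data_sets = {1: "inventory file",                   # J
                 2: "historical sales file",
                 3: "promotional activity file"}
    dimensions = {1: "accuracy",                        # K
                  2: "timeliness",
                  3: "completeness"}
    projects = {1: "remove about 50% of inventory-file errors",   # L
                2: "make the sales file more timely",
                3: "tie the promotional file to actual sales"}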

A number of factors influence data quality enhancement projects. In practice, however, determining them precisely is quite difficult.

Current quality CQ(J,K). The current quality of each data set J is evaluated on each relevant data quality dimension K. For this scenario, the current accuracy CQ(J=1,K=1) and timeliness CQ(J=1,K=2) of the inventory file, among others, have to be evaluated.

Required quality RQ(I,J,K). This factor represents the level of data quality required by activity I, which uses data set J and depends on dimension K. Data warehousing implicitly assumes that the stored data is used for more than one purpose, and the quality requirements for these purposes can vary dramatically. For one use (such as aggregate production planning) of a particular data set (sales forecasts), order-of-magnitude correctness may suffice; for another (such as item-level planning), each value must be precise. The required quality also depends on the data quality dimensions.

In this scenario, the sales data set (J=2) supports the organizational activities production planning (I=1) and sales tracking (I=2). The current timeliness (K=2) of the sales file is more than adequate for sales tracking but inadequate for production planning.

For data sets used by more than one application, the current quality of a particular dimension may be insufficient for some of the applications and more than sufficient for others.

Anticipated quality AQ(J,K;L). This factor represents the quality of data set J on dimension K resulting from undertaking project L. The people responsible for data quality can undertake projects affecting the quality of various data sets, but a given project may well affect the various dimensions of data quality in different ways. It is quite possible that an effort to improve quality on a particular dimension diminishes quality on another dimension; an example is the trade-off between accuracy and timeliness [2]. So, if project L=2 (to make sales file J=2 more timely) is undertaken, then the timeliness dimension (K=2) would improve, but accuracy (K=1) would suffer.

Although data quality projects usually seek to improve data quality, such a goal is not always on the agenda. For example, suppose that for a certain class of data much more data is gathered and stored than is needed. It might well make sense to actually lower the level of completeness for the data set. The reason would be to save resources.

The metric used to measure each of the three factors—CQ, RQ, and AQ—has to be the same. One could use the domain [0,1], with 0 representing the worst case and 1 the best [2]. For example, in regard to the inventory file, suppose CQ(1,1) = 0.6 (60% of the items are accurate). If project L=1, which seeks to eliminate 50% of the errors, is undertaken, then half of the 40% error rate remains, so AQ(1,1;1) = 0.8. Other possible metrics could be based on ordinal scales.
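Under the [0,1] metric, the anticipated accuracy after an error-reduction project can be computed directly from the current quality. A minimal sketch, assuming errors are removed uniformly (the function name is hypothetical):

    def anticipated_accuracy(current_quality, fraction_of_errors_removed):
        # Remaining error rate after the project, on the [0,1] scale
        remaining_errors = (1.0 - current_quality) * (1.0 - fraction_of_errors_removed)
        return 1.0 - remaining_errors

    # Inventory file (J=1), accuracy (K=1): CQ(1,1) = 0.6; project L=1 removes 50% of errors
    aq = anticipated_accuracy(0.6, 0.5)   # -> 0.8, matching AQ(1,1;1)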

Priority of organizational activity, Weight(I). Some organizational activities are clearly more important than others. Weight(I) represents the priority of activity I. All factors being equal, common sense dictates that data sets supporting high-priority activities receive preference over those involved in low-priority activities. Weight(I) should satisfy 0 < Weight(I) < 1, and the weights should sum to 1.

Cost of data quality enhancement, Cost(L). The cost of undertaking project L is represented by Cost(L). The various data-quality enhancement projects involve commitment of such resources as funds, personnel, and time, all of which are inherently limited. There are also various crosscurrents between the Cost(L) and the Weight(I), including the question: Is it better to undertake an expensive project that supports an activity whose weight is large, or a project that might support several activities, each of moderate importance? The only restriction on the measurement of the Cost(L) values is that the units are the same in all cases. Cost(L) values should be viewed as the total cost associated with project L over some period of time, such as a fiscal year or budget cycle. They include not only the cost of actual project work but also the costs of ongoing data-quality-related activities necessitated by the project. This interpretation means it is possible for Cost(L) to be negative. For example, if the current quality of a particular data set is better than required, then reducing that quality saves resources, resulting in a negative Cost(L); an example is storing summary data rather than unneeded detail data.

Value added, Utility(I,J,K;L). This factor represents the change in value or utility for organizational activity I should project L be undertaken. Since project L could affect different dimensions of different data sets required by activity I, the utility depends explicitly on each relevant data set J and dimension K. Several considerations regarding the utility values have to be addressed. A particular project L may leave some data sets totally unaffected; for such data sets, Utility(I,J,K;L) = 0. The utility can be either positive (the project enhances data quality relevant to the activity) or negative (the project degrades it). There is no value in improving quality more than required. And finally, a particular project need not, and usually will not, completely remove all data quality deficiencies.

A logical consequence of these observations is that for each organizational activity I, there is a conceptual relationship between the magnitude of change in data quality and the value for Utility(I,J,K;L). In practice, determining all the functional relationships between the change in data quality and Utility(I,J,K;L) would be daunting. The simplest way to make that determination would be to use whatever knowledge the organization has to estimate all the non-zero Utility(I,J,K;L) values. In our scenario, there are 3×3×3×3 = 81 potential Utility(I,J,K;L) values. I has three values, as do J, K, and L. However, only eight such values have to be estimated, as the others are zero (see Table 1).

Obtaining the Utility(I,J,K;L) estimates may be difficult and imprecise but nevertheless has to be done. After all, warehouse managers can’t make an intelligent decision about which projects should be undertaken unless they are aware of which data sets and data quality dimensions are affected by a particular project, as well as the relative magnitude of the benefit. Even more important, this process forces warehouse managers to think systematically about data quality enhancement.

Which Projects First?

For relatively simple situations, these factors and issues can guide data quality enhancement of warehoused data. However, as the number of organizational activities supported, data sets involved, and potential enhancement projects increases, it becomes difficult to cope with the various trade-offs and impossible to determine an optimal project mix by judgment alone. For such situations, an integer programming model can help identify the subset of projects that best enhances the quality of the warehoused data. See Table 2 for a mathematical formulation based on this model’s description.

Each project L is associated, through the organizational activities it affects, with the weights Weight(I) and a set of utilities Utility(I,J,K;L). The weighted sum of these utilities produces an overall worth for project L, call it Value(L). Note that Value(L) captures the fact that a particular data quality project can influence several dimensions of the same data set, as well as multiple data sets involved in various organizational activities.
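In the notation introduced earlier, one natural reading of this weighted sum is Value(L) = Σ_I Weight(I) × Σ_J Σ_K Utility(I,J,K;L), with the inner sums running over the data sets and dimensions affected by project L.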

Optimization models include an objective function that, in this case, is the sum of the Value(L) of all selected projects. In addition to the objective function, there are several constraints. For example, the sum of the costs of all projects selected cannot exceed the budget (Resource Constraint). Suppose that projects P(1), P(2), . . . , P(S) are mutually exclusive, so only one can be carried out; a constraint is then needed to select at most one such project (Exclusiveness Constraint). Also suppose that, for example, projects P(1) and P(2) overlap in the sense that they share various aspects. Then develop a new project P(3) representing both P(1) and P(2), and select at most one of the three (Interaction Constraint). Such constraints are not necessarily exhaustive, and, as in all modeling, the formulation can be refined. For example, if appropriate, Cost(L) can be organized into fixed and variable costs.
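Since Table 2 gives the formal statement, only a rough computational sketch is offered here. For a handful of candidate projects, the same selection can be made by enumerating feasible subsets rather than calling an integer programming solver; the numerical values and the mutually exclusive group below are hypothetical:

    from itertools import combinations

    value = {1: 5.0, 2: 3.0, 3: 4.0}        # Value(L): weighted utility of each project
    cost = {1: 40.0, 2: 25.0, 3: 30.0}      # Cost(L)
    budget = 60.0
    exclusive_groups = [{2, 3}]              # at most one project from each such group

    def feasible(subset):
        if sum(cost[p] for p in subset) > budget:        # Resource Constraint
            return False
        for group in exclusive_groups:                    # Exclusiveness Constraint
            if len(group & set(subset)) > 1:
                return False
        return True

    candidates = (s for r in range(len(value) + 1)
                    for s in combinations(value, r) if feasible(s))
    best = max(candidates, key=lambda s: sum(value[p] for p in s))
    print("Projects to undertake:", sorted(best))

With these illustrative numbers, the sketch selects project 1 alone: projects 2 and 3 exclude each other, and either combination with project 1 would exceed the budget.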

Implementation of this integer programming model identifies the data quality projects that increase the utility of the warehoused data to the greatest possible extent, systematically incorporating the various trade-offs, including the value of obtaining data sets not currently available and the gains from reducing the quantity of stored data in certain cases.

However, the difficulty in implementing any model lies in obtaining values for its parameters. For our model, the data warehouse manager has to:

  • Determine the organizational activities the data warehouse will support;
  • Identify all sets of data needed to support the organizational activities;
  • Estimate the quality of each data set on each relevant data quality dimension;
  • Identify a set of potential projects (and their cost) that could be undertaken for enhancing or affecting data quality;
  • Estimate for each project the likely effect of that project on the quality of the various data sets, by data quality dimension; and
  • Determine for each project, data set, and relevant data quality dimension the change in utility should a particular project be undertaken.

The first two items apply even if data quality is not addressed. The third and fourth are required if a data warehouse manager wants to improve data quality. A problem arises with the third item in that the task of measuring data quality is not yet fully solved [11]. The difficulty in implementing the model resides in the fifth and sixth items.

However, a model can always be simplified at the expense of its ability to capture reality. One of the model’s complexities arises from the need to evaluate the quality of the data sets on the relevant dimensions. This evaluation is necessary if trade-offs regarding the dimensions of data quality represent a significant issue. If the trade-offs are not important, then an overall quality measure for each data set is adequate. Thus, the current quality CQ would depend on the data set only; the required quality RQ on the activity and data set only; the anticipated quality AQ on the data set and project only; and the change in utility on the activity, data set, and project only. Ignoring the trade-offs would substantially reduce the effort of estimating parameters.

The model can be formulated using a continuous measurement scale, implying CQ is some value between 0 and 1 and Utility(I,J,K;L) is a continuous function of the change in data quality. Implementation of the model can be simplified considerably by converting to a discrete version. For example, CQ could be qualitatively evaluated as low, medium, or high, and the change in utility of a data set on a certain dimension resulting from a specific project could also be so measured. Finally, these qualitative estimates—low, medium, high—would be converted into corresponding numerical values, such as 1, 2, 3.
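As a small illustration of such a conversion, qualitative judgments could be mapped to numbers with a simple lookup (the scale below is one possible convention, not prescribed by the model):

    QUALITATIVE_SCALE = {"low": 1, "medium": 2, "high": 3}

    # e.g., the promotional activity file judged "low" on completeness
    cq_promotions_completeness = QUALITATIVE_SCALE["low"]   # -> 1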

Conclusion

Data warehousing depends on integrating data quality assurance into all warehousing phases—planning, implementation, and maintenance. Since warehoused data is accessed by users with varying needs, data quality activities should be balanced to best support organizational activities. Therefore, the warehouse manager has to be aware of the various issues involved in enhancing data quality. That’s why we presented ideas and described a model to support quality enhancement while highlighting relevant issues and factors. Combined, they are a framework for thinking comprehensively and systematically about data quality enhancement in data warehouse environments.

Tables

Table 1. Required Utility(I,J,K;L) values for the illustrative scenario

Table 2. Integer programming model formulation

References

    1. Ballou, D., and Pazer, H. Modeling data and process quality in multi-input, multi-output information systems. Management Science 31, 2 (Feb. 1985), 150–162.

    2. Ballou, D., and Pazer, H. Designing information systems to optimize the accuracy-timeliness trade-off. Information Systems Research 6, 1 (Mar. 1995), 51–72.

    3. Ballou, D., and Tayi, G. Methodology for allocating resources for data quality enhancement. Commun. ACM 32, 3 (Mar. 1989), 320–329.

    4. Celko, J., and McDonald, J. Don't warehouse dirty data. Datamation 41, 19 (Oct. 1995), 42–53.

    5. Cushing, B. A mathematical approach to the analysis and design of internal control systems. Accounting Review 49, 1 (Jan. 1974), 24–41.

    6. Laudon, K. Data quality and due process in large interorganizational record systems. Commun. ACM 29, 1 (Jan. 1986), 4–18.

    7. Morey, R.C. Estimating and improving the quality of information in an MIS. Commun. ACM 25, 5 (May 1982), 337–342.

    8. Redman, T. Data Quality: Management and Technology. Bantam Books, New York, 1992.

    9. Strong, D., and Miller, S. Exceptions and exception handling in computerized information processes. ACM Transactions on Information Systems 13, 2 (Apr. 1995), 206–233.

    10. Wand, Y., and Wang, R. Anchoring data quality dimensions in ontological foundations. Commun. ACM 39, 11 (Nov. 1996), 86–95.

    11. Wang, R., Storey, V., and Firth, C. A framework for analysis of data quality research. IEEE Transactions on Knowledge and Data Engineering 7, 4 (Aug. 1995), 623–640.

    12. Wang, R.Y., and Strong, D.M. Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems 12, 4 (Spring 1996), 5–34.
