The Metadata Enigma

Metadata promises too much value as a business management tool to dismiss its implementation and maintenance effort as the equivalent of Sisyphean torture.

By Ganesan Shankaranarayanan and Adir Even

Posted Feb 1 2006

Metadata, the layer of data abstraction, is among the more enigmatic elements in information systems. Enigmatic, since IT professionals have vague and sometimes conflicting views of its role and value. While system administrators contend that no system can function robustly without metadata, business decision makers often fail to see its value and consequently find it difficult to justify metadata-related expenditures [5, 8]. Meanwhile, academic researchers have not examined metadata in depth, and consequently theoretical frameworks for assessing the value of metadata do not exist.

In the past, before commercial database management systems were widely adopted by organizations, metadata was a second-class citizen in the data management field [7]. Application and system developers who sought to implement metadata solutions considered it a Sisyphean¹ torture. Metadata requirements are complex and difficult to capture, implementation is demanding, the end result is rarely satisfactory, and enhancements or corrections require significant effort, as metadata layers are deeply embedded in systems. Why is implementing metadata solutions so difficult? Here, we explore this question in the context of a data warehouse, introducing the multiple elements that constitute metadata to illustrate its inherent complexity. We further explore the drawbacks of commercial off-the-shelf (COTS) products for managing metadata and the challenge of designing and implementing metadata solutions. Finally, we ask whether the difficulties introduced by metadata offset its benefits. The answer is not clear-cut, and many factors must still be examined.

Metadata is often viewed as a system’s data dictionary, capturing definitions of data entities and the relationships among them. While not inaccurate, this is also an overly narrow view that overlooks the richness and complexity of metadata. Metadata has been categorized in many ways; for example, technical metadata is used for system operation and maintenance, and business metadata is used for data valuation and interpretation [5]. While metadata may be a technical necessity, recent studies of data quality [9], decision theory [1, 2] and knowledge management [6] suggest it has significant business implications as well. A second categorization differentiates back-end metadata (associated with data storage and processing) from front-end metadata (associated with information delivery and use) [8]. The table here lists several examples of metadata components based on these classifications. A third categorization is based on functionality and reflects design and maintenance characteristics abstracted by the metadata. It identifies six distinct metadata types [8]:

Infrastructure metadata abstracts the components of the system (such as hardware, operating systems, networking, and database servers) and is used primarily for system maintenance;
Model metadata, or the data dictionary, abstracts the modeling of data into entities and their relationships: conceptual, logical, and physical [3]. It includes a semantic layer for translating the physical data elements into business terms interpretable by end users and vocabulary mapping, or synchronizing data definitions and interpretations, across multiple user groups [4];
Process metadata captures information on how data is generated and the transformations it undergoes from source to target. In a data warehouse, process metadata is linked to the back-end processes that extract, transform, and load (ETL) data. Decision makers must understand data flow and be able to gauge data quality in order to create technical mapping of the data warehouse environment [9];
Quality metadata captures the assessment of the actual data stored in the system, including physical measurements (such as numbers of records and overall size in bytes) and quality measurements (such as accuracy and completeness);
Interface, or delivery and reporting, metadata captures how business users consume the data. It includes information on what data is delivered (such as via reports), presentation formats, and delivery configuration (such as timing and targets). It may also include information on report templates, how fields within templates are linked to data elements, predefined data filters, and dimension hierarchies for drill-downs and aggregations; and
Administration metadata tracks data usage, including information on users, security, and access privileges to data and applications.

Recent studies [1, 2] have explored the effects of metadata—specifically quality and process metadata—on decision-making efficiency and decision outcome. Decision makers may evaluate data quality both impartially (objectively) and contextually (subjectively) [9]. Impartial evaluation is based on the data itself, including missing records, miscalculated fields, and integrity violations. Contextual evaluation accounts for a variety of factors (such as the process that generated the data, the decision task the data is used for, and the decision maker’s motivation and expertise). To support them, quality metadata includes pre-evaluated measurements (along the dimensions of data quality, such as accuracy, timeliness, and completeness) derived from the data itself, while process metadata offers a way to gauge quality within a decision context.

Providing these metadata components to decision makers has been shown to significantly improve their decision process efficiency [1], as well as the decision outcome [1, 2]. Figure 1 outlines how quality metadata can be used to enrich a report. Contextually, quality metadata plays a more significant role when the report is used to help determine a quarterly bonus (a context) and a less significant role if used for routine performance tracking (a different context).
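As a rough illustration of how quality metadata might enrich a report line, the following minimal Python sketch derives a completeness measurement from the data itself and then interprets it under two decision contexts. The `QualityMetadata` structure, the function names, the sample values, and both thresholds are assumptions made for this example; they are not taken from the studies cited above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityMetadata:
    """Pre-evaluated quality measurements for one data element (hypothetical structure)."""
    element: str
    record_count: int
    completeness: float  # share of non-missing values, 0.0 to 1.0


def measure_completeness(element: str, values: list) -> QualityMetadata:
    """Derive an impartial completeness measurement from the data itself."""
    present = sum(1 for v in values if v is not None)
    ratio = present / len(values) if values else 0.0
    return QualityMetadata(element, len(values), ratio)


def annotate(label: str, total: float, quality: QualityMetadata, threshold: float) -> str:
    """Attach quality metadata to a report line; the threshold stands in for decision context."""
    flag = "OK" if quality.completeness >= threshold else "REVIEW"
    return f"{label}: {total:.2f} (completeness {quality.completeness:.0%}, {flag})"


# Illustrative data: one of four sale amounts is missing.
sales = [120.0, None, 80.0, 100.0]
quality = measure_completeness("sale_amount", sales)
total = sum(v for v in sales if v is not None)

# The same figure reads differently under a strict (bonus) and a lenient (tracking) context.
bonus_line = annotate("Q1 sales", total, quality, threshold=0.95)
tracking_line = annotate("Q1 sales", total, quality, threshold=0.70)
print(bonus_line)
print(tracking_line)
```

The measurement is computed once and stored with the data element; only the threshold changes with the decision task, which mirrors the split between impartial and contextual evaluation.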

The metadata categories—front-end/back-end, business/technical, and functional types—are not necessarily distinct and may reflect strong interdependencies. For example, source-target transformations (process metadata) must be mapped to the physical configuration (infrastructure metadata); data delivery information (interface metadata) must be tied to registered users (administration metadata); and quality (content) metadata must relate to the actual data element it describes (model metadata). A consequence of adopting a narrow view of metadata while failing to understand the relationships among metadata components is the creation of fragmented “metadata islands.” Each island includes metadata of a specific functionality, unaware of and unable to communicate with other islands. Even when system designers and developers understand this complexity, implementing an integrated metadata layer is resource-intensive in terms of money, time, and managerial effort, as well as being a technical challenge.
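The interdependencies among the six functional types can be pictured as references between metadata records. The sketch below is one hypothetical way to model them in Python and to flag references that point outside the repository, which is an early sign of a metadata island. The record keys, the `depends_on` field, and the sample repository are illustrative assumptions only.

```python
from dataclasses import dataclass, field
from enum import Enum


class MetadataType(Enum):
    """The six functional metadata types listed earlier."""
    INFRASTRUCTURE = "infrastructure"
    MODEL = "model"
    PROCESS = "process"
    QUALITY = "quality"
    INTERFACE = "interface"
    ADMINISTRATION = "administration"


@dataclass
class MetadataRecord:
    key: str
    kind: MetadataType
    depends_on: list = field(default_factory=list)  # keys of other records


def unresolved_links(records):
    """Return (record key, missing key) pairs for references that leave the repository."""
    known = {r.key for r in records}
    return [(r.key, dep) for r in records for dep in r.depends_on if dep not in known]


# Hypothetical repository: the report refers to a user that no administration record describes.
repository = [
    MetadataRecord("server.dw01", MetadataType.INFRASTRUCTURE),
    MetadataRecord("table.sales", MetadataType.MODEL),
    MetadataRecord("etl.load_sales", MetadataType.PROCESS, ["server.dw01", "table.sales"]),
    MetadataRecord("quality.sales", MetadataType.QUALITY, ["table.sales"]),
    MetadataRecord("report.q1", MetadataType.INTERFACE, ["user.analyst"]),
]

missing = unresolved_links(repository)
print(missing)
```

Here the interface record cannot be tied to administration metadata, so the check reports it; a real repository would need richer link semantics than a flat list of keys.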

Commercial Solutions²

Given the different types of metadata and the complexity of managing them, IT practitioners turn to COTS products for implementing data warehousing and metadata solutions [4, 10]. COTS products in this area are broadly classified into three categories. The first is data storage and management systems (such as Oracle, Sybase, MS-SQL, IBM-UDB, and Hyperion-Essbase). The second includes automated back-end data-processing utilities, commonly known as ETL products (such as Informatica, Oracle Warehouse-Builder, MS-SQL DTS, IBM Warehouse Manager, and Hummingbird). And the third includes reporting, or business intelligence (BI), utilities (such as MicroStrategy, Business Objects, and Cognos). Most leading data warehousing products provide GUI-supported utilities for metadata management.

An examination of the leading COTS products reveals an ambiguous picture. On one hand, all the vendors acknowledge the importance of metadata and embed it within their products. On the other, these products fail to address several critical issues:

Comprehensive support for all metadata functionalities. While infrastructure, model, and administration metadata are commonly supported, process metadata is supported only by ETL products, and data-delivery metadata only by BI products. Quality metadata is not yet supported by any of the leading products. This fragmentation necessitates integration across multiple products to create a comprehensive metadata repository;
Little support for business metadata (and an emphasis on technical metadata). Essential elements (such as those required to interpret data-transformation processes, link user administration to customer relationship management utilities, and track resource use) are minimally provided, if at all. An exception is the business elements of a data dictionary supported by most reporting and BI products;
Metadata storage and representation formats restricted to relational or complex proprietary structures. This makes products more efficient but preempts integration and/or exchange of metadata with other formats, including textual and graphical (such as ER and DaVinci diagrams); and
Metadata elements tightly coupled with one product or within a suite of products from the same vendor (such as IBM, Microsoft, or Oracle). Limited support is offered by way of common interfaces for metadata exchange, hindering metadata integration across tools. The leading vendors of data warehousing products are trying to address the need for more robust metadata exchange capabilities.

Design and Implementation Challenges

The broad and complex functionality of metadata, coupled with insufficient support for metadata management from software products, poses several challenges for implementing metadata solutions. A successful implementation must also address other technical and managerial factors, including:

Interchangeable metadata formats. Metadata can be captured and represented in a variety of formats. For example, textual flat files are easy to implement and read but are less secure and do not readily support the capture of the relationships among metadata components. Relational models are easier to centralize and integrate, relatively secure, and equipped with a standard access method (SQL) and with well-defined data-administration utilities. However, relational implementation can be complex and expensive (in terms of RDBMS purchase costs and administration overhead). Graphical structures (such as entity relationship models) are more interpretable but require user training; they are also not easy to integrate with metadata in other formats. Documents allow business users to easily understand metadata and capture complex detail. On the flip side, integrating documents with other formats is difficult; documents also require significant administrative overhead. Proprietary data structures are customizable for specific organizational needs, but integrating them with standard formats is difficult.

Metadata implementation is likely to involve more than one format. Certain data entities may require abstraction in multiple formats, hence efficient interchangeability among formats is highly desirable. A common approach for achieving compatibility and interchangeability is to choose one format (typically the relational model) as the baseline for the others. Figure 2 outlines the concept of format interchangeability. For example, the Sale Transactions data in the figure is abstracted into three metadata formats—tabular, textual, and visual—each targeting a different user group.
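The idea of a relational baseline feeding other formats can be sketched in a few lines of Python using the standard `sqlite3` module. The `column_meta` table, its columns, the sample rows, and the two rendering functions are assumptions made for illustration; the DOT-style string is only a stand-in for a visual format and is not tied to any particular diagramming product.

```python
import sqlite3

# Relational baseline: one row per described column (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE column_meta (entity TEXT, col TEXT, dtype TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO column_meta VALUES (?, ?, ?, ?)",
    [
        ("sale_transactions", "sale_id", "INTEGER", "Unique sale identifier"),
        ("sale_transactions", "amount", "DECIMAL", "Sale value in local currency"),
    ],
)


def rows_for(entity):
    """Read the baseline rows for one entity in insertion order."""
    cur = conn.execute(
        "SELECT col, dtype, description FROM column_meta WHERE entity = ? ORDER BY rowid",
        (entity,),
    )
    return cur.fetchall()


def as_text(entity):
    """Textual rendering aimed at business users."""
    lines = [f"{entity}:"]
    lines += [f"  {col} ({dtype}): {desc}" for col, dtype, desc in rows_for(entity)]
    return "\n".join(lines)


def as_dot(entity):
    """DOT-style rendering (a record-shaped node) aimed at diagram tools."""
    fields = "|".join(f"{col}: {dtype}" for col, dtype, _ in rows_for(entity))
    return f'digraph G {{ {entity} [shape=record, label="{{{entity}|{fields}}}"]; }}'


print(as_text("sale_transactions"))
print(as_dot("sale_transactions"))
```

Because both renderings are generated from the same rows, a change made in the baseline propagates to every derived format, which is the main argument for choosing a single baseline.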

Integrating metadata. Without appropriate controls metadata might evolve inconsistently across the enterprise as complex, isolated, nonreusable “pockets” tightly coupled with individual applications [7]. These pockets might lead to conflicting standards for business entities, disable efficient communication among subsystems, and complicate system maintenance [5]. Metadata management is moving from managing decentralized metadata pockets to managing centralized repositories [7]. The metadata repository reflects this trend [3, 5, 10], providing enterprisewide storage of metadata that integrates all components, offers better control, and avoids metadata islands. Unfortunately, a comprehensive commercial solution for full-fledged metadata integration does not exist. A major obstacle for integration, as pointed out earlier, is the lack of standardization among COTS products that manage metadata.

Efforts to overcome metadata exchange and integration problems have been partially successful. The market, however, is still split between two competing metadata exchange standards: the Open Integration Model (OIM) and the Common Warehouse Model (CWM). The Metadata Coalition, led by Microsoft, proposed OIM in 1999. At about the same time, the Object Management Group, led by Oracle, promoted CWM. Both standards allow data warehousing tools to keep a proprietary form of metadata while permitting access to it through a standard interface. However, they differ in their mix of exchangeable metadata elements and in their exchange formats and hence are not readily interchangeable.

The existence of competing standards complicates the selection of tools for a data warehouse implementation. To facilitate easier integration, it is preferable to select a mix of tools committed to the same metadata interchange standard. Alternatively, the standards gap can be bridged through specialized metadata management tools (such as MetaCenter by Data Advantage Group, Advantage Data Transformer by Computer Associates, MetaBase by MetaMatrix, and Unicorn System by Unicorn). These tools provide a unified metadata infrastructure, broad coverage of technical and business metadata, claimed vendor independence, and support for multiple metadata exchange formats. XML, another emerging solution for integration, is the standard for data interchange among distributed applications. XML is used by CWM for metadata exchange, integrating both data and metadata into a single structure.

Design paradigms. Organizations face several choices when designing an enterprise repository. An elementary choice is among the top-down, bottom-up, and hybrid strategies for capturing requirements. A top-down approach would look at the entire set of organizational information systems and try to capture an overall metadata picture.³ A bottom-up approach, on the other hand, would start from the lower granularity of subsystems and bring their metadata specifications together into one unified set. While a top-down paradigm is more likely to ensure standardization and integration among subsystems, it might be infeasible in cases involving information systems with local metadata repositories already in place. Moreover, capturing metadata requirements for an entire organization is a complex and tedious task that might not be completed in a reasonable time. The bottom-up paradigm, focusing on a specific system, is more likely to achieve short-term results but may fail to satisfy broader integration needs.

Organizations may compromise by choosing a hybrid approach that still starts at a high level but focuses on the metadata modules most critical to the organization, as well as key functionality types (such as semantic layer, security configuration, information flow rules, and data quality assessment). The key modules serve as the base for a core repository that becomes a centralized metadata source for those specific functionality types. The metadata repository is not comprehensive or exhaustive to start with. After the core implementation, initial modules may be expanded and others added incrementally as the repository grows. Compared to the top-down approach, the hybrid approach has the advantage of allowing faster, less-complex, less-costly implementations. At the same time, it still provides a centralized, integrated solution for the key metadata elements.

The metadata repository architecture is another important design choice [8]. A centralized architecture, corresponding to a top-down paradigm, locates the organizational metadata repository on a centralized server and becomes the only metadata source for all front-end and back-end utilities. Alternatively, a distributed architecture, corresponding to a bottom-up design paradigm, allows systems to maintain their own customized metadata. Hybrid architecture allows metadata to reside with applications but keeps control (and the key components) in a centralized repository.
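One way to picture the hybrid architecture is a lookup that always prefers centrally controlled definitions and falls back to an application's own metadata. The Python sketch below is a simplified assumption of how that could work; the `HybridRepository` class, its method names, and the sample entries are hypothetical and do not describe any product mentioned in this article.

```python
class HybridRepository:
    """Central store for key components plus per-application local metadata (illustrative)."""

    def __init__(self, central):
        self.central = dict(central)  # key components, centrally controlled
        self.local = {}               # application name -> that application's own metadata

    def register_local(self, app, entries):
        """Accept local metadata unless it tries to redefine a centrally controlled key."""
        conflicts = sorted(set(entries) & set(self.central))
        if conflicts:
            raise ValueError(f"centrally controlled keys: {conflicts}")
        self.local.setdefault(app, {}).update(entries)

    def resolve(self, app, key):
        """Central definitions win; otherwise use the application's local entry, if any."""
        if key in self.central:
            return self.central[key]
        return self.local.get(app, {}).get(key)


repo = HybridRepository({"customer": "A person or firm with at least one sale"})
repo.register_local("billing", {"invoice_age": "Days since invoice date"})

print(repo.resolve("billing", "customer"))
print(repo.resolve("billing", "invoice_age"))
```

A fully centralized design would drop the `local` store entirely, and a fully distributed one would drop `central`; the hybrid keeps both and settles conflicts in favor of the central definitions.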

The chosen design paradigm and derived architectural approach are likely to be influenced by the organizational structure and the complexity of its information systems. It is unlikely that a large organization with sophisticated information needs would adopt a top-down design for metadata and implement it in a centralized manner. Such organizations are likely to have many information systems, hence are more likely to apply a decentralized or hybrid architecture through the corresponding design paradigms. Smaller organizations, with less complex information needs, can afford the luxury of a top-down approach, aiming to capture the entire set of metadata requirements and implement a centralized architecture.

Metadata quality. The design and initial implementation of the metadata layer is only the beginning. As organizations enhance their business activities or transition to new ones, information systems, the underlying data, and metadata must all be updated accordingly. Poor-quality metadata can result not only from poor analysis of requirements or from an invalid design approach but also from the failure to detect changes in the business and reflect them in the metadata layer. With metadata being at the functional core of information systems, poor quality might in turn degrade the quality of the data, cause operational failures, and violate information security. To keep metadata functional and its quality high, organizations need to invest in its ongoing administration and maintenance. Metadata must be constantly updated to reflect evolving changes in data models, business rules, information flow, end-user configuration, and underlying technology.
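Part of this ongoing maintenance can be automated by comparing recorded model metadata with what the database actually contains. The following Python sketch uses the standard `sqlite3` module and SQLite's `PRAGMA table_info` to detect such drift. The `sales` table, the recorded column set, and the function names are assumptions chosen for illustration; other database systems expose their catalogs differently.

```python
import sqlite3


def actual_columns(conn, table):
    """Read column names from the live schema.

    PRAGMA table_info returns one row per column; index 1 holds the column name.
    The table name is interpolated directly, so it must come from a trusted source.
    """
    return {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}


def drift(recorded, actual):
    """Compare recorded model metadata with the live schema."""
    return {
        "missing_from_metadata": sorted(actual - recorded),
        "stale_in_metadata": sorted(recorded - actual),
    }


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_id INTEGER, amount REAL, channel TEXT)")

# Hypothetical data dictionary entry that was not updated after a schema change.
recorded = {"sale_id", "amount", "region"}
report = drift(recorded, actual_columns(conn, "sales"))
print(report)
```

A check like this covers only structural changes in the data model; changes in business rules, information flow, or end-user configuration would still need their own detection and review.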

Metadata Research

Given these challenges, is metadata implementation worth the effort? If IT/IS researchers and practitioners understand only metadata’s technical merits but not its business benefits, why should business managers care about metadata? Wouldn’t it be reasonable for COTS product vendors to focus on technical metadata, designing it exclusively for IT professionals while ignoring the business decision makers? The answer is not obvious, as the benefits of metadata are not yet well known. Recent studies [1, 2, 6] suggest that metadata may significantly benefit business decision makers and hence ought to be further explored by the academic research community.

Data-quality management is a promising area for metadata [9]. Due to the rapid growth of data volume and complexity, poor data quality represents a growing hazard to information systems. Metadata enables decision makers to gauge data quality and is critical for the administration of processes and security within decision-support environments (such as a data warehouse).

Managerial decision making stands to benefit from metadata [1, 2]. Understanding this benefit involves several questions:

What types of decision making are most likely to benefit from metadata? Metadata is likely to be useful in rational, data-driven, analytical decision making scenarios. Not clear is whether it provides similar benefit in decision processes that are more intuitive or politically charged;
How is that benefit influenced by the decision maker role? It would be reasonable to assume that a middle-tier manager and a CEO would each benefit, though in different ways; and
What stages of the decision process benefit most from metadata? Decision making is a multi-stage process where preliminary cycles of elaboration and search for data may precede the final decision. Metadata may affect not only that decision but the efficiency of preliminary stages of data exploration as well.

Securing organizationwide support is typically the greatest challenge in any successful metadata implementation. Such support cannot be achieved without identifying and communicating the merits of metadata to the technical community, to business users, and to corporate decision makers alike. Those merits, however, as well as the drawbacks, have yet to be fully explored, and many questions remain to be answered before metadata value is fully known.

Figures

Figure 1. Business reporting with and without data quality metadata.

Figure 2. Interchangeability of metadata formats.

Tables

Table. Classification of sample metadata elements.

DOI

10.1145/1113034.1113035

February 2006 Issue

Published: February 1, 2006

Vol. 49 No. 2

Pages: 88-94
