Data, and those who manage it, have received heightened attention in recent years. Data is widely viewed as a valuable organizational resource,5 and those charged with its management have been elevated to executive levels of the organization.9 However, as organizations respond to the ever-increasing demands for information, data management activities have become much more complex.1
Contributing to this complexity has been the development and increased use of networked information technologies, which enable large-scale collaborations across time, space, institutional, and disciplinary boundaries. Cyberinfrastructure is an overarching concept that encompasses the hardware, software, services, personnel, and organizations that serve as an underlying foundation in support of collaborative network activities (see Figure 1).
Organizations of all types, including corporate, scientific, and government entities, are engaged in large-scale collaborative networks that work with massive datasets4 with the objectives of generating, processing, and disseminating knowledge across various boundaries. Given that knowledge is applied information and information is contextualized data, the contextual environment within which data management practices are carried out significantly impacts the nature and effectiveness of those practices, as well as the resultant data management outcomes. Yet data management research and practice rarely consider the embeddedness of data in context11 and generally focus on data management within the boundaries of a single organization.
In contrast, the research we present here goes beyond these limitations. We frame our research in the Model for Global IT Issues Analysis,2 which provides a perspective that accounts for the development, use, and management of information systems and technologies in different cultures and locations.10 Then we highlight the importance of contextualized data networks and cyberinfrastructure data management in today's collaborative environment. Finally, we introduce a framework for cyberinfrastructure data management. We illustrate through a case study how use of the framework can guide data practitioners in analyzing the contextual environments of their data networks, determining actual and potential data quality problems, and establishing effective management solutions.
The Model for Global IT Issues Analysis2 shown in Figure 2 focuses on three sets of factors: country-specific variables, firm-specific variables, and resultant key IT management issues. Country-specific variables include political/regulatory factors, level of economic development, and cultural variables. Firm-specific variables include type of firm/industry, global strategies, and global business and IT strategy. In this model, the global context has a geographical reference, whereby the first two sets of factors significantly influence key IT management issues that include business relationships, human resources, quality and reliability, internal effectiveness, end user computing, technology infrastructure, and system development concerns that are heightened when working in multiple locations around the world. Through deliberate determination and analysis of country-specific and firm-specific variables, an effective assessment of key IT management issues can be made for particular global contexts.
We extend this model to encompass a broader meaning of the term "global" and focus specifically on large-scale collaborative environments. We define global to include a holistic view that acknowledges potential differences in cultures, countries, organizations, economies, professional disciplines, knowledge domains, and cyberinfrastructure elements in a large-scale collaborative environment. Additionally, we emphasize the contextualized nature of data and the management of its supporting cyberinfrastructure.
The goals of data management have been described as ensuring organizational data integrity and availability.12 Large-scale collaborative networks, however, have four characteristics that result in specific data management needs.
Hence, managing the cyberinfrastructure supporting massive sets of networked data requires additional governance mechanisms.
For example, collaborative activities focused on emergency prevention and preparedness necessitate the sharing and analysis of geographical data collected and managed across multiple organizations. However, efforts to provide centralized responsibility for ensuring hazard prevention, detection, and response have been a challenge due to the differing geographical models and data formats used by local hazard management and city planning organizations. Prior research8 has highlighted weaknesses in data management that hampered efforts to assess and prevent urban hazards, including dispersion and heterogeneity of data, lack of policies and standards for data sharing, and lack of metadata.
Similar data management challenges have been highlighted in government efforts to implement Homeland Security initiatives in the United States. Three key problems have been noted with databases that currently contain data required for homeland security:7 dispersion of data on different platforms, inadequate data quality (inaccurate, missing, outdated, inconsistent) and inadequate metadata (unable to determine where data resides, what data represents, what format it is in, among others).
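To make these problem categories concrete, the following is a minimal sketch of how missing, outdated, and inconsistent values might be flagged when the same entities are described by more than one source. The record layout, the source names, and the one-year staleness threshold are all assumptions made for illustration; none of them come from the studies cited above. Inaccuracy is not checked, since that would require a trusted reference to compare against.

```python
from datetime import date

# Hypothetical records describing the same entities, held by two sources.
RECORDS = [
    {"source": "agency_a", "id": "S1", "name": "North Depot", "updated": date(2008, 5, 1)},
    {"source": "agency_b", "id": "S1", "name": "Northern Depot", "updated": date(2003, 1, 15)},
    {"source": "agency_b", "id": "S2", "name": None, "updated": date(2008, 6, 30)},
]


def audit(records, as_of, max_age_days=365):
    """Return (problem, record id, source) tuples for simple quality checks."""
    problems = []
    names_by_id = {}
    for rec in records:
        # Missing: a required attribute has no value.
        if rec["name"] is None:
            problems.append(("missing", rec["id"], rec["source"]))
        else:
            names_by_id.setdefault(rec["id"], set()).add(rec["name"])
        # Outdated: the record is older than the assumed threshold.
        if (as_of - rec["updated"]).days > max_age_days:
            problems.append(("outdated", rec["id"], rec["source"]))
    # Inconsistent: sources disagree on the value for the same entity.
    for rec_id, names in sorted(names_by_id.items()):
        if len(names) > 1:
            problems.append(("inconsistent", rec_id, "all"))
    return problems


if __name__ == "__main__":
    for problem in audit(RECORDS, as_of=date(2008, 12, 31)):
        print(problem)
```

Checks of this kind only become possible once dispersed records can be matched on a shared identifier, which is itself a metadata question.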
The nature of large-scale collaborative work demands data management goals beyond organizational data integrity and availability. Below we develop a framework to describe data management in this context.
Contextual Factors. In our adaptation of the Model for Global IT Issues Analysis,2 three sets of factors are critical in the context of large-scale collaborative data: cyberinfrastructure-specific variables, location-specific variables, and discipline/domain-specific variables (see Figure 3).
Location-specific variables entail the political/regulatory factors, levels of economic development, and social/cultural factors relevant to any unit within the collaborative network that affect, or potentially affect, data needs or data management outcomes. Political and regulatory factors, such as national laws, government regulations, corporate policies, and the political climate within the collaborative network itself, can place demands on data and network processes, as well as shape the objectives of data administration. Levels of economic development can vary within the collaborative network, particularly when collaborating units are distributed across developed and underdeveloped countries, urban and rural locations, or large institutions and small businesses. In particular, economic development levels often serve as an indication of the ability to expend resources in support of data management efforts. Social and cultural factors, such as differences in professional work norms, organizational incentive systems, languages, established relationships among network members, and national culture, are likely to affect all aspects of data administration as well as interpretations of data product and service quality.
As depicted in Figures 1 and 3, cyberinfrastructure-specific variables include the hardware, software, services, personnel, and organizations embedded within the collaborative network that support IT-enabled collaborative network activities. Cyberinfrastructure services, including high performance computing; data, information, and knowledge management; observation, measurement, and fabrication; interfaces and visualization; and collaboration services, are all data intensive. The more fully the networking, operating system, and middleware solutions support the cyberinfrastructure services required by the collaborative network, the greater the likelihood that those services will adequately support the community-specific data administration needs of network users.
Discipline/Domain-specific variables encompass the value of data, the existence of data standards, and the influence of the IS/CS disciplines (in terms of understanding current techniques, methodologies, and practices) in each professional discipline and knowledge domain relevant to the collaborative network. Not all units within the collaborative network may value data (i.e., place importance on data quality) in the same way. They may have conflicting or non-existent data standards in many or all domains. And the level of IS/CS expertise residing within, or supporting, the collaborative network may be lacking in units or across the entire network. As a result, discipline/domain-specific variables significantly impact how data administration is likely to ensue.
Taken together, these contextual factors have significant impacts on data quality through their influence on data administration issues within large-scale collaborative networks. It is important to understand the complexity of these factors, as they interact and change, sometimes dramatically, over time. The ability of those charged with managing the data cyberinfrastructure to effectively assess, monitor, control for, and adapt to the contextual environment determines, in large part, how sound, useful, dependable, and usable data are for network users.
Data Administration Issues. Taking into account the many contextual factors that exist and evolve over time during a large-scale collaborative effort, a highly involved level of data administration must take place to assure adequate data quality within the network. Administrators must understand the data objectives that drive network users with regard to the users' and other stakeholders' motivations, goals, and expectations. In a large-scale collaborative effort, it is highly likely that these objectives will vary across network units and external stakeholder populations. It is also likely that network units will have varying degrees of metadata (i.e., standards and data dictionary systems) and data process (i.e., data collection, input, storage, processing, dissemination, and maintenance) capabilities. Indeed, the management of metadata and the establishment of processes for the management of data have been found to be critical to integrating the data needed for large-scale collaborative efforts. Whether these differences are reconciled appropriately will ultimately affect data quality.
Hence, network processes that encompass the interactive nature of collaboration within a large-scale network environment become an important aspect of data administration. In order to effectively determine data needs and capabilities within and across network units, communication, coordination, and monitoring mechanisms must be put in place. Training efforts are needed to standardize data processes and ensure data quality. A critical issue often encountered in large-scale collaborative network environments is data ownership.4 Data administrators must be cognizant of intellectual property issues associated with data within the collaborative network, and may be responsible for safeguarding ownership rights.
Data Quality Impacts. Within large-scale collaborative networks, contextual factors impact data quality in terms of both product and service. Data product quality dimensions refer to how well the data, in terms of its features, conforms to network data specifications (soundness) or meets or exceeds network user expectations (usefulness). Data service quality dimensions refer to how well data delivery processes conform to network user service specifications (dependability) or meet or exceed network user expectations (usability).6 Table 1 defines each data quality dimension highlighted in Figure 3.
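The four dimensions described above form a two-by-two classification: product versus service on one axis, and conformance to specifications versus meeting user expectations on the other. The sketch below encodes that classification as a lookup table. The type and function names (`Aspect`, `Basis`, `dimension`) are illustrative assumptions and are not part of the framework itself.

```python
from enum import Enum


class Aspect(Enum):
    PRODUCT = "product"  # features of the data itself
    SERVICE = "service"  # processes that deliver the data


class Basis(Enum):
    SPECIFICATIONS = "conforms to specifications"
    EXPECTATIONS = "meets or exceeds user expectations"


# The four quality dimensions named in the text, keyed by (aspect, basis).
DIMENSIONS = {
    (Aspect.PRODUCT, Basis.SPECIFICATIONS): "soundness",
    (Aspect.PRODUCT, Basis.EXPECTATIONS): "usefulness",
    (Aspect.SERVICE, Basis.SPECIFICATIONS): "dependability",
    (Aspect.SERVICE, Basis.EXPECTATIONS): "usability",
}


def dimension(aspect, basis):
    """Return the quality dimension for a given aspect and basis."""
    return DIMENSIONS[(aspect, basis)]


if __name__ == "__main__":
    for (aspect, basis), name in DIMENSIONS.items():
        print(f"{aspect.value:8s} | {basis.value:36s} | {name}")
```

Framing a reported problem this way helps separate a failure of the data itself from a failure in how the data is delivered.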
To illustrate the use of this framework, we describe its application in the case of a government agency (GovAg) responsible for managing water resources in a developing country.
Contextual Factors: Massive sets of data were collected throughout the country by a network of units including various departments within GovAg, private contractors, non-profit organizations, and other quasi-government agencies. The cyberinfrastructure supporting the data management activities of this collaborative network consisted of the networking, operating system, and middleware platforms that allowed network members to collect, input, store, process, disseminate and maintain data within their own units, as well as across the network. The implementation of these platforms was inadequate given a low level of resources available in the country.
Individuals from varied disciplines, including surface water, groundwater, information systems (IS), and administrative management, were also part of the cyberinfrastructure support. Within this mix of disciplines, water scientists had the most influence, and many preferred to perform their own data management tasks. Additionally, the educational system was just beginning to implement comprehensive university-level IS programs. Hence, there were few individuals actually trained as information systems professionals working on data management activities within the network.
Data Administration Issues: To adequately manage the data administration issues present in the GovAg network, we examined the objectives, metadata, data processes, and network processes present in the collaborating units, explicitly considering the above contextual factors during our examination. Because of the influence of water scientists and their limited self-training in IS practice, many data processes were inadequately implemented and poorly documented. In the water science community, surface and groundwater have traditionally been treated as separate disciplines. This resulted in silos of data with very different standards.
GovAg had to respond to recent changes in water laws that necessitated greater cooperation, coordination, and integration of data across network units. Given the limited financial and personnel resources available in the country, it was difficult to gather the resources needed to design and implement new data and network processes. Additionally, agreement on new processes was difficult as the motivations, goals, and expectations of network members varied along both organizational and professional lines. As a result, inadequate processes were implemented.
Data Quality Impacts: The handling of these data administration issues had a direct impact on the resulting data quality. There were inconsistent representations of data across the network, and limited resources for data quality processes resulted in data integrity problems. Because there was limited metadata and limited communication among network members, data understandability and interpretability were compromised, limiting the data's usefulness. The believability, reputation, and value-added data characteristics were also questioned.
By using the framework to highlight and discuss these issues with GovAg network members, we were able to make specific recommendations to improve their data management activities. Within a few months, policy changes were enacted across the network, financial resources were reallocated, and personnel resources were reorganized, resulting in improved metadata, data processes, and network processes. Data quality improvements were seen shortly thereafter.
Use of the framework facilitates continuous assessment of data management practices and outcomes. We recommend data administrators make three types of determinations from these assessments. First, if data product or service quality needs are not being met, administrators should determine whether adjustments or reconciliations must be made to the objectives, metadata, data processes, or network processes taking place within individual units or across the collaborative network. Second, given the contextual factors and data administration issues present, they should determine whether adjustments to data quality standards are needed. Third, they should determine whether adjustments or reconciliations to contextual factors are needed, and if so, to what extent those changes can be made.
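The three determinations above can be read as a simple review checklist. The following sketch is one possible encoding, offered as an illustration rather than as part of the framework. The `Assessment` fields and the wording of the review steps are assumptions, and each assessment is reduced to a yes/no flag, which a real review would not be.

```python
from dataclasses import dataclass


@dataclass
class Assessment:
    """Hypothetical summary of one assessment cycle (yes/no simplification)."""
    quality_needs_met: bool       # are product and service quality needs being met?
    standards_fit_context: bool   # do quality standards suit the current context?
    context_change_needed: bool   # do any contextual factors need adjusting?


def determinations(assessment):
    """Return the review steps suggested by the three determinations."""
    steps = []
    if not assessment.quality_needs_met:
        steps.append("review objectives, metadata, data processes, and network processes")
    if not assessment.standards_fit_context:
        steps.append("review data quality standards")
    if assessment.context_change_needed:
        steps.append("review contextual factors and how far they can be changed")
    return steps


if __name__ == "__main__":
    cycle = Assessment(
        quality_needs_met=False,
        standards_fit_context=True,
        context_change_needed=True,
    )
    for step in determinations(cycle):
        print("-", step)
```

Because the assessment is continuous, a checklist like this would be rerun each cycle as contextual factors and administration issues change.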
Having a greater understanding of the contextual factors, data administration issues, and data quality impacts present in the network increases the likelihood that attempted change efforts will achieve their desired outcomes. Hence, we recommend data administrators within large-scale collaborative networks use the framework to guide their study of these areas, some of which may not traditionally fall within their field of attention.
Given the importance of knowledge work accomplished today through large-scale collaborations, effective management of the cyberinfrastructure data supporting these activities is critical. The Framework for Cyberinfrastructure Data Management serves as a tool for data managers and practitioners to determine their data management needs in increasingly complex development efforts. For researchers and analysts, the framework provides a guide to much-needed future research on data cyberinfrastructure, as well as other cyberinfrastructure components.
Developing competency in, and infrastructure for, assessing, monitoring, and managing the massive sets of contextualized data that result from large-scale collaboration networks is not optional; it is a necessity in today's data-, information-, and knowledge-intensive environments. It is our hope that the Framework for Cyberinfrastructure Data Management will benefit managers, practitioners, and academics as communities across the globe continue to take advantage of the tremendous advances in communications and storage technologies.
3. Atkins, D. E. et al. Revolutionizing science and engineering through cyberinfrastructure: Report of the National Science Foundation Blue-Ribbon Advisory Panel on Cyberinfrastructure. National Science Foundation, January 2003.
8. Merson, M., Montoya, L. and Paresi, C. Manage data - Manage hazards: Development of urban hazard information infrastructure for the city of Windhoek, Namibia. Management of Environmental Quality 15, 3 (2004), 276-293.
©2009 ACM 0001-0782/09/0200 $5.00
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.