The Provenance of Electronic Data

It would include details of the processes that produced electronic data as far back as the beginning of time or at least the epoch of provenance awareness.

By Luc Moreau, Paul Groth, Simon Miles, Javier Vazquez-Salceda, John Ibbotson, Sheng Jiang, Steve Munroe, Omer Rana, Andreas Schreiber, Victor Tan, and Laszlo Varga

Posted Apr 1 2008

Provenance is well understood in the study of fine art where it refers to the documented history of some art object. Given that documented history, the object attains an authority that allows scholars to understand and appreciate its importance and context relative to other works. Art objects that lack a proven history may be viewed with skepticism by those who study them.

If the provenance of data produced by computer systems could be determined, then users would be able to understand how documents had been assembled, how simulation results were determined, and how financial analyses were carried out. Computer applications should thus be transformed, making them provenance-aware, so the data’s provenance may be retrieved, analyzed, and reasoned over.

The Oxford English Dictionary defines provenance as: “(i) the fact of coming from some particular source or quarter; origin, derivation; (ii) the history or pedigree of a work of art, manuscript, rare book, etc.; concretely, a record of the ultimate derivation and passage of an item through its various owners.” Hence, we can regard provenance as the derivation from a particular source to a specific state of an item. The description of such a derivation may take different forms or emphasize different properties according to a user’s personal interest. For instance, for a work of art, provenance usually identifies its chain of ownership; alternatively, the actual state of a painting may be understood better by studying the various restorations it has endured.

The dictionary definition also identifies two distinct ways to view provenance: the source (or derivation) of an object and the record of the derivation. A computer-based representation of provenance is crucial for users who want to analyze, reason, and decide whether or not they trust electronic data.

Here, we introduce the provenance life cycle, summarizing key principles underpinning existing provenance systems. We then examine an open data model for describing how applications are executed; in this context, provenance is seen as a user query over such descriptions. We illustrate the vision of provenance-aware applications through a concrete example in health-care management, contrasting it with existing systems.

The scientific and business communities [6] both embrace a service-oriented architecture (SOA) that allows the dynamic discovery and composition of services. SOA-based applications are increasingly dynamic and open but must satisfy new requirements in both e-science and business. In an ideal world, e-science end users would be able to reproduce their results by replaying previous computations, understand why two seemingly identical runs with the same inputs produce different results, and determine which data sets, algorithms, or services were involved in their derivation.

In e-science and business, some users, reviewers, auditors, and even regulators must verify that the process that led to some result complies with specific regulations or methodologies; further, they must prove the results were derived independently from services or databases with given license restrictions; and they must also establish that the data was captured at the source by instruments with some precise technical characteristics.

While some users must perform such tasks today, they are unable to do so or do it only imperfectly, because the underpinning principles have not been investigated, and systems have not been designed to support such requirements. A key observation is that electronic data does not typically contain the historical information that would help end users, reviewers, or regulators make the necessary verifications. Hence, there is a need to capture extra information, or process documentation, describing what actually occurred at execution time. Process documentation is to electronic data what a record of ownership is to a work of art. Provenance-aware applications create process documentation and store it in a provenance store offering long-term persistent, secure storage of process documentation (see Figure 1). This role accommodates a variety of physical deployments; for instance, a provenance store can be a single, autonomous service or (to be more scalable) a federation of distributed stores.

When process documentation is recorded, the provenance of data results can be retrieved by querying the provenance store and analyzed to suit the user’s needs. The provenance store and its contents might also need to be managed, maintained, or curated.
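
The record-then-query life cycle described above can be sketched in a few lines of Python. This is a minimal in-memory illustration under stated assumptions, not the authors' implementation: the names `DocumentationItem`, `ProvenanceStore`, `record`, and `query` are hypothetical, and a real provenance store would add persistence, security, and possibly federation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class DocumentationItem:
    """One piece of process documentation (hypothetical structure)."""
    asserter: str        # service that produced the documentation
    interaction_id: str  # message exchange the documentation is about
    content: str         # description of what occurred at execution time


@dataclass
class ProvenanceStore:
    """In-memory stand-in for a long-term, secure provenance store."""
    _items: Dict[str, List[DocumentationItem]] = field(default_factory=dict)

    def record(self, item: DocumentationItem) -> None:
        # Each service records on its own; the store only appends.
        self._items.setdefault(item.interaction_id, []).append(item)

    def query(self, interaction_id: str) -> List[DocumentationItem]:
        # Return a copy so callers cannot alter stored documentation.
        return list(self._items.get(interaction_id, []))


store = ProvenanceStore()
store.record(DocumentationItem("client", "I1", "sent request"))
store.record(DocumentationItem("server", "I1", "received request"))
```

Keeping `record` append-only reflects the idea that documentation describes what actually occurred and is later retrieved, not rewritten.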

Open Model for Process Documentation

Process documentation for many applications cannot be produced in a single, atomic burst but must be interleaved continuously with execution. This makes it necessary for designers to be able to distinguish a specific item documenting part of a process from the whole process of documentation. We view the former—a p-assertion—as an assertion made by an individual application service involved in the process. Thus, the documentation of a process consists of a set of p-assertions made by the services involved in the process.

In order to minimize its effect on application performance, documentation must be structured so it can be constructed and recorded autonomously by services on a piecemeal basis. Otherwise, should synchronization be required among these services to agree on how and where to document execution, application performance might suffer dramatically. To satisfy this design requirement, we’ve identified various kinds of p-assertions we expect applications to adopt in order to document their execution. Figure 2 outlines a computational service sending and receiving messages and creating p-assertions that describe its involvement in such activity.

In SOAs, interactions consist of messages exchanged between services. By capturing all interactions, one can analyze an execution and verify its validity or compare it with other executions. Therefore, process documentation includes interaction p-assertions, or descriptions of the contents of a message by a service that has sent or received it.

Whether a service returns a result directly or calls other services, the relationship between its outputs and inputs is not generally explicitly represented in the messages themselves but is understood through analysis of the service’s business logic. To promote openness and generality, we make no assumptions about the technology (such as source code and workflow language) used by services to implement their business logic. Rather, we require services to provide information in the form of relationship p-assertions, or descriptions asserted by a service as to how it obtained output data sent in an interaction by applying some function or algorithm to input data from other interactions. (In Figure 2, output message M3 was obtained by applying function f1 to input M1.)
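
As a rough illustration, the two kinds of p-assertions introduced so far could be represented as plain records. The field names and the `"sender"`/`"receiver"` encoding below are assumptions made for this sketch and are not taken from the open specification; the example values mirror the Figure 2 case in which output message M3 is obtained by applying f1 to input M1.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class InteractionPAssertion:
    """A service's description of a message it sent or received."""
    asserter: str        # service making the assertion
    interaction_id: str  # identifies the message exchange
    role: str            # "sender" or "receiver" (hypothetical encoding)
    content: str         # description of the message contents


@dataclass(frozen=True)
class RelationshipPAssertion:
    """A service's claim about how an output was derived from inputs."""
    asserter: str
    output_interaction: str              # interaction carrying the output
    function: str                        # function or algorithm applied
    input_interactions: Tuple[str, ...]  # interactions carrying the inputs


# Figure 2 example: M3 was obtained by applying f1 to M1.
received_m1 = InteractionPAssertion("service", "M1", "receiver", "input data")
sent_m3 = InteractionPAssertion("service", "M3", "sender", "output data")
m3_from_m1 = RelationshipPAssertion("service", "M3", "f1", ("M1",))
```

Neither record refers to source code or a workflow language, which is consistent with the requirement that no assumption be made about how a service implements its business logic.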

With the two kinds of p-assertions—interaction and relationship—process documentation as a whole is greater than the sum of its individual parts. Indeed, while p-assertions are simple pieces of documentation produced by services autonomously, interaction and relationship p-assertions together capture an explicit description of the flow of data in a process. Interaction p-assertions denote data flows between services, whereas relationship p-assertions denote data flows within services. These flows capture the causal and functional data dependencies in execution and, in the most general case, constitute a directed acyclic graph (DAG) (see Figure 3). For a specific data item, the data-flow DAG indicates how it is produced and used and is thus a core element of provenance representation, though not the only one.
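
One way to picture the data-flow graph is as a mapping from each data item to the items it was derived from. The sketch below assumes that dependencies have already been extracted from p-assertions as `(output, input)` pairs; the helper names and the sample items are hypothetical. It uses the standard-library `graphlib` module (Python 3.9 or later) to confirm that the result is acyclic.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, Iterable, Set, Tuple

# (output item, input item): "output was derived from input".
Dependency = Tuple[str, str]


def build_data_flow_graph(deps: Iterable[Dependency]) -> Dict[str, Set[str]]:
    """Map every data item to the set of items it depends on."""
    graph: Dict[str, Set[str]] = {}
    for output_item, input_item in deps:
        graph.setdefault(output_item, set()).add(input_item)
        graph.setdefault(input_item, set())  # sources have no dependencies
    return graph


def is_acyclic(graph: Dict[str, Set[str]]) -> bool:
    """Return True if the dependency graph contains no cycle."""
    try:
        # static_order() raises CycleError when a cycle exists.
        list(TopologicalSorter(graph).static_order())
        return True
    except CycleError:
        return False


# Hypothetical dependencies: M3 derives from M1; M4 from M3 and M2.
graph = build_data_flow_graph([("M3", "M1"), ("M4", "M3"), ("M4", "M2")])
```

Here `graph["M4"]` holds both `"M3"` and `"M2"`, and `is_acyclic(graph)` returns `True` because no item depends on itself, directly or indirectly.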

Beyond the flow of data in a process, internal service states may be needed to understand nonfunctional characteristics of execution (such as the performance or accuracy of services) and therefore the nature of the results they compute. Hence, a service-state p-assertion is documentation provided by a service about its internal state in the context of a specific interaction. Service-state p-assertions are varied; they may include the amount of disk and CPU time used by a service in a computation, the local time when an action occurred, the floating-point precision of the results it produced, or application-specific state descriptions.
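
A service might assemble such a state description while it handles an interaction. The following is a hedged sketch only: the class name, the `run_and_document` helper, the service name, and the dictionary keys are all hypothetical, chosen to mirror the examples listed above (CPU time, local time, floating-point precision).

```python
import sys
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple


@dataclass(frozen=True)
class ServiceStatePAssertion:
    """A service's description of its own state during one interaction."""
    asserter: str
    interaction_id: str
    state: Dict[str, Any]


def run_and_document(
    asserter: str, interaction_id: str, func: Callable[..., Any], *args: Any
) -> Tuple[Any, ServiceStatePAssertion]:
    """Run func and describe the service state in which it ran."""
    started = time.process_time()
    result = func(*args)
    state = {
        "cpu_seconds": time.process_time() - started,      # CPU time used
        "local_time": time.strftime("%Y-%m-%dT%H:%M:%S"),  # local clock
        "float_mantissa_bits": sys.float_info.mant_dig,    # float precision
    }
    return result, ServiceStatePAssertion(asserter, interaction_id, state)


result, assertion = run_and_document("analysis-service", "I7", sum, [0.1, 0.2, 0.3])
```

Because the assertion is tied to an interaction identifier, it can later be related to the data items exchanged in that interaction.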

For provenance-aware applications to be interoperable, the process documentation they produce must be structured according to a shared data model. The novelty of our approach is therefore the openness of the proposed documentation model [7], which is conceived to be independent of application technologies [8]. Together, these characteristics allow process documentation to be produced autonomously by application services and expressed in an open format over which provenance queries may be expressed.

Querying the Provenance of Electronic Data

Provenance queries are user-tailored queries over process documentation aimed at obtaining the provenance of electronic data. In this context, the data item of interest to the user must first be characterized. Since data is mutable, its provenance, or history, can vary according to the point in execution from which a user wishes to find it. A provenance query must therefore be able to identify a data item with respect to a given documented event (such as sending or receiving a message).

The full detail of everything that ultimately caused a data item to be what it is could be quite large; for example, the full provenance of an experiment’s results almost always includes a description of the process that produced the materials in the experiment, along with the provenance of any materials used in producing these materials and the devices and software (and their settings) used in the experiment. Should documentation be available, the full provenance would ultimately include details of processes leading back to the beginning of time or at least to the epoch of provenance awareness.

Users must be able to express the scope of their interest in a process through a provenance query, essentially performing a reverse graph traversal over the data flow DAG and terminating according to the query-specified scope; the query output is a DAG subset. Scoping can be based on types of relationships, intermediary results, services, or subprocesses [7].
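
The scoped reverse traversal can be sketched as a breadth-first walk from the data item of interest back toward its causes, following only the edges the query declares in scope. This is an illustrative sketch, not the query interface of the specification [7]; the function name, the graph encoding, and the sample relationship types are assumptions.

```python
from collections import deque
from typing import Callable, Dict, List, Tuple

# item -> list of (relationship type, item it was derived from)
Graph = Dict[str, List[Tuple[str, str]]]


def provenance_query(
    graph: Graph, start: str, in_scope: Callable[[str, str], bool]
) -> Graph:
    """Walk backward from start, keeping only in-scope edges.

    Returns the visited subset of the graph; traversal stops wherever
    the scope predicate rejects an edge.
    """
    result: Graph = {}
    queue = deque([start])
    seen = {start}
    while queue:
        item = queue.popleft()
        kept = [(rel, src) for rel, src in graph.get(item, []) if in_scope(rel, src)]
        result[item] = kept
        for _, src in kept:
            if src not in seen:
                seen.add(src)
                queue.append(src)
    return result


# Hypothetical documentation of an experiment.
graph: Graph = {
    "result": [("computed_from", "intermediate"), ("logged_by", "audit")],
    "intermediate": [("computed_from", "raw")],
    "raw": [("measured_by", "instrument")],
}

# Scope the query by relationship type: follow only "computed_from" edges.
subgraph = provenance_query(graph, "result", lambda rel, src: rel == "computed_from")
```

With this scope the query returns `result`, `intermediate`, and `raw`, and leaves out `audit` and `instrument`. A different predicate could instead stop at particular services or intermediary results.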

In Health Care Management

To illustrate our approach, we explore a health care management application. The Organ Transplant Management (OTM) system under development by the Catalan Transplant Organization, Catalonia, Spain, manages all the activities pertaining to organ transplants across multiple Catalan hospitals and their regulatory authority, the government of Catalonia, Spain [1]. OTM consists of a complex process involving the surgery itself, along with such activities as data collection and patient organ analysis that must comply with a set of regulatory rules. OTM is supported by an IT infrastructure that maintains records that allow medical personnel to view (and edit) a given patient’s local file within a given institution or laboratory. However, the system does not yet connect records or capture the dependencies among them or allow external auditors or patients’ families to analyze or understand how decisions are made.

Making OTM provenance-aware supports powerful queries that would otherwise be impossible (such as find all doctors involved in a decision, find all blood-test results involved in a donation decision, and find all data that led to a decision). Such functionality can be made available not only to the medical profession but also to regulators and families.

Here, we limit ourselves to a simplified subset of the OTM workflow—the process leading to the decision of whether or not to donate an organ. As a hospitalized patient’s health declines and in anticipation of a potential organ donation, an attending doctor requests the full health record for the patient and sends a blood sample for analysis. Through a context-sensitive menu-driven user interface (UI), the attending doctor submits the requests that are then passed to a software component (the donor data collector) responsible for collecting all expected results. If brain death is observed and logged into the system and if all requested data and analysis results are obtained, the system asks the doctor to decide about the donation of an organ. The decision, or the outcome of the doctor’s medical judgment based on the collected data, is explained in a report submitted by the doctor as the decision’s justification.

Figure 3 (top) outlines the components involved in this scenario and their interactions. The UI sends requests (I1, I2, I3) to the donor data collector service, which gets data from the patient records database (I4, I5), along with analysis results from the laboratory (I6, I7), and finally requests a decision (I8, I9).

To make OTM provenance-aware, designers are augmenting OTM with the ability to produce an explicit representation of the process taking place, including p-assertions for all interactions (I1 through I9), relationship p-assertions capturing dependencies between data items, and state p-assertions. Figure 3 (bottom) outlines the DAG representing a donation decision’s provenance, which consists of relationship p-assertions produced by provenance-aware OTM. DAG nodes denote data items, whereas DAG edges (in blue) represent relationships (such as data dependencies, like “is based on” and “is justified by,” and causal relationships, like “in response to” and “is caused by”). Each data item is annotated by the interaction in which it occurs. Further, the UI asserts a service-state p-assertion for each of its interactions about the users logged into the system.

Authorized users can then issue provenance queries that navigate the provenance graph, pruning it according to the querier’s needs; for example, from the graph, we can derive that users X and Y are both causing a donation decision to be reached. Figure 3 includes only a limited number of components, but in real-life examples involving vast amounts of documentation, users—doctors, patients, or regulatory authorities—benefit from a powerful and accurate provenance-query facility.
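
A query such as "find all users involved in a decision" can be pictured as a traversal that collects the logged-in user recorded for each interaction it reaches. The sketch below is illustrative only: the edges, the mapping of data items to interactions, and the assignment of users X and Y are assumptions made for the example and are not a transcription of Figure 3.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical relationship p-assertions: item -> [(relationship, source item)].
EDGES: Dict[str, List[Tuple[str, str]]] = {
    "donation decision": [
        ("is justified by", "decision report"),
        ("is based on", "patient record"),
        ("is based on", "blood test result"),
    ],
    "decision report": [("in response to", "decision request")],
    "decision request": [("is caused by", "brain death notification")],
    "patient record": [("in response to", "record request")],
    "blood test result": [("in response to", "analysis request")],
}

# Hypothetical annotation of UI-originated items with their interactions.
INTERACTION_OF: Dict[str, str] = {
    "record request": "I1",
    "analysis request": "I2",
    "brain death notification": "I3",
}

# Hypothetical service-state p-assertions by the UI: interaction -> user.
LOGGED_IN_USER: Dict[str, str] = {"I1": "X", "I2": "X", "I3": "Y"}


def users_involved(item: str) -> Set[str]:
    """Collect users logged in for any interaction the item depends on."""
    users: Set[str] = set()
    visited: Set[str] = set()
    stack = [item]
    while stack:
        current = stack.pop()
        if current in visited:
            continue
        visited.add(current)
        interaction = INTERACTION_OF.get(current)
        if interaction in LOGGED_IN_USER:
            users.add(LOGGED_IN_USER[interaction])
        stack.extend(src for _, src in EDGES.get(current, []))
    return users
```

Under these assumed p-assertions, `users_involved("donation decision")` yields both X and Y, while `users_involved("patient record")` yields only X.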

Existing Systems

The approach we’ve explored here is derived from an extensive requirement analysis [8] that resulted in a complete architectural specification [7] used as the basis for writing an open specification of data models and interfaces. The open approach allows the documentation of complex distributed applications, possibly involving multiple technologies (such as Web services, command-line executables, and monolithic executables). It also allows the expression of complex provenance queries to identify data and scoping processes independent of the technologies being used.

The Virtual Data System [4] and myGrid [10] are execution environments for scientific workflows that provide support for provenance. They focus on producing documentation from a workflow enactor’s viewpoint using data models compatible with p-assertions. They assume their respective workflow language, allowing them to obtain compact process documentation. By adopting an open data model for process documentation, like the one we’ve advocated here, such systems could be integrated into heterogeneous applications that seamlessly execute provenance queries.

The database community has also investigated provenance [2, 5] but adopted different assumptions; for instance, it assumes the existence of a query language for which queries may be reversed to identify the origin of results. As in our approach, different kinds of provenance (such as why and where [2]) are viewed as being of value as specific instances of provenance queries.

The Provenance Aware Storage System developed at Harvard University [9] is designed to automatically produce documentation of execution by capturing file system events in an operating system. As with all other approaches, capturing small-grain documentation involves scalability and performance challenges, and deriving information at a suitable level of abstraction for the user is often difficult.

Conclusion

The IT landscape, which once exclusively involved closed monolithic applications, today involves applications that are open and composed dynamically while being able to discover results and services on the fly. Users must be able to decide whether they have confidence in their applications’ electronic data; the data must therefore be accompanied by its provenance, which describes the process that led to its production.

To achieve this vision, we’ve proposed an open approach through which applications, irrespective of technology, document their execution in an open data model that can then be used to run provenance queries tailored to user needs. In the same way scholars can appreciate works of art by studying their documented history, users would be able to gain confidence in electronic data thanks to provenance queries.

Figures

Figure 1. Provenance life cycle.

Figure 2. Categories of p-assertions made by a computational service.

Figure 3. Provenance directed acyclic graph of a donation decision.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

DOI

10.1145/1330311.1330323

April 2008 Issue

Published: April 1, 2008

Vol. 51 No. 4

Pages: 52-58
