Security and Privacy Viewpoint

Personal Data and the Internet of Things

It is time to care about digital provenance.

By Thomas Pasquier, David Eyers, and Jean Bacon

Posted Jun 1 2019

We have all read market predictions describing billions of devices and the hundreds of billions of dollars in profit that the Internet of Things (IoT) promises.^a Security and the challenges it represents²⁷ are often highlighted as major issues for IoT, alongside scalability and standardization. In 2017, FBI Director James Comey warned, during a Senate hearing, of the threat posed by a botnet taking control of devices owned by unsuspecting users. Such a botnet can seize control of devices ranging from connected dishwashers^b to smart home cameras and connected toys, not only using them as a platform to launch cyber-attacks, but also potentially harvesting the data such devices collect.

In addition to concerns about cyber-security, corporate usage of personal data has seen increased public scrutiny. A recent focus of concern has been connected home hubs (such as Amazon Alexa and Google Home).^c Articles on the topic discussed whether conversations were being constantly recorded and, if so, where those records went. Similarly, the University of Rennes faced a public backlash after revealing its plan to deploy smart beds in its student accommodation to detect "abnormal" usage patterns.^d A clear question emerges from IoT-related fears: "How and why is my data being used?"

As concerns grow, legislators across the world are taking action to protect the public. For example, the EU General Data Protection Regulation (GDPR), which took effect in May 2018,^e and the forthcoming ePrivacy Regulation^f place strong responsibility on data controllers to protect personal data and to notify users of security breaches. The European Commission defines a data controller as the party that determines the purposes for which, and the means by which, personal data is processed (why and how the data is processed). EU regulations further constrain the processing of EU citizens’ data based on location and data type (for example, "special category" data falls under more stringent constraints). The data controller must provide means for end users to determine whether their data is properly handled and to exercise their rights. Overall, there must be mechanisms to determine what data is processed, how, why, and where.

Such concerns have drawn researchers to look at means to develop more accountable and transparent systems.¹⁰,²⁴ The problem has also been clearly highlighted by the EU Data Protection Working Party: "As a result of the need to provide pervasive services in an unobtrusive manner, users might in practice find themselves under third-party monitoring. This may result in situations where the user can lose all control on the dissemination of his/her data, depending on whether or not the collection and processing of this data will be made in a transparent manner or not."

Modern computing systems contain many components that operate as black boxes; they accept inputs and generate outputs but do not disclose their internal workings. Beyond privacy concerns, this also limits the ability to detect cyber-attacks, or more generally to understand cyber-behavior. Because of these concerns, DARPA in the U.S. launched the Transparent Computing project^g to explore means to build more transparent systems through the use of digital provenance, with the particular aim of identifying advanced persistent threats. While DARPA’s work is a good start, we believe there is an urgent need to reach much further. In the remainder of this Viewpoint, we explore how provenance can be an answer to some IoT concerns and the challenges in deploying provenance techniques.

Digital Provenance

There is a growing clamor for more transparency, but straightforward, widespread technical solutions have yet to emerge. Typical software log records often prove insufficient to audit complex distributed systems, as they fail to capture the causality relationships between events. Digital provenance⁸ is an alternative means to record system events. It is the record of information flow within a computer system, kept in order to assess the origin of data and, from that origin, properties such as its quality or validity.

The concept first emerged in the database research community as a means to explain the response to a given query.¹⁶ Provenance research later expanded to address issues of scientific reproducibility, notably by providing mechanisms to reconstitute computational environments from formal records of scientific computations.²³ More recently, provenance has been explored within the cybersecurity community²⁵ as a means to explain intrusions¹⁸ and, later still, to detect them.¹⁴

Provenance records are represented as a directed acyclic graph that shows causality relationships between the states of the objects that compose a complex system. As a consequence, they are compatible with automated mathematical reasoning. In such a graph, the vertices represent the state of transient and persistent data items, transformations applied to those states, and persons (legal or natural) responsible for data and transformations (generally referred to as entities, activities, and agents respectively). The edges represent dependencies between these vertices. The analysis of such a graph allows us to understand where, when, how, by whom, and why data has been used.⁷,⁹
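As a minimal sketch of the graph model just described, the following Python builds a tiny provenance graph and walks its dependency edges to ask where a piece of data came from and who was responsible for it. The vertex names (`thermostat_reading`, `aggregate`, and so on), the relation labels, and the helper `ancestors` are hypothetical illustrations of the entity/activity/agent vocabulary; they are not taken from any particular provenance system.

```python
from collections import defaultdict

# Vertex kinds follow the entity / activity / agent vocabulary.
# All names here are hypothetical, for illustration only.
KINDS = {
    "thermostat_reading": "entity",
    "aggregate": "activity",
    "daily_summary": "entity",
    "share": "activity",
    "shared_report": "entity",
    "energy_app": "agent",
    "insurer": "agent",
}

# Each edge points from an effect to something it depends on,
# so following edges walks backward through the data's history.
EDGES = [
    ("aggregate", "used", "thermostat_reading"),
    ("daily_summary", "wasGeneratedBy", "aggregate"),
    ("aggregate", "wasAssociatedWith", "energy_app"),
    ("share", "used", "daily_summary"),
    ("shared_report", "wasGeneratedBy", "share"),
    ("share", "wasAssociatedWith", "insurer"),
]


def build_graph(edges):
    """Index edges by source vertex as an adjacency list."""
    graph = defaultdict(list)
    for src, relation, dst in edges:
        graph[src].append((relation, dst))
    return graph


def ancestors(graph, start):
    """Return every vertex that `start` transitively depends on."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for _relation, parent in graph.get(node, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


graph = build_graph(EDGES)
lineage = ancestors(graph, "shared_report")
responsible = sorted(v for v in lineage if KINDS[v] == "agent")
sources = sorted(v for v in lineage if KINDS[v] == "entity")
print("agents:", responsible)
print("source entities:", sources)
```

Because the graph is acyclic and every edge points backward in time, a simple reachability query of this kind is enough to answer "by whom" and "from what" for any data item.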

An outcome of research on provenance in the cybersecurity space is the understanding that the capture mechanism must provide guarantees of completeness (all events in the system can be seen), accuracy (the record is faithful to events) and a well-defined, trusted computing base (the threat model is clearly expressed).²² Otherwise, attacks on the system may go undetected, be concealed by the attacker, or be misattributed. We argue that in a highly ad hoc and interoperable environment with mutually untrusted parties, the provenance used to empower end users with control and understanding over data usage requires similar properties.

Who to Trust?

In the IoT environment, the number of stakeholders involved has the potential to grow dramatically. Traditionally, a company managed its own server infrastructure, maybe with the help of a subcontractor. The cloud computing paradigm increased complexity with the involvement of cloud service providers (sometimes stacked, for example, Heroku PaaS on top of the Amazon IaaS cloud service), third-party service providers (for example, Cloud-MQTT) and other tenants sharing the infrastructure. The IoT further increases this complexity, with potentially ad hoc and unforeseen interactions between devices and services on top of the complex cloud and edge computing infrastructure most IoT services rely on.

One answer to this problem is to build applications in "silos" where the involved parties are known in advance, but this has the side effect of locking devices and services in to a single company (for example, the competing smart-home offerings by leading technology companies). This is far from the IoT vision of a connected environment, but most existing products fall into this category. There are obviously major business considerations behind this model, and it should be noted that the EU GDPR mandates some form of interoperability (although it is as yet unclear how this should be interpreted¹²).

An alternative to such "lock-in" would be to make devices’ consumption of data transparent and accountable. If data is exchanged across devices, the concerned user should be able to audit its usage. However, in an environment where arbitrary devices could interact (although it must be remembered that EU GDPR requires explicit and informed user consent), how can trust be established in the audit record? This requires an in-depth rethinking of how IoT platforms are designed, potentially exploring the security-by-design approach based on hardware roots of trust¹³ to provide trusted digital enclaves in which behavior can be audited. Some form of "accountability-by-design" principle should also be encouraged, where transparency and the implementation of a trustworthy audit mechanism are core concerns in product design.

Such solutions have been explored in the provenance space, for example, by leveraging SGX properties to provide a strong guarantee of the integrity of the provenance record.⁴ Similarly, remote attestation techniques leveraging TPM hardware have been proposed⁶ to guarantee the integrity of the capture mechanism. However, how to provide such guarantees in an IoT environment, where such hardware features may not be available, is a relatively unexplored topic.

Where Does the Audit Live?

The fully realized IoT vision is of vast distributed and decentralized systems. If we assume trustworthy provenance capture is achievable, the issue of guaranteeing that the provenance record can be audited remains. If you are to audit the processing of personal data, guarantees about the integrity and availability of the provenance record must exist. If you agreed to share your daily activity for research, the activities of insurance companies scraping your data for possible health risks must not be able to masquerade as benign research use, nor should data collection for political purposes be able to pass as harmless entertainment, as in the Cambridge Analytica scandal.^h Similarly, the availability (durability) of the audit record must be guaranteed. There is no point to an audit record if it can simply be deleted.
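One generic way to make tampering with an audit record detectable, sketched below under stated assumptions, is to chain records with a cryptographic hash so that altering or removing an earlier record invalidates every later one. This is a plain software technique shown for illustration; it is not the hardware-backed approach cited earlier, and it addresses integrity only, not availability. The record fields and the function names `append_record` and `verify_chain` are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first record


def _digest(prev_hash, payload):
    # Canonical JSON so the same payload always hashes identically.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_record(log, payload):
    """Append a record whose hash covers its payload and its predecessor."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({
        "payload": payload,
        "prev": prev_hash,
        "hash": _digest(prev_hash, payload),
    })


def verify_chain(log):
    """Recompute every hash; any edit or mid-log deletion breaks the chain."""
    prev_hash = GENESIS
    for record in log:
        if record["prev"] != prev_hash:
            return False
        if record["hash"] != _digest(prev_hash, record["payload"]):
            return False
        prev_hash = record["hash"]
    return True


# Hypothetical audit entries, for illustration only.
log = []
append_record(log, {"actor": "research_lab", "action": "read", "data": "activity"})
append_record(log, {"actor": "insurer", "action": "read", "data": "activity"})
append_record(log, {"actor": "research_lab", "action": "aggregate", "data": "activity"})
print(verify_chain(log))

# Relabel the insurer's access as benign research use.
tampered = [dict(r, payload=dict(r["payload"])) for r in log]
tampered[1]["payload"]["actor"] = "research_lab"
print(verify_chain(tampered))

# Delete the insurer's record from the middle of the log.
truncated = log[:1] + log[2:]
print(verify_chain(truncated))
```

Note the limits of this sketch: dropping the most recent records, or the entire log, is not detected unless the latest hash is also held somewhere the log holder cannot rewrite, which is exactly the availability problem raised above.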

Further, Moyer et al. evaluated the storage requirements of provenance when used for security purposes in relatively modest distributed systems.²¹ In such a context, several thousand graph elements can be generated per second and per machine, resulting in a graph containing billions of nodes to represent system execution over several months. It is unclear how some past research outcomes, for example, detection of suspicious behavior,² privacy-aware provenance¹¹ or provenance integrity,¹⁵ scale to very large graphs, as such concerns were not evaluated. Similarly, while blockchain is heralded¹⁹ as an integrity-preserving means to store provenance, it is unclear how well it could scale to such sizes. Several options have been explored to reduce graph size, such as identifying and tracking only sensitive data objects⁵ or performing property-preserving graph compression;¹⁷ however, none has yet adequately addressed the scalability challenge.
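To make the scale concrete, here is a back-of-envelope calculation. The rate of 2,000 graph elements per second, the 90-day window, and the 100 bytes per element are illustrative assumptions chosen to match the "several thousand per second" and "several months" orders of magnitude above; they are not figures reported by Moyer et al.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def graph_elements(rate_per_second, days, machines=1):
    """Total provenance graph elements produced at a constant rate."""
    return rate_per_second * SECONDS_PER_DAY * days * machines


# Assumed figures, for illustration only.
rate = 2_000             # graph elements per second, per machine
days = 90                # roughly three months
bytes_per_element = 100  # assumed average storage cost per element

total = graph_elements(rate, days)
print(f"{total:,} elements per machine")  # 15,552,000,000
print(f"{total * bytes_per_element / 1e12:.2f} TB per machine")
```

Under these assumptions a single machine already produces over fifteen billion elements in three months, which is why analyses designed for small graphs cannot simply be assumed to carry over.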

How to Communicate Information?

Means must be developed to communicate about data usage, but also about the risks of inference from the data. Not only must the nature of the data be considered, but also other properties such as the frequency of capture.³ For example, a 100 Hz smart-meter reading can in some cases indicate what television channel is currently being watched; even a daily average reading could reveal occupancy. Here, it is important to be able to explore and represent the outcome of complex computational workflows.¹
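The point that even coarse data supports inference can be illustrated with a small sketch. The readings below are synthetic and invented purely for illustration, and the baseline load, the threshold `margin`, and the function names are hypothetical; this is not a real occupancy-detection method.

```python
from statistics import mean


def daily_average(hourly_watts):
    """Reduce 24 hourly readings to a single daily figure."""
    return mean(hourly_watts)


def looks_occupied(hourly_watts, baseline_watts, margin=1.5):
    """Guess occupancy from the daily average alone."""
    return daily_average(hourly_watts) > baseline_watts * margin


# Synthetic readings, invented for illustration only.
BASELINE = 100  # assumed always-on load (for example, a fridge), in watts
vacant_day = [BASELINE] * 24
occupied_day = (
    [BASELINE] * 7      # night
    + [900] * 2         # morning activity
    + [BASELINE] * 8    # home empty during the day
    + [1200] * 5        # evening activity
    + [BASELINE] * 2    # late evening
)

print(looks_occupied(vacant_day, BASELINE))
print(looks_occupied(occupied_day, BASELINE))
```

Here a single number per day is enough to separate the two cases, so reducing the capture frequency lowers the inference risk without removing it.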

Provenance visualization has been an active research topic for over a decade, yet no fully satisfactory solution has been proposed. The simplest possible visualization is to render the graph; however, beyond trivially simple graphs, such a representation is too complex and dense to be easily understood, even by experts. We go further and suggest that how interpretable such information is for end users also depends on educational background, socioeconomic environment, and culture.

To make the accountability and transparency of IoT platforms effective, a better communication medium must be provided. An approach often taken is to analyze motifs in the graph to extract high-level abstractions that are meaningful to the average end user (for example, Missier et al.²⁰). Recent work proposed representing such a high-level abstraction as a comic strip.²⁶
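As a rough sketch of what motif-based abstraction could look like, the following Python collapses each "activity used an input, generated an output, and was associated with an agent" pattern into one plain-language sentence. It is a toy illustration under stated assumptions, not the method of Missier et al. or of the comic-strip work; the edge triples, relation labels, and the function `summarize` are hypothetical.

```python
# (subject, relation, object) triples; all names are hypothetical.
EDGES = [
    ("aggregate", "used", "meter_readings"),
    ("daily_summary", "wasGeneratedBy", "aggregate"),
    ("aggregate", "wasAssociatedWith", "energy_app"),
    ("share", "used", "daily_summary"),
    ("report", "wasGeneratedBy", "share"),
    ("share", "wasAssociatedWith", "insurer"),
]


def summarize(edges):
    """Collapse each used / wasGeneratedBy / wasAssociatedWith motif
    around an activity into one plain-language sentence."""
    inputs, outputs, agents = {}, {}, {}
    for subject, relation, obj in edges:
        if relation == "used":
            inputs.setdefault(subject, []).append(obj)
        elif relation == "wasGeneratedBy":
            outputs.setdefault(obj, []).append(subject)
        elif relation == "wasAssociatedWith":
            agents.setdefault(subject, []).append(obj)

    sentences = []
    for activity in inputs:
        # Only activities matching the full three-edge motif are summarized.
        if activity in outputs and activity in agents:
            sentences.append(
                f"{', '.join(agents[activity])} produced "
                f"{', '.join(outputs[activity])} from "
                f"{', '.join(inputs[activity])}"
            )
    return sentences


for sentence in summarize(EDGES):
    print(sentence)
```

Six edges become two sentences a non-expert can read, at the cost of hiding detail; choosing which motifs to collapse, and for which audience, is the open design question.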

We Need to Care About Digital Provenance

Building transparent and auditable systems may be one of the greatest software engineering challenges of the coming decade. As a consequence, digital provenance, applied to cybersecurity and the management of personal data, has become a hot research topic. We have highlighted key active areas of research and their associated challenges. It is essential for industry practitioners to understand the threat posed by the black-box nature of the IoT, the potential solutions, and the challenges to a practical deployment of those solutions. Accountability-by-design must become a core objective of IoT platforms.

DOI: 10.1145/3322933

Published: June 1, 2019, in the June 2019 issue (Vol. 62 No. 6, Pages 32-34)
