Opinion
Security and Privacy Viewpoint

Personal Data and the Internet of Things

It is time to care about digital provenance.

We have all read market predictions describing billions of devices and the hundreds of billions of dollars in profit that the Internet of Things (IoT) promises. Security and the challenges it represents [27] are often highlighted as major issues for the IoT, alongside scalability and standardization. In 2017, FBI Director James Comey warned, during a Senate hearing, of the threat represented by a botnet taking control of devices owned by unsuspecting users. Such a botnet can seize control of devices ranging from connected dishwashers to smart home cameras and connected toys, not only using them as a platform to launch cyber-attacks, but also potentially harvesting the data such devices collect.

In addition to concerns about cyber-security, corporate usage of personal data has seen increased public scrutiny. A recent focus of concern has been connected home hubs (such as Amazon Alexa and Google Home). Articles on the topic discussed whether conversations were being constantly recorded and, if so, where those records went. Similarly, the University of Rennes faced a public backlash after revealing its plan to deploy smart beds in its accommodation to detect "abnormal" usage patterns. A clear question emerges from IoT-related fears: "How and why is my data being used?"

As concerns grow, legislators across the world are taking action to protect the public. For example, the EU General Data Protection Regulation (GDPR), which took effect in May 2018, and the forthcoming ePrivacy Regulation place strong responsibility on data controllers to protect personal data and to notify users of security breaches. The European Commission defines a data controller as the party that determines the purposes for which, and the means by which, personal data is processed (why and how the data is processed). EU regulations further constrain the processing of EU citizens’ data based on location and data type (that is, "special category" data falls under more stringent constraints). The data controller must provide means for end users to determine whether their data is properly handled and means to exercise their rights. Overall, there must be mechanisms to determine what data is processed, how, why, and where.

Such concerns have drawn researchers to look at means to develop more accountable and transparent systems [10, 24]. The problem has also been clearly highlighted by the EU Data Protection Working Party: "As a result of the need to provide pervasive services in an unobtrusive manner, users might in practice find themselves under third-party monitoring. This may result in situations where the user can lose all control on the dissemination of his/her data, depending on whether or not the collection and processing of this data will be made in a transparent manner or not."

Modern computing systems contain many components that operate as black boxes; they accept inputs and generate outputs but do not disclose their internal workings. Beyond privacy concerns, this also limits the ability to detect cyber-attacks, or more generally to understand cyber-behavior. Because of these concerns, DARPA in the U.S. launched the Transparent Computing project to explore means to build more transparent systems through the use of digital provenance, with the particular aim of identifying advanced persistent threats. While DARPA’s work is a good start, we believe there is an urgent need to reach much further. In the remainder of this Viewpoint, we explore how provenance can be an answer to some IoT concerns and the challenges faced in deploying provenance techniques.

Digital Provenance

There is a growing clamor for more transparency, but straightforward, widespread technical solutions have yet to emerge. Typical software log records often prove insufficient to audit complex distributed systems, as they fail to capture the complex causality relationships between events. Digital provenance [8] is an alternative means to record system events. It is the record of information flow within a computer system, kept in order to assess the origin of data (for example, its quality or its validity).

The concept first emerged in the database research community as a means to explain the response to a given query [16]. Provenance research later expanded to address issues of scientific reproducibility, notably by providing mechanisms to reconstitute computational environments from formal records of scientific computations [23]. More recently, provenance has been explored within the cybersecurity community [25] as a means to explain intrusions [18] and, later still, to detect them [14].

Provenance records are represented as a directed acyclic graph that shows causality relationships between the states of the objects that compose a complex system. As a consequence, they are amenable to automated mathematical reasoning. In such a graph, the vertices represent the states of transient and persistent data items, the transformations applied to those states, and the persons (legal or natural) responsible for data and transformations (generally referred to as entities, activities, and agents, respectively). The edges represent dependencies between these vertices. The analysis of such a graph allows us to understand where, when, how, by whom, and why data has been used [7, 9].
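To make the graph model concrete, here is a minimal sketch of a provenance graph with the three vertex kinds named above and a query that walks dependencies back to their origins. The class name, method names, and the smart-meter example are hypothetical illustrations, not an API from any of the cited systems.

```python
from collections import defaultdict


class ProvenanceGraph:
    """Toy provenance graph; vertices are entities, activities, or agents."""

    KINDS = {"entity", "activity", "agent"}

    def __init__(self):
        self.kind = {}                      # vertex id -> vertex kind
        self.depends_on = defaultdict(set)  # vertex id -> vertices it depends on

    def add_vertex(self, vid, kind):
        if kind not in self.KINDS:
            raise ValueError(f"unknown vertex kind: {kind}")
        self.kind[vid] = kind

    def add_dependency(self, src, dst):
        """Record that src depends on dst (edges point back towards origins)."""
        if src not in self.kind or dst not in self.kind:
            raise KeyError("both vertices must be added first")
        # Refuse any edge that would break the acyclic property.
        if src == dst or src in self.ancestors(dst):
            raise ValueError("edge would create a cycle")
        self.depends_on[src].add(dst)

    def ancestors(self, vid):
        """Return every vertex that vid transitively depends on."""
        seen, stack = set(), [vid]
        while stack:
            for parent in self.depends_on[stack.pop()]:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


# Hypothetical example: a daily average derived from raw smart-meter readings.
g = ProvenanceGraph()
g.add_vertex("raw_reading", "entity")
g.add_vertex("daily_average", "entity")
g.add_vertex("aggregate", "activity")
g.add_vertex("energy_provider", "agent")
g.add_dependency("aggregate", "raw_reading")      # the activity used the entity
g.add_dependency("daily_average", "aggregate")    # the entity was generated by the activity
g.add_dependency("aggregate", "energy_provider")  # the activity is attributed to the agent

origins = g.ancestors("daily_average")
```

Here `origins` contains the raw reading, the aggregation step, and the responsible party, which is the kind of "where, how, and by whom" answer the graph is meant to support.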

An outcome of research on provenance in the cybersecurity space is the understanding that the capture mechanism must provide guarantees of completeness (all events in the system can be seen), accuracy (the record is faithful to events), and a well-defined, trusted computing base (the threat model is clearly expressed) [22]. Otherwise, attacks on the system may go undetected, be concealed by the attacker, or be misattributed. We argue that in a highly ad hoc and interoperable environment with mutually untrusted parties, the provenance used to empower end users with control and understanding over data usage requires similar properties.

Who to Trust?

In the IoT environment, the number of stakeholders involved has the potential to grow explosively. Traditionally, a company managed its own server infrastructure, maybe with the help of a subcontractor. The cloud computing paradigm further increased complexity with the involvement of cloud service providers (sometimes stacked, for example, Heroku PaaS on top of the Amazon IaaS cloud service), third-party service providers (for example, Cloud-MQTT), and other tenants sharing the infrastructure. The IoT increases this complexity again, with potentially ad hoc and unforeseen interactions between devices and services on top of the complex cloud and edge computing infrastructure most IoT services rely on.

One answer to this problem is to build applications in "silos" where the involved parties are known in advance, with the side effect of locking devices and services in to a single company (for example, the competing smart-home offerings by leading technology companies). This is far from the IoT vision of a connected environment, but most existing products fall into this category. There are obviously major business considerations behind this model, and it should be noted that the EU GDPR mandates some form of interoperability (although it is as yet unclear how this should be interpreted [12]).

An alternative to such "lock-in" would be to make devices’ consumption of data transparent and accountable. If data is exchanged across devices, the concerned user should be able to audit its usage. However, in an environment where arbitrary devices could interact (although it must be remembered that the EU GDPR requires explicit and informed user consent), how can trust be established in the audit record? This requires an in-depth rethinking of how IoT platforms are designed, potentially exploring a security-by-design approach based on hardware roots of trust [13] to provide trusted digital enclaves in which behavior can be audited. Some form of "accountability-by-design" principle should also be encouraged, where transparency and the implementation of a trustworthy audit mechanism are core concerns in product design.

Such solutions have been explored in the provenance space, for example, by leveraging SGX properties to provide a strong guarantee of the integrity of the provenance record [4]. Similarly, remote attestation techniques leveraging TPM hardware have been proposed [6] to guarantee the integrity of the capture mechanism. However, how to provide such guarantees in an IoT environment, where such hardware features may not be available, is a relatively unexplored topic.

Where Does the Audit Live?

The fully realized IoT vision is of vast distributed and decentralized systems. If we assume trustworthy provenance capture is achievable, the issue of guaranteeing that the provenance record can be audited remains. If you are to audit the processing of personal data, guarantees about the integrity and availability of the provenance record must exist. If you agreed to share your daily activity for research, the activities of insurance companies scraping your data for possible health risks must not be able to masquerade as benign research use, nor should data collection for political purposes be able to pass as harmless entertainment, as in the Cambridge Analytica scandal. Similarly, the availability (durability) of the audit record must be guaranteed. There is no point in keeping an audit record if it can simply be deleted.
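One common way to make an audit record tamper-evident is a hash chain, in which each record's digest covers the digest of the record before it. The sketch below is only an illustration of that general idea, not the mechanism of any system cited here; the function names, the record fields, and the `GENESIS` placeholder are all assumptions made for the example.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder digest that precedes the first record


def _digest(prev_digest, payload):
    """Hash the previous digest together with a canonical form of the payload."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_digest.encode("ascii") + body).hexdigest()


def append_record(log, payload):
    """Append a payload, chaining it to the digest of the last record."""
    prev = log[-1]["digest"] if log else GENESIS
    log.append({"payload": payload, "digest": _digest(prev, payload)})


def verify(log, expected_head=None):
    """Recompute the chain; optionally compare the final digest to a trusted copy."""
    prev = GENESIS
    for record in log:
        if record["digest"] != _digest(prev, record["payload"]):
            return False
        prev = record["digest"]
    return expected_head is None or prev == expected_head


# Hypothetical audit trail for shared activity data.
log = []
append_record(log, {"activity": "read", "data": "daily_activity", "purpose": "research"})
append_record(log, {"activity": "share", "data": "daily_activity", "purpose": "research"})
head = log[-1]["digest"]  # assumed to be held by the user or another independent party

ok_before = verify(log, head)
log[0]["payload"]["purpose"] = "insurance"  # an attempt to rewrite history
ok_after = verify(log, head)
```

In this sketch an edited record no longer matches its stored digest, and dropping the most recent records is only detectable because the head digest is kept somewhere the log holder cannot alter. A chain of this kind addresses integrity only; it does nothing for availability, since the whole log can still be deleted.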

Further, Moyer et al. evaluated the storage requirements of provenance when used for security purposes in relatively modest distributed systems [21]. In such a context, several thousand graph elements can be generated per second and per machine, resulting in a graph containing billions of nodes to represent system execution over several months. It is unclear how some past research outcomes, for example, detection of suspicious behavior [2], privacy-aware provenance [11], or provenance integrity [15], scale to very large graphs, as such concerns were not evaluated. Similarly, while blockchain is heralded [19] as an integrity-preserving means to store provenance, it is unclear how well it could scale to such sizes. Several options have been explored to reduce graph size, such as identifying and tracking only sensitive data objects [5] or performing property-preserving graph compression [17]; however, none has yet adequately addressed the scalability challenge.
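A back-of-the-envelope calculation shows how quickly such a rate reaches the billions. The figures below are illustrative assumptions, not measurements from the cited study: 2,000 elements per second stands in for "several thousand", 90 days for "several months", and the count covers graph elements in general rather than nodes specifically.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def graph_elements(rate_per_second, machines, days):
    """Total graph elements produced at a constant capture rate."""
    return rate_per_second * machines * days * SECONDS_PER_DAY


# Assumed figures; see the paragraph above.
single = graph_elements(rate_per_second=2_000, machines=1, days=90)
cluster = graph_elements(rate_per_second=2_000, machines=10, days=90)

print(f"one machine:  {single:,}")   # 15,552,000,000
print(f"ten machines: {cluster:,}")  # 155,520,000,000
```

Under these assumptions a single machine already produces about 15.5 billion elements in three months, and the total grows linearly with the number of machines.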

How to Communicate Information?

Means must be developed to communicate about data usage, but also about the risks of inference from the data. Not only must the nature of the data be considered, but also other properties such as the frequency of capture [3]. For example, a 100 Hz smart-meter reading can in some cases indicate what television channel is currently being watched; even a daily average reading could inform about occupancy. Here, it is important to be able to explore and represent the outcome of complex computational workflows [1].

Provenance visualization has been an active research topic for over a decade, yet no fully satisfactory solution has been proposed. The simplest possible visualization is to render the graph; however, beyond trivially simple graphs, such a representation is too complex and dense to be easily understood, even by experts. We go further and suggest that how interpretable such information is for end users also depends on educational background, socioeconomic environment, and culture.

To make the accountability and transparency of IoT platforms effective, a better communication medium must be provided. An approach often taken is to analyze motifs in the graph to extract high-level abstractions that are meaningful to the average end user (for example, Missier et al. [20]). Recent work has proposed representing such high-level abstractions as a comic strip [26].
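As a rough illustration of what abstracting a graph can mean, the sketch below collapses a chosen group of vertices into a single summary vertex and re-attaches the edges that cross the group's boundary. It is a generic simplification rather than the method of the cited tools; the function name and the example pipeline are hypothetical, and the sketch does not check that the collapsed graph is still acyclic, which a real tool would need to do.

```python
def collapse(edges, group, label):
    """Replace every vertex in `group` with the single vertex `label`.

    `edges` is a set of (source, target) dependency pairs. Edges that lie
    entirely inside the group disappear; edges that cross its boundary are
    re-attached to the summary vertex.
    """
    def rename(vertex):
        return label if vertex in group else vertex

    return {
        (rename(src), rename(dst))
        for src, dst in edges
        if rename(src) != rename(dst)
    }


# Hypothetical pipeline: each pair reads "source depends on target".
edges = {
    ("report", "average"),
    ("average", "filter"),
    ("filter", "clean"),
    ("clean", "raw_reading"),
}

summary = collapse(edges, group={"average", "filter", "clean"}, label="processing")
```

The four-edge pipeline becomes two edges: the report depends on "processing", which depends on the raw reading. That is closer to what a non-expert might want to see than the individual steps.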

We Need to Care About Digital Provenance

Building transparent and auditable systems may be one of the greatest software engineering challenges of the coming decade. As a consequence, digital provenance and its application to cybersecurity and the management of personal data have become a hot research topic. We have highlighted key active areas of research and their associated challenges. It is fundamental for industry practitioners to understand the threat posed by the black-box nature of the IoT, the potential solutions, and the challenges to a practical deployment of those solutions. Accountability-by-design must become a core objective of IoT platforms.

References

    1. Acar, U. et al. A graph model of data and workflow provenance. In Proceedings of the TAPP'10 Second Conference on Theory and Practice of Provenance, USENIX, 2010.

    2. Allen, M.D. et al. Provenance for collaboration: Detecting suspicious behaviors and assessing trust in information. In Proceedings of the 7th International Conference on Collaborative Computing: Networking, Applications and Worksharing (CollaborateCom). IEEE, 2011, 342–351.

    3. Amar, Y. et al. An information theoretic approach to time-series data privacy. In Proceedings of the 1st Workshop on Privacy by Design in Distributed Systems. ACM, 2018, 3.

    4. Balakrishnan, N. et al. Non-repudiable disk I/O in untrusted kernels. In Proceedings of the 8th Asia-Pacific Workshop on Systems. ACM, 2017, 24.

    5. Bates, A. et al. Take only what you need: Leveraging mandatory access control policy to reduce provenance storage costs. In Proceedings of the Conference on Theory and Practice of Provenance. USENIX, 2015, 7.

    6. Bates, A.M. et al. Trustworthy whole-system provenance for the Linux kernel. In Proceedings of the USENIX Security Symposium, 2015, 319–334.

    7. Buneman, P. et al. Why and where: A characterization of data provenance. In Proceedings of the International Conference on Database Theory. Springer, 2001, 316–330.

    8. Carata, L. et al. A primer on provenance. Commun. ACM 57, 5 (May 2014), 52–60.

    9. Cheney, J. et al. Provenance in databases: Why, how, and where. Foundations and Trends in Databases 1, 4 (2009), 379–474.

    10. Crabtree, A. et al. Building accountability into the Internet of Things: The IoT databox model. Journal of Reliable Intelligent Environments (2018).

    11. Davidson, S. et al. Provenance views for module privacy. In Proceedings of the Thirtieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. ACM, 2011, 175–186.

    12. De Hert, P. et al. The right to data portability in the GDPR: Towards user-centric interoperability of digital services. Computer Law & Security Review. Elsevier, 2017.

    13. Eldefrawy, K. et al. SMART: Secure and minimal architecture for (establishing dynamic) root of trust. In Network and Distributed System Security Symposium 12 (2012), 1–15.

    14. Han, X. et al. FRAPpuccino: Fault-detection through Runtime Analysis of Provenance. In Proceedings of the Workshop on Hot Topics in Cloud Computing (HotCloud'17). USENIX, 2017.

    15. Hasan, R. et al. The case of the fake Picasso: Preventing history forgery with secure provenance. In Proceedings of the Conference on File and Storage Technologies (FAST'09), 2009, 1–14.

    16. Herschel, M. et al. A survey on provenance: What for? What form? What from? The VLDB Journal—The International Journal on Very Large Data Bases 26, 6 (2017), 881–906.

    17. Hossain, M.N. et al. Dependence-preserving data compaction for scalable forensic analysis. In Proceedings of the USENIX Security Symposium.

    18. King, S.T. and Chen, P.M. Backtracking intrusions. ACM SIGOPS Operating Systems Review 37, 5 (May 2003).

    19. Liang, X. et al. Provchain: A blockchain-based data provenance architecture in cloud environment with enhanced privacy and availability. In International Symposium on Cluster, Cloud and Grid Computing. IEEE/ACM, 2017, 468–477.

    20. Missier, P. et al. ProvAbs: Model, policy, and tooling for abstracting PROV graphs. In Proceedings of the International Provenance and Annotation Workshop. Springer, 2017, 3–15.

    21. Moyer, T. and Gadepally, V. High-throughput ingest of data provenance records into Accumulo. In Proceedings of the High Performance Extreme Computing Conference (HPEC), IEEE, 2016, 1–6.

    22. Pasquier, T. et al. Runtime analysis of whole system provenance. In Proceedings of the Conference on Computer and Communications Security (CCS'18). ACM, 2018.

    23. Pasquier, T. et al. If these data could talk. Scientific Data 4 (2017), http://www.nature.com/sdata2017114.

    24. Pasquier, T. et al. Data provenance to audit compliance with privacy policy in the Internet of Things. Personal and Ubiquitous Computing (2018), 333–344.

    25. Pohly, D.J. et al. Hi-Fi: Collecting high-fidelity whole-system provenance. In Proceedings of the 28th Annual Computer Security Applications Conference. ACM, 2012, 259–268.

    26. Schreiber, A. and Struminski, R. Tracing personal data using comics. In Proceedings of the International Conference on Universal Access in Human-Computer Interaction. Springer, 2017, 444–455.

    27. Singh, J. et al. Twenty security considerations for cloud-supported Internet of Things. IEEE Internet of Things Journal 3, 3 (Mar. 2016), 269–284.
