Architecture and Hardware Contributed articles

New Approaches to Security and Availability For Cloud Data

Extending the data trust perimeter from the enterprise to the public cloud requires more than encryption.

Posted Feb 1 2013

Introduction
Key Insights
Integrity Checking with Iris
Auditing Framework
Enhancing Data Availability with HAIL
Conclusion
References
Authors
Figures

New Approaches to Security and Availability for Cloud Data, illustration

Cloud computing is a service model that offers users (herein called “tenants”) on-demand network access to a large shared pool of computing resources (the cloud). The financial benefits of cloud computing are widely recognized. Building and managing a large-scale data center results in savings of a factor between five and seven over one of medium size in terms of electricity, hardware, network-bandwidth, and operational costs.¹ From the tenant’s perspective, the ability to use and pay for resources on demand and the elasticity of the cloud are strong incentives for migration to the cloud.

Key Insights

Security concerns limit or even impede enterprise migration into public clouds.
Creating incentives for cloud adoption requires thinking beyond data encryption, which alone rarely provides confidentiality on data processed in the cloud or protects against tampering, corruption, or loss of availability.
Real-time auditing by tenants or third parties can create new security visibility in the cloud and strong assurance of correct cloud-service operation.

Despite these benefits, public clouds are still not widely adopted, especially by enterprises. Most large organizations today run private clouds, in the sense of virtualized and geographically distributed data centers, but rarely rely primarily on externally managed resources; notable exceptions include Twitter and The New York Times, which run on Amazon infrastructure.

Major barriers to adoption are the security and operational risks to which existing cloud infrastructures are prone, including hardware failure, software bugs, power outages, server misconfiguration, malware, and insider threats. Such failure and attack vectors are not new, but their risk is amplified by the large scale of the cloud.⁵ They can even be disastrous, including data loss and corruption, breaches of data confidentiality, and malicious tampering with data. Strong protections beyond encryption are therefore a necessity for data outsourced to the cloud, with two standing out as particularly important: integrity, or assurance against data tampering, and freshness, or the guarantee that retrieved data reflects the latest updates.

Another concern hindering migration into a public cloud is lack of availability and reliability guarantees. Well-known cloud providers have experienced temporary lack of availability lasting at least several hours^11,21 and striking loss of personal customer data, most notably the 2011 T-Mobile/Sidekick incident.²³ Traditional reliability models for hardware make certain assumptions about failure patterns (such as independence of failures among hard drives) that are not accurate in the world of cloud computing. Without novel data-reliability protections (beyond today’s RAID-5 and RAID-6), maintaining correctness of massive amounts of data over long periods of time is extremely difficult.⁴

Another top concern for enterprises migrating into the cloud is collocation with potentially malicious tenants.⁵ In an Infrastructure-as-a-Service (IaaS) model, tenants rent virtual machines (VMs) on servers they share with other tenants; logical isolation among VMs is enforced by hypervisors. In a Platform-as-a-Service (PaaS) model, different tenants may run applications in the same operating system, without clear isolation beyond basic OS-level protections (easily bypassed by advanced malware). Ristenpart et al.¹⁸ showed that an attacker can collocate a VM under its control on the same server as a targeted victim VM in the Amazon IaaS infrastructure; they also provided evidence that such an attacker can exploit side channels in shared hardware (such as the L2 cache) to exfiltrate sensitive data from the victim.

Our research addresses the challenge of migrating enterprise data into the public cloud while retaining tenant trust and visibility. We have devised cryptographic protocols that extend traditional trust perimeters from enterprise data centers into the public cloud by conferring strong protections on migrated data, including integrity, freshness, and high availability. In addition, we propose an auditing framework to verify properties of the internal operation of the cloud and assure enterprises that their cloud data—and workloads—are handled securely and reliably.

Adversarial model. Here, we are concerned mainly with cloud providers subject to a range of threats. These providers might deviate from our protocols for cost savings or due to poor security practices but not for pure malicious intent. In most cases, we are able to detect deviations from our protocols by misbehaving cloud providers but do not provide remediation mechanisms against fully malicious cloud providers. By using multiple cloud providers in the design of our High-Availability and Integrity Layer, or HAIL, protocol, described later, we show we can also provide data integrity and availability in a setting in which a fraction of providers can be fully compromised.

Solution overview. Figure 1 outlines our vision of a more-trustworthy cloud-computing model for enterprises. A small trusted gateway within the enterprise intermediates all communication, from internal data center to external public cloud. The gateway manages cryptographic keys (for encrypting data for confidentiality requirements), maintains trusted storage for integrity and freshness enforcement, and may add redundancy to data for enhanced availability. Once the data and workloads of a particular enterprise migrate to the cloud, an independent cloud-auditing service (run by the enterprise or, alternatively, a third party) monitors the enterprise’s cloud resources. This service regularly communicates bi-directionally with the gateway. Updates on enterprise data and workloads migrated to the cloud propagate from the enterprise to the auditing service, which communicates the results of its audits back to the enterprise, including, say, the health scores of various resources (such as data repositories and virtual machines).

Organization. Here, we describe several research projects that are components of this broad vision, starting with our design for an authenticated file system called Iris that allows migration of existing internal enterprise file systems into the cloud. Iris offers strong integrity and freshness guarantees of both file-system data and metadata accessed while users perform file-system operations. Iris minimizes the effects of network latency on file-system operations and is optimized for typical file-system workloads (sequential file accesses). We then introduce our auditing framework, including Proofs of Retrievability (PoR) and related protocols that cryptographically verify the correctness of all cloud-stored data with minimal communication. (Remarkably, even against a cheating cloud, PoRs show that every bit of data stored in the cloud is intact.) We describe a dynamic PoR architecture that supports data updates in Iris and audit of physical-layer storage properties. We also show how to verify that cloud data is replicated across multiple hard drives with our Reversible Addition-Fragmentation chain Transfer, or RAFT, protocol. For further data protection we address the challenge of data availability in the face of cloud-service failures, including potentially malicious ones. We also describe our HAIL protocol, which distributes data redundantly across multiple cloud providers. HAIL is a cloud extension of the RAID principle, building reliable storage systems from inexpensive, unreliable components. We conclude with yet more challenges in securing cloud data.

Integrity Checking with Iris

Tenants commonly assume encrypting their data before sending it to the cloud is sufficient for securing it. Encryption provides strong confidentiality against a prying or breached cloud provider. It does not, however, protect against corruption of data due to software bugs or configuration errors, which require enforcement of a different property—data integrity—to ensure that data retrieved by a tenant is authentic, or has not been modified or corrupted by an unauthorized party. On its own, data integrity is relatively easy to achieve through cryptography (typically through Message-Authentication Codes, or MACs, on data blocks). But one critical yet subtle related security property—freshness—of data is often overlooked. Freshness ensures the latest updates are always propagated to the cloud and prevents rollback attacks in which stale versions of the data are presented to tenants.

Data freshness ensures retrieved data always reflects the most recent updates while preventing rollback attacks. Achieving data freshness is essential for protecting against misconfiguration errors or rollbacks that are caused intentionally and is the main technical challenge in building the Iris system we describe in the following section:

Iris design goals. Iris is an authenticated file system that supports migration of an enterprise-class distributed file system into the cloud, efficiently, transparently, and in a scalable manner. It is authenticated in the sense that it enables an enterprise tenant to verify the integrity and freshness of retrieved data while performing file-system operations.

A key Iris design requirement is that it imposes on client applications no changes to file-system operations, including file read, write, update, and delete operations, as well as creation and removal of directories. That is, Iris does not require user machines to run modified applications. It also aims to achieve a slowdown in operation latency small enough to go unnoticed by users even when a large number of clients in the enterprise (on the order of hundreds and even thousands) issue file-system operations in parallel.

Iris architecture. An important challenge we faced when designing Iris is how to address the typically high network latency between an enterprise and the cloud. To reduce the effect of network latency on individual operation latency and the cost of network transfer to and from the cloud, Iris employs heavy caching on the enterprise side.

In Iris, a trusted gateway residing within the enterprise trust boundary, as in Figure 1, intermediates all communication from enterprise users to the cloud. The gateway caches data and meta-data blocks from the file system recently accessed by enterprise users. The gateway computes integrity checks, namely MACs, on data blocks. It also maintains integrity and freshness information for cached data consisting of parts of a tree-based authenticated data structure stored in the cloud.

The cloud maintains the distributed file system consisting of all enterprise files and directories. Iris is designed to use any existing back-end cloud-storage system transparently, without modification. The cloud also stores MACs for block-level integrity checks, as well as a tree-based cryptographic data structure needed to ensure the freshness of data blocks and the directory tree of the file system.

Integrity and freshness verification. To guarantee data integrity and freshness for an entire file system, Iris uses an authentication scheme consisting of two layers (see Figure 2). At the lower layer, it stores a MAC for each file block; file blocks are fixed-size file segments of typical size 4KB. This structure enables random access to file blocks and verification of individual file-block integrity without accessing full files. For freshness, MACs are not sufficient, so Iris associates a counter or version number with each file block incremented on every block update¹⁵ and included in the block MAC. Different versions of a block can be distinguished through different version numbers. But for freshness, block version numbers must also be authenticated.

The upper layer of the authentication scheme is a Merkle tree tailored to the file system directory tree. The leaves of the Merkle tree store block-version numbers in compressed form. The authentication of data is separated from the authentication of block-version numbers to enable optimizations in the data structure. Internal nodes of the tree contain hashes of their children, as in a standard Merkle tree. The root of the Merkle tree must be maintained at all times within the enterprise-trust boundary at the gateway.

The tenant can efficiently verify the integrity and freshness of a file data block by checking the block MAC and the freshness of the block-version number. The tenant verifies the block-version number by accessing the sibling nodes on the path from the leaf storing the version number up to the root of the tree, recomputing all hashes on the path to the root and checking that the root matches the value stored locally. With a similar mechanism, the tenant can additionally verify the correctness of file paths in the file system and, more generally, of any other file system metadata (such as file names, number of files in a directory, and file-creation time).

This Merkle-tree-based structure has two distinctive features compared to other authenticated file systems: Support for existing file-system operations (Iris maintains a balanced binary tree over the file system directory structure to efficiently support existing file system calls); and support for concurrent operations (the Merkle tree supports efficient updates from multiple clients operating on the file system in parallel). Iris also optimizes for sequential file-block accesses; sequences of identical version counters are compressed into a single leaf.

The Iris authentication mechanism is practical and scalable; in a prototype system, a Merkle tree cache of only 10MB increases the system throughput by a factor of 3, so, compared to no caching, the throughput is fairly constant for approximately 100 clients executing operations on the file system in parallel; the operation-latency overhead introduced by processing at the gateway, including the integrity-checking mechanism, is at most 15%. These numbers are reported from a user-level implementation of Iris evaluated on commonly used benchmarks, including IOZone, sequential file reads and writes, and archiving of an entire directory structure.²⁰

Auditing Framework

Tools like Iris enable tenants to deploy their own security protections for data migrated to the cloud. But tenant self-protection is effective only to a point; for instance, even with Iris in place, a tenant’s data is not safe against wholesale service-provider failure. Moreover, while Iris enables tenants to detect data loss resulting from a drive crash, it does not give them early warning of the probable precondition—a dangerous lack of provider storage redundancy.

A strong auditing framework is the cornerstone of tenant confidence in a service provider. Regulatory, reputational, and contractual assurances of provider safeguards are important, but ongoing technical assurances of solid security are irreplaceable.

When Alice (the tenant) stores data with Bob (the cloud), she wants to know that Bob has not let her data succumb to bit rot, storage-device failures, corruption by buggy software, or myriad other common threats to data integrity.

Our research aims to develop a tenant-provider auditing relationship in which a tenant or external auditing service acting on the tenant’s behalf, as in Figure 1, can continuously audit a provider to prove or disprove compliance with a given security policy. The provider responds to a challenge with a compact, real-time proof. Such auditing draws on the structure of a cryptographic challenge-response protocol; the tenant can rigorously verify the provider’s response, obtaining a technically strong guarantee of policy compliance.

The challenge-response-style protocols described here cover a range of security properties, from data correctness to availability in the face of provider failures. We give a high-level overview of how they work and the guarantees they offer; we also briefly discuss their place in a larger vision, in which trusted hardware complements our auditing framework.

For concreteness, we mimic the cryptographic literature, referring to our canonical tenant as Alice and cloud provider as Bob. In our description here, Alice also acts as the auditor verifying properties of cloud-stored data, but in our more general framework, from Figure 1, the auditing protocol could be executed by a separate entity—the auditing service.

Auditing data retrievability. When Alice (the tenant) stores data with Bob (the cloud), the most basic assurance she is likely to seek is that her data remains intact. She wants to know that Bob has not let her data succumb to bit rot, storage-device failure, corruption by buggy software, or myriad other common threats to data integrity. Because even a well-meaning Bob may be vulnerable to infection by malware, Alice also needs such assurance to be robust even if Bob cheats.

One strong cryptographic approach to assurance is the PoR,¹² a challenge-response protocol in which Bob proves to Alice that a given piece of data D stored in the cloud is intact and retrievable. While the Iris system enables verification of data integrity for data retrieved from the cloud in the course of performing regular file-system operations, a PoR enables verification of an entire data collection without first retrieving it from the cloud.

This goal might seem counterintuitive at first, even impossible. A PoR can demonstrate with a compact proof (on the order of, say, hundreds of bytes) that every single bit of data is intact and accessible to Alice, even if it is very large (gigabytes or more).

Building a PoR, step by step. Here, to give a sense of how a PoR works, we develop a construction in stages. An obvious candidate approach is for Alice to simply store a cryptographic hash c = h(D) of data D in the cloud. To verify that D is intact, she challenges Bob to send her c. However, this idea involves two problems: Bob can cheat, storing c and throwing away D, though a refinement incorporating a secret “nonce” into the hash would address it. Efficiency considerations are a more fundamental drawback; to generate c from D authentically, Bob must hash all of D, a resource-intensive process if D is big.

An alternative approach is for Alice to sample data blocks and verify their correctness, in effect spot-checking her data. Now let r_i denote the i^th data block; blocks are fixed-size segments of data with typical size 4KB. Before storing the data into the cloud, Alice locally retains a randomly selected block r_i. To challenge Bob, she sends him the block-index i and asks him to transmit r_i, which she verifies against her local copy. To amplify the probability of detecting data corruption, Alice can request multiple blocks with independent random indices i₁, i₂,…, i_m simultaneously.

Bob now has to touch only a small portion of the data to respond to a challenge, solving the resource-consumption problem with hashing. If D, or a large chunk of D, is missing or corrupted, Alice will detect this fact with high probability, as desired. However, the scheme still involves two drawbacks:

First, while Alice can detect large corruptions with high probability, she is unlikely to detect small corruptions of, say, limited bit rot, even with multiple challenge blocks. Suppose her data has one million blocks and she challenges Bob on 10 randomly selected blocks; the probability of Alice detecting a one-bit error would be less than 0.001%. And second, she must use fresh blocks for each challenge, so Bob cannot predict future challenges from past ones. So, if Alice plans to challenge Bob many times, she must store a considerable amount of data locally.

To solve the first, we can appeal to an error-correcting code, a technique for adding redundancy to some piece of data D (called “message”), yielding encoded data D* (called “code word”). D* is often constructed by appending what are called “parity blocks” to the end of D. If a limited portion of the encoded data D* is corrupted or erased, a decoding function can still be applied to restore the original data D. The expansion ratio (|D*| = |D|) and amount of tolerable corruption depend on the code parameters. For the sake of the example, though, consider an error-correcting code that expands data by 10%—|D*|=|D| = 1.1—and can successfully decode it provided that at most 10% of the blocks in D* are corrupted.

Now, if Alice stores D* instead of D, she is assured her data will be lost only if a significant fraction of her data is corrupted or erased. That is, for a single bit of D to be irretrievably corrupted, a large chunk of D* must be corrupted Use of error-correcting effectively amplifies the power of Alice’s challenges. Suppose, for instance, she uses a code that corrects up to 10% corruption. Now, issuing 10 challenges against a one-million-block data collection will result in detection of any irretrievably corrupted bits in D with probability over 65%—a vast improvement over 0.001%.

Solving the problem of excessive local storage is fairly easy. Using a locally stored secret key k, Alice can compute MACs—secret-key digital signatures—over data blocks r₁, r₂,…, r_n. She can safely have Bob store MACs c₁, c₂,…, c_n alongside D*. To verify the correctness of a block r₁, Alice uses k and c_i. So Alice needs to store only the key k. Figure 3 shows the data stored by Alice and Bob in this PoR construction.

Concerning other challenges in constructing a practical PoR, Alice can send, say, a key to a pseudorandom function during a challenge, based on which Bob can infer the position of all challenged blocks. Bob can also aggregate multiple response blocks into one, rendering transmission and verification more efficient. Additionally, making error-correcting practical for large data collections requires coding and cryptographic tricks, though the solution presented here provides much of the intuition behind a full-blown PoR construction.

A PoR can be used to verify the integrity and correctness of any type of data collection stored in the cloud, including file systems and key-value stores. Its salient property is it gives Alice strong guarantees about the correctness of an entire data collection using minimal computation and bandwidth for auditing. The auditing protocol could be performed by a third-party auditing service.

Variants. Several PoR variants are available for auditing data integrity: for example, a Proof of Data Possession (PDP)² enables public verification of data integrity by employing public-key cryptography. Compared with PoRs, PDP provides detection of only large amounts of data corruption, without a recovering mechanism. PDP and PoR protocols can be privately or publicly verifiable. In a privately verifiable protocol, auditing can be performed only by the party who knows the secret key used for encoding the data. In contrast, in a publicly verifiable protocol, auditing can be performed by any third-party service, at the cost of more computationally expensive encoding and auditing protocols. The term Proof of Storage (PoS)³ has evolved as a convenient catchall for PoRs and PDPs. Most PoS protocols can be combined with other cryptographic protections (such as encryption for data confidentiality) to achieve a suite of cloud-data protections.¹³

PoRs in Iris. A basic PoR, as described earlier, has a notable limitation: an inability to handle data updates gracefully. Changing a single data block in D requires propagation of changes across the parity blocks of D*. So a basic PoR is efficient only for checks on static data (such as archived data). The situation is somewhat better without error correction; researchers have proposed (asymptotically efficient) PDP systems that support data updates.⁸ But support for updates also rules out recovering from a small amount of data corruption.

It is natural, then, to ask whether we can have the best of both worlds—a PoR for dynamically changing data. Yes is the answer.

Check values (MACs or digital signatures) pose the first major challenge in supporting a dynamic PoR/PDP. Not only must they be updated in conjunction with data-block updates, but, when verifying them, Alice must be able to determine they are both correct and fresh. The Iris system is designed precisely to tackle the tricky problem of efficient freshness verification for file systems, making it ideal for building a dynamic PoR.

An even more formidable challenge is updating error-correcting blocks as data blocks change. That is, to protect against targeted corruption by Bob, the structure of the error-correcting code, and thus the pattern of parity block updates, must be hidden from him. Encryption is no help; basic encryption hides data, not access patterns.

The way out of this conundrum is a model shift inspired by the deployment objectives of Iris. While PoR designers generally aim to keep Alice’s storage to a minimum, Iris aims at enterprise-class cloud deployment. When Alice is a company, rather than an individual, substantial tenant-side resources are a reasonable expectation.

The key idea for dynamic PoR layered on Iris is to have Alice cache parity blocks locally, on the enterprise side, and periodically back them up to the cloud. This approach conceals individual parity-block updates from Bob, as well as the code structure. It has another advantage, too: Alice’s updates to parity blocks can be made locally. As a single data-block update results in multiple parity-block updates, the ability to make updates locally greatly reduces communication between Alice and Bob.

The result is an enhancement such that Iris not only verifies file integrity and freshness but can also check (efficiently) whether an entire file system is intact, down to the last bit. In addition, if corruptions to data are found (through auditing or through integrity verification using the Merkle tree structure described earlier), Iris can recover corrupted blocks from the additional redundancy provided by the erasure code. Iris provides strong guarantees of detection and remediation of data corruption, resulting in retrievability of an entire file system stored in the cloud. A great Iris benefit is that its parameters for the erasure code and the communication during an audit can be adjusted for a desired level of recoverability.

While construction in Iris provides the first practical solution to a dynamic PoR protocol, it relies on some amount of local storage maintained by the tenant; in our Iris instantiation the client maintains O( n) local storage for a file system with n data blocks. The problem of constructing a dynamic PoR protocol with constant storage at the client-side is the major remaining theoretical cryptographic research challenge in auditing data retrievability.

Auditing of drive-failure resilience. Detecting data loss via Proofs of Storage (PoS) is helpful, though prevention is best. One major cause of data loss is drive crashes. In a large data center with hundreds of thousands of drives, drive failure is the norm rather than the exception. With 2%3% annual failure rates published by manufacturers and even higher numbers observed in the field,¹⁹ a large data center is likely to experience thousands of drive failures each year.

Services (such as Amazon S3) claim to store files in triplicate. But how can such a claim be verified remotely? At first glance, it seems impossible. Suppose Alice wants to verify that Bob is storing three copies of her data. Downloading the three copies would not work; if Bob is cheating and storing only one copy of the data, he can simply transmit that one copy three times. There is, however, a very simple solution: Alice can encrypt each copy of her data under a separate key, yielding three distinct encryptions. Executing a PoS against each encrypted version ensures the existence of three distinct copies.

In many cases, though, Alice does not want to have to store her data in encrypted form. She may want Bob to be able to process her data for her or make it available to her friends. More important, the existence of three distinct copies does not per se ensure resilience to drive crashes. All three copies could be sitting on the same drive, after all. So how can Alice verify that there are three distinct copies of the data, each on a different drive?

A striking feature of the problem is that it is primarily physical, not logical. The objective is not to verify the encoding or mere existence of data but its disposition on a physical substrate.

RAFT⁷ is the solution, allowing Alice to verify that Bob has stored some piece of data D so it can survive up to t drive failures, for a desired parameter t. RAFT allows D to be dispersed using erasure coding, a more space-efficient technique than maintaining full file copies. RAFT operates specifically on data stored in rotational drives, exploiting their performance limitations as a bounding parameter. The more drives across which Bob has striped D, the faster he can respond to a challenge. In particular, RAFT makes use of bounds on the seek time of a rotational drive. Alice transforms her data D using an erasure code into encoded data D*. D* can be striped across c drives such that if any t fail, D can be recovered. She asks Bob to store D* across c drives.

To verify resilience to t drive failures, Alice challenges Bob to fetch a set of n randomly distributed blocks from D*. Suppose Bob stores D* on d drives. Each block fetch incurs a seek (assuming the random blocks are spread apart at a large distance). So on average, if a seek takes time μ, Bob’s total fetch time is μn/d. If d > c, then his response time will take μn(1/d−1/c) longer than expected, on average. By measuring Bob’s response time, then, Alice can determine whether he is indeed using c drives, as required (see Figure 4).

While we do not go into detail here, many complications arise in real-world settings. Variations in drive performance, as well as in network latency, between Alice and Bob must be measured carefully. Sensitive statistical analysis and structuring of challenges is required to accommodate these variations.

Hardware roots of trust. Another approach to assurance within the challenge-response framework we explore here is a hardware root of trust, as supported by, say, Trusted Platform Modules that permit a tenant to verify remotely, via a challenge-response protocol, a provider is executing a particular software stack.

But hardware roots of trust cannot directly enforce guarantees on storage integrity or reliability. Even if Bob’s servers are configured precisely, as specified by Alice, and even if Alice controls their operation, she obtains no direct guarantee as to their corresponding storage subsystem. For instance, the only way for Alice to determine a file F is intact in storage without full-file inspection is to perform a PoR. The same holds for properties verified by Iris, HAIL, and RAFT, none guaranteed solely through trustworthy execution environments.

Enhancing Data Availability with HAIL

We have described an auditing framework that offers tenants visibility into the operations of the cloud and verification of some properties of their cloud-side data. But what happens if the cloud provider fails to respond correctly to an audit due to data loss? A major impediment to cloud adoption by enterprises is temporary lack of availability by the provider or even permanent failure. This is a real threat, as illustrated by catastrophic provider failures resulting in massive customer data loss.²³

We designed HAIL⁶ specifically to address this challenge, predicated on the idea that it is wise to distribute data across multiple cloud providers for continuous availability. HAIL thus leverages multiple cloud providers to construct a reliable, cost-effective cloud-storage service from unreliable components. The idea is similar in flavor to RAID,¹⁶ which creates reliable storage arrays from unreliable hard drives. HAIL extends the idea into the cloud, the main differences being its support for a stronger adversarial model and a higher-level abstraction.

HAIL works by promptly detecting and recovering from data corruption. The tenant (or third-party service) periodically audits individual cloud providers toward this end. HAIL auditing is lightweight in terms of both bandwidth and computation. Using the redundancy embedded across different cloud providers, the tenant (or third party) remediates corruption detected in a subset of providers. HAIL is reactive, rather than proactive, meaning it remediates data only on detected corruption.

System model. In HAIL a tenant like Alice distributes her data with embedded redundancy to a set of n cloud providers: S₁, …, S_n. In our model, data generated by all enterprise users is transmitted to the gateway, as in Figure 1. The gateway performs some data encoding, described in the following paragraphs, optionally encrypts data, and distributes a data fragment to each cloud provider. HAIL operates through an adversarial model in which a strong mobile adversary can corrupt all cloud providers over time. But within a single epoch (a predetermined period of fixed length) the adversary can corrupt at most b out of n servers, for some b < n.

HAIL encoding. Figure 5 outlines the encoding of data in HAIL. To provide resilience against cloud-provider failure, the gateway splits the data into fixed-size blocks and encodes it with a new erasure code we call “dispersal code.” The figure includes a matrix representation of the data on the left, resulting in an encoded matrix on the right. Each row in the encoded matrix is a stripe or code word obtained by applying the dispersal code. Each row contains the original data blocks, as well as new parity blocks obtained with the dispersal code. Each matrix column is stored at a different cloud provider. The dispersal code guarantees the original data can be reconstructed, given up to b cloud provider failures (and n – b intact columns).

However, a single layer of encoding is not sufficient to guarantee data availability and integrity in HAIL’s strong adversarial model. Consider this attack: The adversary corrupts b new servers in each epoch, picks a particular row index i, and corrupts the corresponding block r_ij at server S_j. After n/b⌉ epochs, the adversary corrupts all servers and the entire row i in the encoded matrix from Figure 5. In this case, the redundancy of the dispersal code is not helpful in recovering the corrupted row, and the entire data D is permanently corrupted.

HAIL is an extension of RAID into the cloud, distributing data across multiple cloud providers to achieve continuous availability.

How can the system prevent such a creeping-corruption attack? By simply auditing a few randomly selected blocks at each server, the probability that the tenant would discover the corruption of blocks in a single row of the encoded matrix is quite low. Therefore, another encoding layer we call a “server code” is needed within each server. The server code adds additional redundancy (parity blocks) to each column in the encoded matrix representation. The role of the server code is to recover from a small amount of corruption at each cloud provider that would otherwise be undetectable through the auditing protocol.

To prevent adversarial data corruption, the tenant also needs to store MACs on data blocks in the cloud. With a new technique we call an “integrity-protected dispersal code,” parity blocks of the dispersal code can themselves be used as MACs on the rows, thus reducing the storage overhead for integrity protection.

Auditing and recovering from failures. In HAIL, the gateway (or an external auditing service, as in Figure 1) periodically audits the correctness of the cloud data by contacting all cloud providers. The gateway sends a random row index i as a challenge to each cloud provider, and verifies, upon receiving the responses r_ij, for j in {1, …, n} the correctness of the entire row. It is also possible to aggregate multiple responses (multiple randomly selected blocks) from each server to reduce bandwidth and amplify the probability of failure detection.

When data corruption at one or more cloud providers is detected, the corrupted data can be reconstructed at the tenant side using the two encoding layers—the dispersal and server code. Data reconstruction is an expensive protocol, one rarely invoked (only upon detection of data corruption).

With its encoding, auditing, and recovery mechanisms, HAIL provides resilience against a strong mobile adversary that might potentially corrupt all providers over time. However, a limitation of HAIL is that, as designed, it does not handle file updates gracefully. Rather, it is most suited to archival data, or data stored in the cloud for retention purposes and not regularly modified. More-efficient versions of HAIL can be constructed under a weaker adversarial model that may be practical for short-term data storage.

Conclusion

We have described new techniques that secure cloud data by ensuring a range of protections, from integrity and freshness verification to high data availability. We also proposed an auditing framework that offers tenants visibility into the correct operation of the cloud. These techniques enable an extension of the trust perimeter from enterprise internal data centers into public clouds, as in Figure 1. Our hope is these techniques will alleviate some of the concern over security in the cloud and facilitate migration of enterprise resources into public clouds. We conclude here by mentioning several open problems of interest in this context:

Performing computations over tenants’ encrypted data. We have emphasized data integrity and availability, but data confidentiality is a major open problem. General computation over a tenant’s encrypted data is possible using a technique we call “fully homomorphic encryption,” though this breakthrough¹⁰ is not yet practical. Weaker, custom techniques can achieve specific functionalities (such as searches¹⁴ and general SQL queries¹⁷) over encrypted data.

The impossibility of general computations over multiple tenants’ encrypted data using only cryptographic techniques and no interaction among tenants was shown in van Dijk and Juels.²² A promising area of research is the design of custom protocols for applications involving multiple tenants’ data (such as data mining over multiple institutions’ medical records or financial transactions). Combining secure hardware architectures with cryptography (such as secure multiparty computation protocols) offers significant potential.

Ensuring tenant isolation. Cloud co-tenancy with attackers can jeopardize tenant data, as shown in Ristenpart et al.,¹⁸ which explored cache-based side channels for data exfiltration. The risks of co-tenancy in storage systems (such as storage-system side channels) is an unexplored vector of attack now deserving investigation, in our view.

One approach to co-tenancy risks is to isolate tenants by implementing virtual private cloud (VPC) abstractions within a public cloud. HomeAlone²⁴ enables tenants to verify remotely that a VPC is strongly enforced at the host level, in the sense of creating physical isolation of a tenant’s workloads. While physical isolation offers a solution for extremely sensitive workloads, it can undercut the financial benefits of tenant sharing of computing resources. For this reason, solutions offering tenant isolation and enabling the sharing of cloud resources at the same time are extremely important. Trusted hardware may also play an important role, along with tight enforcement of logical isolation abstractions throughout the software stack, including hypervisor and OS, and across the cloud fabric.

Geolocation of data. Of particular commercial interest is the open problem of remote verification of the geographical location of cloud data. The motivation is regulatory compliance, with many laws requiring providers keep customer data within, say, national boundaries.⁹ Given the challenge of ensuring that data is not duplicated, any solution probably requires a trusted data-management system via, say, trusted hardware, and localizing the pieces of the system is an interesting challenge. Geolocation of trusted hardware via remote timing from trusted anchor points seems a key avenue of exploration.

Figures

Figure 1. Extending trust perimeter from enterprise data center to the public cloud.

Figure 2. Authentication data structure in Iris.

Figure 3. Data stored by Alice and Bob in a PoR. To issue a challenge to Bob, Alice indicates positions i₁,…, i_m to Bob. Bob returns data blocks r_i1…, r_im, along with MACs c_i1,…, c_im. Alice verifies the correctness of r_ij against MAC c_ij, for all j {1,…, m} using secret key k.

Figure 4. Responding to challenges from one disk (on the left) and two disks (on the right) in the RAFT protocol.

Figure 5. Encoding of data D: on the left, original data is represented as a matrix; on the right, encoded data with parity blocks is added for both server and dispersal codes.

Submit an Article to CACM

CACM welcomes unsolicited submissions on topics of relevance and value to the computing community.

You Just Read

New Approaches to Security and Availability For Cloud Data

View in the ACM Digital Library

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from permissions@acm.org or fax (212) 869-0481.

DOI

10.1145/2408776.2408793

February 2013 Issue

Published: February 1, 2013

Vol. 56 No. 2

Pages: 64-73

Table of Contents

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

The Latest from CACM

Explore More

BLOG@CACM May 9 2024

Vendor Software Solutions in a Cloudy World

Doug Meil

Software Engineering and Programming Languages

Credit: Shutterstock datacenter in the clouds, cloud-computing concept, illustration

News May 7 2024

Whither BCIs?

Sandrine Ceurstemont

Computing Applications

Credit: Getty Images human head with color motion trails, illustration

BLOG@CACM May 3 2024

Pioneering Sustainable IT with Green Computing

Alex Williams

Architecture and Hardware

Shape the Future of Computing

ACM encourages its members to take a direct hand in shaping the future of the association. There are more ways than ever to get involved.

Get Involved

Communications of the ACM (CACM) is now a fully Open Access publication.

By opening CACM to the world, we hope to increase engagement among the broader computer science community and encourage non-members to discover the rich resources ACM has to offer.

Learn More

Key Insights

Integrity Checking with Iris

Auditing Framework

Enhancing Data Availability with HAIL

Conclusion

Figures

New Approaches to Security and Availability For Cloud Data

DOI

February 2013 Issue

Related Reading

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

Shape the Future of Computing

Communications of the ACM (CACM) is now a fully Open Access publication.