Since the launch of large-scale infrastructure-as-a-service (IaaS) offerings from Amazon (Amazon Web Services in 2006), Microsoft (Azure in 2008), and Google (Google Cloud Platform in 2008), there has been a continuous trend toward cloud computing, which allows customers to leverage capabilities and cost advantages through economies of scale. This was made possible through virtualization,3 where virtual machines (VMs) enable the efficient use of large, bare-metal compute architectures (hosts) by using a hypervisor to coordinate sharing between multiple tenants according to their expressed usage requirements. While VMs provide a way for users to obtain additional compute capacity quickly and maximize the use of existing hardware (and/or avoid the cost of maintaining peak capacity by using a public cloud), it is still necessary to configure, deploy, manage, and maintain them.
In recent years, container-based technologies, such as Docker and Kubernetes, have arisen to make it easier to manage software application lifecycles. They provide a lightweight solution for creating a set of machine configurations, called containers, which can be deployed onto hardware (virtualized or physical) as a group via an automated process. Container technology provides multiple separate user-space instances that are isolated from one another via kernel software. Unlike VMs, containers run directly on the host system (sharing its kernel) and as such, do not need to emulate devices or maintain large disk files. Further, container definitions adhering to the Open Container Initiative (OCI) Distribution Specification15 specify dependencies as layers that can be shared between different containers, making them amenable to caching and thus speeding up deployment while reducing storage costs for multiple containers.
The success of containerization technology for on-premises systems has led to major cloud providers developing their own containers-as-a-service (CaaS) offerings, which provide customers with the ability to maintain and deploy containers in the public cloud. In CaaS offerings, containers run in a per-group utility virtual machine (UVM), which provides hypervisor-level isolation between containers running from different tenants on the same host. While the container manager and container shim running on the host are responsible for pulling images from the container registry, bringing up the UVM, and orchestrating container execution, an agent running in the UVM (the guest agent) coordinates the container workflow as directed by the host-side container shim.
Cloud computing poses challenges for systems that require confidentiality,8 because the three groups involved have competing needs:
Cloud Service Provider (CSP)—The owner and operator of a public cloud.
Tenant—A CSP customer that uses the cloud to host applications.
User—A tenant customer or employee that uses the tenant’s cloud applications.
Although VMs and VM-isolated container groups provide strong isolation between tenants, they are deployed by the CSP and coordinated by the CSP’s hypervisor. As such, in existing cloud offerings, the host operating system (including the container manager and container shim) and the hypervisor lie in the tenant’s trusted computing base (TCB), the set of all the hardware and software a tenant is required to trust. As a rule, the TCB should be as small as possible. Therefore, research into confidential computing17 proposed reducing the TCB size by leveraging hardware-enforced trusted execution environments (TEEs).2,4,5,7,9,11,13,20,21
TEEs isolate an application’s code and data from the rest of the system, including privileged components. They protect the integrity and confidentiality of user workloads, even if the host’s software is compromised or controlled by a malicious entity. TEEs available from major CPU vendors can be either process-based, such as Intel SGX,18 or VM-based, such as AMD SEV-SNP,10 Intel TDX,6 and ARM CCA.17 VM-based TEEs offer hardware-level isolation of the VM, preventing the host operating system and the hypervisor from having access to the VM’s memory and registers.
With CaaS, container execution is orchestrated by a host-side shim that communicates with the guest agent, which coordinates the activity of the container group in the UVM. The UVM can be hardware-isolated in a TEE enclave, but the container images are still controlled by the host, as are the order in which they are mounted, the container environment variables, the commands sent to the containers via the bridge between the container shim and the guest agent, and so forth. This means that a compromised host can overcome the hardware isolation of the VM by injecting malicious containers. This risk of attack—be it from malicious or compromised employees of the CSP, or external threats—limits the extent to which containerization can be used in the cloud for sensitive workloads in industries such as finance and healthcare.
The naïve solution to this problem is to run the guest agent and container shim in the same VM-based TEE. This removes the CSP from the tenant’s TCB, but it takes away the CSP’s ability to orchestrate and automate the container workflow. It also leaves the tenant in the user’s TCB, because the container images are controlled by the tenant, as is the order in which they are mounted, the container environment variables, and the commands sent to the containers. The user of the confidential container (for example, a bank customer or a patient providing data to a doctor) must then trust that the tenant has and will run only the expected commands. Finally, this leaves image integrity and data confidentiality/integrity unsolved.
To provide image and data integrity while keeping both the CSP and the tenant out of the user’s TCB, we built Parma, which powers confidential containers on Azure container instances. It implements the confidential containers abstraction on a state-of-the-art container stack running on processors with VM-based TEE support (that is, AMD SEV-SNP processors). Parma provides a lift-and-shift experience and the ability to run unmodified containers pulled from (unmodified) container registries, while providing the following strong security guarantees:
Container attestation and integrity. Only customer-specified containers can run in the TCB, and any means of container tampering is detected by the TCB and will either cause deployment of the container group to fail (if the execution policy is intact) or remote attestation to be invalid (because the attested policy is different).
User data confidentiality and integrity. Only the TCB has access to the user’s data, and any means of data tampering is detected by the TCB via evidence made available to the user.
Figure 1 shows how Parma uses execution policies to enforce user constraints and provide security to a container group. An execution policy is a component of a confidential VM that is attested at initialization time; it describes all the actions the tenant has explicitly allowed the guest agent (operated by the CSP) to take in the container group that runs on that VM. Figure 1(a) shows how Parma limits the user TCB to the container group itself, isolating it from the host operating system and hypervisor. Figure 1(b) shows an unsuccessful mount action (requested by the host and denied by the guest), in which a layer of a container image does not have a dm-verity (device-mapper’s “verity” target) root hash that matches a hash enumerated in the policy.
Parma provides strong protection from the CSP for the container’s root file system (consisting of the container image layers and writable scratch space) and for the user’s data. For container image layers (pulled in plain text by the untrusted container manager and stored in a host-side block device), Parma mounts the device as an integrity-protected read-only file system and relies on the file-system driver to enforce integrity checking on access to the file system. For confidentiality and integrity of privacy-sensitive data stored in a block device (for example, the writable scratch space of the container’s root file system) or blob storage (for example, remote blobs holding user data), Parma relies on block-level encryption and integrity protection to decrypt memory-mapped blocks, guaranteeing that data appears in plain text only in the VM’s hardware-protected memory.
Finally, Parma provides container attestation rooted in security hardware by enforcing tenant-specified execution policies. The guest agent has been augmented to enforce the execution policy such that it executes only those commands (submitted by the untrusted container shim) that are explicitly allowed by the tenant, as seen in Figure 1. The policy is attested by encoding its measurement in the attestation report as an immutable field at UVM initialization.
As a result of including the execution policy, the hardware-issued attestation describes the initial state of the container group and all its allowed state transitions. The attestation can then be used downstream by users for operations that require it to establish security guarantees. For example, remote verifiers may release keys (governing the user’s encrypted data) to only those container groups that can present an attestation report encoding the expected execution policy and measurement of the UVM.
Architecture
This section describes the Parma architecture that implements this new confidential containers abstraction, which uses attested execution policies. First, the container platform, which forms the basis of the Parma design and implementation, is described, followed by a detailed description of the threat model. This section also presents the security guarantees under the threat model and describes how Parma provides these guarantees via a collection of design principles.
The guiding principle of Parma is to provide an inductive proof over the state of a container group, rooted in the attestation report produced by the platform security processor (PSP). The standard components and lifecycle for the container platform (CPLAT) are largely unchanged, except for the guest agent, whose actions become circumscribed by the execution policy. Thus constrained, the future state of the system can be rooted in the measurement performed during guest initialization.
Container platform. Figure 2 illustrates container flow. The sequence diagram at the top shows the process that results in a VM-isolated container. The pentagons correspond to the following steps: (i) Pull the image; (ii) launch a pod; (iii) create a container; (iv) start a container. The circles outline multiple points of attack in the workflow: The container shim may pass a compromised (1) UVM image or (2) guest agent during UVM creation. The container manager can (3) alter or fabricate malicious-layer virtual hard disks (VHDs) and/or (4) mount any combination of layers onto the UVM. The container shim can pass (5) any set of layers to use for creating a container file system, as well as (6) any combination of environment variables or commands. A compromised host operating system can (7) tamper with local storage, (8) attack the memory of the UVM, or (9) manipulate remote communications. This list is not comprehensive.
CPLAT is a group of components built around the capabilities of containerd, a daemon that manages the complete container lifecycle as seen in Figure 2, from image pull and storage to container execution and supervision to low-level storage and network attachments. containerd supports the OCI image and runtime specifications15 and provides the substrate for multiple container offerings such as Docker, Kubernetes, and various public cloud offerings. Clients interact with containerd via a client interface. containerd supports running bare-metal containers (that is, those that run directly on the host) as well as those that run in a UVM.
VM-isolated containers are the focus here. CPLAT interfaces with a custom container shim running in the host operating system. The container shim interacts with (i) host services to bring up a UVM required for launching a new pod, and (ii) the guest agent running in the UVM. The guest agent is responsible for creating containers and spawning a process for starting the container. In essence, the container shim to guest-agent path allows the CPLAT components running in the host operating system to execute containers in isolated guest VMs. The execution of a VM-isolated container on Linux using CPLAT involves four high-level steps (as seen in Figure 2), equivalent to the four commands in Figure 3.
Threat model. Parma is designed to address the case of a strong adversary that controls the entire host system, including the hypervisor and the host operating system, along with all services running in it. That said, we trust the CPU package, including the PSP and the AMD SEV-SNP implementation, which provides hardware-based isolation of the guest’s address space from the system software. We also trust the firmware running on the PSP and its measurements of the guest VM, including the guest agent.
Such an adversary can:
Tamper with the container’s OCI runtime specification.
Tamper with block devices storing the read-only container image layers and the writable scratch layer.
Tamper with container definitions, including the overlay file system (that is, changing the layer order or injecting rogue layers); add, alter, or remove environment variables; alter the tenant command; and alter the mount sources and destinations from the UVM.
Add, delete, and make arbitrary modifications to network messages (that is, fully control the network).
Request execution of arbitrary commands in the UVM and in individual containers.
Request debugging information from the UVM and running containers, such as access to I/O, the stack, or container properties.
These capabilities would allow an adversary to gain access to the address space of the guest operating system without the protection of Parma.
Security architecture. Under this threat model, the goal is to provide strong confidentiality and integrity guarantees for the container and customer data. We have used the following principles to design a security architecture that provides those guarantees.
Hardware-based isolation of the UVM. The memory address space and disks of the VM must be protected from the CSP and other VMs by hardware-level isolation. Parma relies on the SEV-SNP hardware guarantee that the memory address space of the UVM cannot be accessed by host-system software.
UVM measurement. The UVM, its operating system, and the guest agent are cryptographically measured during initialization by the TEE, and this measurement can be requested over a secure channel at any time by tenant containers. The AMD SEV-SNP hardware performs the measurement and encodes it in the signed attestation report.1
Verifiable execution policy. The tenant must be provided with a mechanism to verify that the active execution policy (see later in the section on Execution Policy) in a container group is what it is expected to be. The execution policy is defined and measured independently by the tenant and is then provided to the CaaS deployment system. The host measures the policy (for example, using SHA-512) and places this measurement in the immutable host data of the report.1 The policy itself is passed to the UVM by the container shim, where it is measured again by the tenant and the user to ensure that its measurement matches the one encoded as host data in the report.
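As a concrete illustration of this check, the sketch below (in Go, with illustrative names; VerifyPolicyDigest is not an actual Parma API) compares the digest of the policy text against the host-data value reported by the hardware. It assumes the host-data field carries the full policy digest; the actual field width and digest algorithm depend on the platform and deployment.

```go
// Hypothetical sketch: check that the execution policy delivered to the UVM
// matches the digest placed in the attestation report's immutable host-data
// field at launch.
package policy

import (
	"bytes"
	"crypto/sha512"
	"errors"
)

// VerifyPolicyDigest returns nil only if the SHA-512 digest of the policy text
// equals the host-data bytes encoded in the SNP report.
func VerifyPolicyDigest(policyText, reportHostData []byte) error {
	digest := sha512.Sum512(policyText)
	if !bytes.Equal(digest[:], reportHostData) {
		return errors.New("execution policy does not match attested host data")
	}
	return nil
}
```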
Integrity-protected read-only file systems. Any block device or blob storage is mounted in the UVM as an integrity-protected file system. The file-system driver enforces integrity checking on access to the file system, ensuring that the host-system software cannot tamper with the data and container images. In Parma, a container file system is expressed as an ordered sequence of layers, where each layer is mounted as a separate device and then assembled into an overlay file system. First, as each layer is mounted, Parma verifies that the dm-verity root hash for the device matches a layer that is enumerated in the policy. Second, when the container shim requests the mounting of an overlay file system that assembles multiple layer devices, Parma verifies that the specific ordering of layers is explicitly laid out in the execution policy for one or more containers.
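The two checks just described can be sketched as follows (Go, with illustrative types; in Parma the enforcement points themselves live in the execution policy): a device may be mounted only if its dm-verity root hash appears in the policy, and an overlay may be assembled only if its ordered layer list matches some container in the policy.

```go
// Minimal sketch, not the actual guest-agent code, of the layer and overlay
// checks. Assumes a policy is a list of containers, each with an ordered list
// of dm-verity root hashes.
package policy

type Container struct {
	Layers []string // ordered dm-verity root hashes for this container's image
}

type Policy struct {
	Containers []Container
}

// DeviceAllowed reports whether a device with the given dm-verity root hash is
// enumerated as a layer of any container in the policy.
func (p *Policy) DeviceAllowed(rootHash string) bool {
	for _, c := range p.Containers {
		for _, layer := range c.Layers {
			if layer == rootHash {
				return true
			}
		}
	}
	return false
}

// OverlayAllowed reports whether the ordered sequence of layer hashes matches
// the layer ordering of at least one container in the policy.
func (p *Policy) OverlayAllowed(layers []string) bool {
	for _, c := range p.Containers {
		if len(c.Layers) != len(layers) {
			continue
		}
		match := true
		for i := range layers {
			if layers[i] != c.Layers[i] {
				match = false
				break
			}
		}
		if match {
			return true
		}
	}
	return false
}
```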
Encrypted and integrity-protected read/write file systems. Any block device or blob storage that holds privacy-sensitive data is mounted as an encrypted file system. The file-system driver decrypts the memory-mapped block upon access to the file system. The decrypted block is stored in hardware-isolated memory space, ensuring that host-system software cannot access the plain-text data. The writable scratch space of the container is mounted with dm-crypt and dm-integrity, and this is enforced by the execution policy. The encryption key for the writable scratch space is ephemeral and is provisioned initially in hardware-protected memory; it is erased once the device is mounted.
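The key lifecycle described here can be sketched as follows (Go; mountEncryptedScratch is a placeholder, not a real Parma or kernel API): an ephemeral key is generated inside the UVM, used to set up the encrypted device, and then erased.

```go
// Hedged sketch of the ephemeral scratch-key lifecycle: generate a random key
// inside the UVM, hand it to the (elided) dm-crypt + dm-integrity setup, and
// zero it once the device is mounted.
package scratch

import (
	"crypto/rand"
	"fmt"
)

// mountEncryptedScratch stands in for dm-crypt/dm-integrity device setup and
// mounting (for example, via device-mapper ioctls or cryptsetup); elided here.
func mountEncryptedScratch(devicePath, targetPath string, key []byte) error {
	return nil
}

func setupScratch(devicePath, targetPath string) error {
	key := make([]byte, 64) // ephemeral key; lives only in hardware-protected UVM memory
	if _, err := rand.Read(key); err != nil {
		return fmt.Errorf("generating ephemeral scratch key: %w", err)
	}
	if err := mountEncryptedScratch(devicePath, targetPath, key); err != nil {
		return err
	}
	// Erase the key once the device is mounted, as described above.
	for i := range key {
		key[i] = 0
	}
	return nil
}
```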
Remote attestation. Remote verifiers (for example, tenants, users, external services, attestation services) need to verify an attestation report so they can establish trust in a secure communication channel with the container group running in the UVM. They also need to verify that the UVM has booted the expected operating system, the correct guest agent, and further that the guest agent is configured with the expected execution policy.
In Parma, the UVM (including privileged containers) can request an attestation report using the secure channel established between the PSP and the UVM.1 The requester generates an ephemeral token—for example, a transport layer security (TLS) public-key pair or a sealing/wrapping public key—which is presented as a runtime claim in the report. The token’s cryptographic digest is encoded as report data. A remote verifier can then verify the following (sketched in code after the list):
The report has been signed by a genuine AMD processor using a key rooted to AMD’s root certificate authority.
The guest launch measurement and host data match the expected VM measurement and the digest of the expected execution policy.
The report data matches the hash digest of the runtime claim presented as additional evidence.
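A hedged sketch of these three checks, with an assumed report layout and a placeholder for AMD certificate-chain verification (a real verifier would parse the SEV-SNP report format and walk the VCEK/ASK/ARK chain); the digest algorithm for the runtime claim is likewise an assumption:

```go
// Illustrative sketch of the verifier-side checks listed above. The report
// structure and verifyAMDSignature helper are assumptions for illustration.
package verifier

import (
	"bytes"
	"crypto/sha512"
	"errors"
)

type SNPReport struct {
	Measurement []byte // guest launch measurement
	HostData    []byte // digest of the execution policy
	ReportData  []byte // digest of the runtime claim (e.g., a public key)
	Raw         []byte
	Signature   []byte
}

func Verify(report SNPReport, expectedMeasurement, expectedPolicyDigest, runtimeClaim []byte) error {
	// 1. Signed by a genuine AMD processor (chain rooted in AMD's root CA).
	if err := verifyAMDSignature(report.Raw, report.Signature); err != nil {
		return err
	}
	// 2. Launch measurement and host data match expectations.
	if !bytes.Equal(report.Measurement, expectedMeasurement) ||
		!bytes.Equal(report.HostData, expectedPolicyDigest) {
		return errors.New("unexpected UVM measurement or execution policy")
	}
	// 3. Report data matches the digest of the presented runtime claim.
	claimDigest := sha512.Sum512(runtimeClaim)
	if !bytes.Equal(report.ReportData, claimDigest[:]) {
		return errors.New("runtime claim does not match report data")
	}
	return nil
}

// verifyAMDSignature is a placeholder for certificate-chain and signature
// verification against AMD's root of trust.
func verifyAMDSignature(raw, sig []byte) error { return nil }
```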
Once the verification is complete, the remote verifier that trusts the UVM (including the guest operating system, guest agent, and execution policy) then trusts that the UVM and the container group within it will not reveal the private keys from which the public tokens have been generated (for example, TLS private key, sealing/wrapping private key). The remote verifier can use the runtime claim accordingly.
For example:
A TLS public key can be used to establish a TLS connection with the attested container group. As such, the remote verifier can trust there is no replay or man-in-the-middle attack.
A sealing public key can be used to seal (via encryption) a request or response intended only for the attested containers.
A wrapping public key can be used by a key-management service to wrap and release encryption keys required by the VM’s container group for decrypting remote blob storage. As such, the remote verifier can hold that only trustworthy and attested container groups can unwrap the encryption keys. Figure 4 illustrates this process, a typical attestation workflow.
In Figure 4, a container (key in circle) attempts to obtain and decrypt the user’s data for use by other containers in the group. The key has been previously provisioned into a key-management service with a defined key-release policy. It follows these steps:
The container in the UVM requests that the PSP issue an attestation report, including a Rivest-Shamir-Adleman (RSA) public key as a runtime claim.
The report and additional attestation evidence are provided to the attestation service, which verifies that the report is valid and then provides an attestation token that represents platform, initialization, and runtime claims.
Finally, the attestation token is provided to the key-management service, which returns the user’s key to the container wrapped using the provided RSA public key only if the token’s claims satisfy the key-release policy statement.
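As an illustration of the final step, the following sketch shows the container-side unwrap of the released key. RSA-OAEP is assumed here as the wrapping scheme; the article does not specify the exact mechanism.

```go
// Hedged sketch of the last step in Figure 4: the container unwraps the key
// released by the key-management service using the RSA private key whose
// public half was bound to the attestation report.
package release

import (
	"crypto/rand"
	"crypto/rsa"
	"crypto/sha256"
)

// UnwrapReleasedKey decrypts the wrapped data-encryption key returned by the
// key-management service. priv is the ephemeral key pair generated inside the
// UVM before the attestation report was requested.
func UnwrapReleasedKey(priv *rsa.PrivateKey, wrapped []byte) ([]byte, error) {
	return rsa.DecryptOAEP(sha256.New(), rand.Reader, priv, wrapped, nil)
}
```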
Execution policy. As discussed earlier in this article, we do not trust the container shim, as it could be under the control of an attacker. This implies that any action that the container shim requests the guest agent undertake inside the UVM is suspect (see the Threat Model section for a list of malicious host actions). Even if the current state of the container group is valid, the host may choose to compromise it in the future, and thus the attestation report cannot be used as a gate for access to secure user data. The attestation report on its own simply records the UVM operating system, guest agent, and the container runtime versions in use. It makes no claims about the container group the host will subsequently orchestrate.
For example, the host can start the user container group in a manner expected by an attestation service until it acquires some desired secure information. It can then load a series of containers that open the container group to a remote code execution attack. The attestation report, obtained during initialization, cannot protect against this. Even updating it by providing additional runtime data to the PSP1 does not help, because the vulnerability is added by the host after the attestation report has been consumed by the external service.
The concept of an execution policy addresses this vulnerability. Written by the tenant, the execution policy describes which actions the guest agent is allowed to take throughout the lifecycle of the container group. The guest agent is altered to consult the policy before taking any of the actions in Table 1, providing it with the information needed to make a decision. Each of these actions has a corresponding enforcement point in the execution policy, which will either allow or deny the action. In our implementation of Parma for Confidential ACI, the policy is defined using the Rego policy language.16 A sample enforcement point implemented in Rego can be seen in the code snippet.
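Although the enforcement points themselves are Rego rules, their shape can be illustrated in Go: the guest agent calls into the policy before acting on a host request and proceeds only if the policy allows it. The SecurityPolicyEnforcer interface and its method names below are assumptions for illustration, not the actual API.

```go
// Illustrative Go sketch of how the guest agent consults the policy before
// acting on a host request; in Parma the decision logic behind each method is
// delegated to the Rego policy.
package agent

import "fmt"

// SecurityPolicyEnforcer exposes one enforcement point per action in Table 1.
type SecurityPolicyEnforcer interface {
	EnforceDeviceMount(target, deviceHash string) error
	EnforceOverlayMount(containerID string, layers []string, target string) error
	EnforceCreateContainer(containerID string, cmd, env []string, workingDir string) error
	// ... one method per enforced action
}

// mountDevice is called when the host-side container shim requests a device
// mount; the request is denied unless the policy allows it.
func mountDevice(enforcer SecurityPolicyEnforcer, target, deviceHash string) error {
	if err := enforcer.EnforceDeviceMount(target, deviceHash); err != nil {
		return fmt.Errorf("policy denied device mount: %w", err)
	}
	// ... proceed with the actual dm-verity mount only after the policy allows it
	return nil
}
```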
Enforced Actions | Command Properties
---|---
Mount a device | Device hash, target path
Unmount a device | Target path
Mount overlay | ID, path list, target path
Unmount overlay | Target path
Create container | ID, command, environment, working directory, mounts
Execute process (in container) | ID, command, environment, working directory
Execute process (in UVM) | ID, command, environment, working directory
Shutdown container | ID
Signal process | ID, signal, command
Mount host device | Target path
Unmount host device | Target path
Mount scratch | Target path, encryption flag
Unmount scratch | Target path
Get properties | –
Dump stacks | –
Logging (in the UVM) | –
Logging (containers) | –
The policy actions shown in Table 1 are those that must be enforced by the execution policy. Each action is provided with the listed properties of the command as part of the enforcement query. The list is specific to our implementation, but given standardization around containerd, it should be applicable to most scenarios. First, some actions pertain to the creation of containers. By ensuring that any device mounted by the guest has a dm-verity root hash listed in the policy, and that such devices are combined into overlay file systems in layer orders that coincide with specific containers, we first establish that the container file systems are correct. The container can then be started, further ensuring that the environment variables and start command comply with policy and that mounts from the UVM to the container are as expected (along with other container-specific properties). Other actions proceed in a similar manner, constraining the container shim’s control over the guest agent.
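A hedged sketch of the container-creation check just described: the command, environment variables, working directory, and UVM-to-container mounts requested by the host must match a container the tenant listed in the policy. The types and matching rules (exact string equality) are illustrative assumptions; a real policy may allow patterns or defaults.

```go
// Minimal sketch of container-creation enforcement against a tenant-authored
// policy; not the actual Parma implementation.
package policy

import "errors"

type ContainerSpec struct {
	Command    []string
	EnvVars    []string
	WorkingDir string
	Mounts     []string // allowed UVM source paths
}

func equalStrings(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func contains(set []string, v string) bool {
	for _, s := range set {
		if s == v {
			return true
		}
	}
	return false
}

// EnforceCreateContainer allows creation only if some policy container matches
// the requested command, environment, working directory, and mounts.
func EnforceCreateContainer(policy []ContainerSpec, cmd, env []string, workingDir string, mounts []string) error {
	for _, c := range policy {
		if !equalStrings(c.Command, cmd) || c.WorkingDir != workingDir {
			continue
		}
		ok := true
		for _, e := range env {
			if !contains(c.EnvVars, e) {
				ok = false
				break
			}
		}
		for _, m := range mounts {
			if ok && !contains(c.Mounts, m) {
				ok = false
			}
		}
		if ok {
			return nil
		}
	}
	return errors.New("container creation request does not match any policy entry")
}
```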
A novel feature of our implementation is the ability for a policy to manipulate its own metadata state (maintained by the guest agent). This provides an attested mechanism for the execution policy to build a representation of the state of the container group, which allows for more complex interactions. For example, in the rule shown in the sample, the enforcement point for mounting a device creates a metadata entry for the device, which will be used to prevent other devices from being mounted to the same target path.
The result of making this small change to the guest agent is that the state space of the container group is bounded by a state machine, in which transitions between states correspond to the actions described earlier. Each transition is executed atomically and comes with an enforcement point.
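The metadata mechanism can be sketched as a small, atomically updated store consulted by later enforcement decisions; here it records which device is mounted at each target path and rejects a second mount onto an occupied target. Names are illustrative, not the actual implementation.

```go
// Minimal sketch of policy metadata: each enforcement decision can record
// state that later decisions consult, turning the container group into an
// explicit state machine with atomic transitions.
package policy

import (
	"fmt"
	"sync"
)

type metadataStore struct {
	mu     sync.Mutex
	mounts map[string]string // target path -> dm-verity root hash
}

// recordDeviceMount performs the state transition atomically: it fails if the
// target path is already occupied, otherwise it records the new mount.
func (m *metadataStore) recordDeviceMount(target, deviceHash string) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if existing, ok := m.mounts[target]; ok {
		return fmt.Errorf("target %s already has device %s mounted", target, existing)
	}
	if m.mounts == nil {
		m.mounts = make(map[string]string)
	}
	m.mounts[target] = deviceHash
	return nil
}
```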
Security Invariant
The state machine starts as a system that is fully measured and attested, including the execution policy with all its enforcement points, with the root of trust being the PSP hardware (the base case of the induction, n = 1). The set of enforced commands in Table 1 is designed to govern all state transitions in the container group. Since this policy is part of the attestation, the security properties of that initial state are preserved with each transition.
This security comes with some caveats. First, Parma cannot protect the user against compromised containers that the tenant has measured and included in its policies. It protects the tenant against malicious actions on the part of the CSP and forces the tenant to take only those actions explicitly listed in the execution policy made available to the user for measurement. Parma cannot protect the tenant from itself. Second, the security provided is worthwhile only to the degree that the tenant and its users make use of the attestation report. That is, the user must verify that the attestation report received from Parma is bound to a UVM and an execution policy that uses the data in an acceptable manner.
Evaluation
Figure 5 displays latency histograms from the Nginx benchmark. The results in Table 2 are from redis, SPEC2017, and Triton benchmarks. The redis values are geometric means in units of thousands of requests per second—higher is better. The SPEC2017 values reported here are the base ratio from reportable benchmark runs—higher is better. The reported values for Nvidia Triton are inferences per second—higher is better. In all cases, introducing Parma has a negligible effect compared to SEV-SNP on its own.
Configuration | Redis (1,000 Reqs/s) | SPEC2017 (Base Ratio) | Triton (Inferences/s)
---|---|---|---
Base | 68 | 8.43 | 16.42
SNP | 56 | 7.33 | 12.62
Parma | 55 | 7.30 | 12.58
Benchmarking tools were used to evaluate several typical containerized workloads for reductions in computation throughput, network throughput, and database transaction rates to ensure that Parma does not introduce significant computational overhead. In all cases, Parma provided confidentiality for containerized workloads with minimal costs to performance (typically less than 1% additional overhead vs. running in an enclave).
Experiments were performed using three configurations, shown in Table 2:
Base—A baseline benchmark container running outside the SEV-SNP enclave.
SEV-SNP—The same container running in the SEV-SNP enclave.
Parma—The same container again in the enclave and with an attested execution policy.
Each benchmarking experiment used two machines:
Dell PowerEdge R7515 with an AMD EPYC 7543P 32-core processor and 128 GB of memory for hosting the container runtime. This machine ran Windows Server 2022 Datacenter (22H2) and an offline version of the Azure Container Instances CPLAT (that is, containerd, cri, and hcsshim). The UVM ran a patched version of Linux 5.15, which includes AMD and Microsoft patches to provide AMD SEV-SNP enlightenment.
A benchmarking client (to avoid the impact of any benchmarking software on the evaluation) with the same configuration, connected to the Dell PowerEdge on the same subnet via 10Gb Ethernet. This machine ran Ubuntu 20.04 with Linux kernel 5.15.
nginx. Web services are a common use case for containerization, so we benchmarked the popular nginx Web server using the wrk2 benchmarking tool.22 Each test ran for 60s on 10 threads, simulating 100 concurrent users making 200 requests per second (for a total of 12,000 requests per test). The tests were repeated 20 times for each of the three configurations, and latency was measured. The results are shown in Figure 5. The histograms were composed of all latency samples gathered. As expected, introducing SEV-SNP increased latency, and adding the execution policy increased it only slightly further.
redis. The in-memory key/value database redis provides another useful benchmark for containerized computation. It supports a diverse set of data structures, such as hashtables, sets, and lists. The benchmarking used the provided redis-benchmark with 25 parallel clients and a subset of tests, the results of which can be seen in Table 2. Looking at the geometric mean over all actions, we see a performance overhead of 18% added by operating in the AMD SEV-SNP enclave, and a further 1% when using Parma. The performance overhead can be attributed to increased TLB pressure arising from large working sets, which exhibit poor temporal reuse in TLBs and trigger page-table walks. In SEV-SNP-enabled systems, page-table walks incur additional metadata checks (in the reverse map table) to ensure that the page is indeed owned by the VM.
SPEC2017. Parma was also evaluated by measuring the computation performance overhead using the SPEC2017 intspeed benchmarks.19 These benchmark programs were compiled and run on the bare-metal hardware. When containerized, they were provided with 32 cores and 32GB of memory. As can be seen in Table 2, AMD SEV-SNP adds a performance overhead of 13% on average, and Parma adds less than 1% on top of this.
Nvidia Triton Inference Server. Finally, Parma was evaluated by running a machine-learning (ML) inference workload based on Nvidia’s Triton Inference Server.14 Models (trained offline) and their parameters were used to serve requests via a REST API.
The confidential ML inference server was deployed in a container group comprising two containers: a Triton inference container (unmodified) and a sidecar container that mounted an encrypted remote blob (holding the ML model) using dm-crypt and dm-integrity. (The sidecar container also implemented the attestation workflow described in Figure 4 to release the encryption key.) The file system and the precomputed ML model were made available to the Triton inference container.
Nvidia’s perf-analyzer system was used to evaluate the inference servers. This allowed measurement of the overhead introduced by SEV-SNP and Parma, as shown in Table 2. For each of the three configurations (Base, SNP, and Parma), four experiments were run with one to four concurrent clients making continuous inference requests. As before, the additional overhead from Parma is approximately 1%. The overheads share the same root cause as described in the section on redis: increased TLB pressure.
Conclusion
The experiments presented here demonstrate that Parma, the architecture that drives confidential containers on Azure container instances, adds less than one percent additional performance overhead beyond that added by the underlying TEE (that is, AMD SEV-SNP). Importantly, Parma ensures a security invariant over all reachable states of the container group rooted in the attestation report. This allows external third parties to communicate securely (via remote attestation) with containers, enabling a wide range of containerized workflows that require confidential access to secure data. Companies obtain the advantages of running their most confidential workflows in the cloud without having to compromise on their security requirements. Tenants gain flexibility, efficiency, and reliability; CSPs get more business; and users can trust that their data is private, confidential, and secure.