In the beginning, there was PCIe. Well, really there was PCI and PCI-X, which were superseded by PCIe in 2003, and many others, such as ISA and VME, before them, but PCIe is a superset of them all. They are all interconnects, allowing a host (for example, the main system CPU) to configure and manipulate connected peripheral devices and map their memory into a shared address space.
In time, computations became bigger and more complicated, and peripheral devices became whole systems unto themselves. Graphics processing units (GPUs) are the best example, going from hardwired graphics offload devices to full-blown general-purpose processors that cooperate and communicate with the host to solve problems.
Cooperative processing between the host and device is complicated by PCIe's lack of coherent memory sharing. When CPU cores share memory, they use a cache coherency protocol to ensure they can have a fast local copy (a cache) while keeping a coherent view of memory—even when other cores write to it. PCIe does not support this kind of sharing; it only allows block transfers between host and device. Various companies created successor protocols—CCIX, OpenCAPI, and Gen-Z—to support this, but they have all expired or been subsumed by Intel's Compute Express Link (CXL).
CXL provides new protocols on top of PCIe for accelerator devices to cache host memory (CXL.cache) and for hosts to cache device memory (CXL.mem). The industry is currently focused on CXL.mem memory expansion devices. The first CXL-compatible CPUs (released in November 2022) support "CXL 1.1+ for memory expansion,"1 and CXL accelerators have not been announced—only CXL.mem devices, such as Samsung's 512GB RAM expansion.19 CXL 3.0, released in August 2022, adds support for fabric topologies connecting many hosts to many shared Global Fabric Attached Memory (GFAM) devices. This facilitates disaggregated memory, where an arbitrary number of endpoints connected in an arbitrary topology can request, use, and coherently share arbitrary amounts of memory.
If disaggregated memory is the future, our biggest question is that of protection. With so many endpoints all connecting to and sharing the same memory, how can they be restricted to accessing only the memory they need? They may be running untrusted software or untrusted hardware. How can memory protection work in this threat environment? The Capability Hardware Enhanced RISC Instructions (CHERI) project has shown that architectural capabilities can provide flexible, fine-grained memory protection.21 How does CXL's current memory protection compare? Could a capability system work in CXL's distributed setting with malicious actors? To start, let's examine CXL's protection mechanisms and see how well they handle real-world security problems.
In most cases, software uses physical resources through multiple layers of abstraction. Each layer translates incoming requests to a format expected by the next layer down, and can also provide protection. A simple example is the memory management unit (MMU), which translates memory requests from virtual to physical memory.13 The OS gives each process a different mapping of virtual to physical addresses, and the MMU ensures processes can access only the physical memory that the OS has mapped in. To generalize, protection systems ensure actors can only access valid resources.
The protection a system can provide is limited by the granularity of its actors and resources; therefore, protection at multiple layers of abstraction is important. For example, the MMU only has insight at the process level. The software inside the process has tighter definitions of valid (for example, "I will not access out-of-bounds array elements") that the MMU doesn't understand (it doesn't know or care where the array is) and thus cannot help with. Instead, another layer can be added above the MMU, such as a language runtime (JVM, .NET) or hardware-based checks (CHERI21), which have more information and ensure validity at a finer-grained level.
Different levels of abstraction can add different sets of actors and resources. For example, an operating system is responsible for ensuring its processes access files correctly—and for actually performing those accesses through the file system driver. If those files are on a networked file system, the server may have to handle multiple clients at once and check that they access files correctly. The individual OS does not know about the other clients, and the server does not know about the processes running inside the OS, so having protections and checks at both levels is necessary.
CXL and the flaws therein. CXL, like PCIe, uses a host-device model. Each CXL host controls a set of connected peripheral devices, and maps all the memory they expose into a host physical address space (HPA). The host may also map its own memory into the HPA, and accelerator devices like GPUs can access it over CXL.cache, but current devices just expose RAM to the host over CXL.mem. CXL 3.0 upgraded CXL.mem to allow hosts to share memory regions through both multi-headed and GFAM devices.
Multi-headed CXL.mem devices connect to multiple hosts and can map the same regions of physical memory into all of their HPAs at the same time. Those hosts can all cache parts of those regions, and the device is responsible for ensuring coherency (see Figure 1). For example, if host 1 tries to write to a cache line in region A, the device realizes that hosts 2 and 3 share A and tells them to invalidate that cache line. Unfortunately, each of those hosts can only access 16 regions8 (Sec 2.5), so they will necessarily be large—on the order of gigabytes or hundreds of megabytes.
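The invalidation step in that example can be sketched as a toy directory kept by the device. This is a minimal illustration, not the CXL coherency protocol itself: the class name, the method names, and the idea of tracking sharers per (region, line) pair are all assumptions made for the sketch.

```python
class MultiHeadedDevice:
    """Toy directory tracking which hosts cache each line of a shared region.

    Hypothetical model for illustration only; it does not follow the
    message formats or state machine of the CXL specification.
    """

    def __init__(self, sharers_by_region):
        # region name -> set of host ids that have the region mapped
        self.sharers_by_region = sharers_by_region
        # (region, line index) -> set of host ids holding a cached copy
        self.cached_by = {}

    def _check_mapped(self, host, region):
        if host not in self.sharers_by_region.get(region, set()):
            raise PermissionError(f"host {host} has not mapped region {region}")

    def read(self, host, region, line):
        """Record that `host` now caches this line."""
        self._check_mapped(host, region)
        self.cached_by.setdefault((region, line), set()).add(host)

    def write(self, host, region, line):
        """Return the hosts that must invalidate their copy of this line."""
        self._check_mapped(host, region)
        holders = self.cached_by.get((region, line), set())
        to_invalidate = sorted(holders - {host})
        # After the write, only the writer holds a valid copy.
        self.cached_by[(region, line)] = {host}
        return to_invalidate


device = MultiHeadedDevice({"A": {1, 2, 3}})
device.read(2, "A", 0)
device.read(3, "A", 0)
print(device.write(1, "A", 0))  # hosts 2 and 3 must drop the line
```

Even in this toy form, the device has to keep per-line state for every sharer, which hints at why the number of regions and sharers is bounded by hardware resources.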
GFAM devices take this a step further by not being attached to specific hosts. Any host can map GFAM memory into its HPA, and any endpoint (host or device) in that HPA can talk to the GFAM directly and access that memory. The GFAM is configured with separate translation tables for each endpoint, so each endpoint can access eight regions of physical memory8 (Sec 22.214.171.124). These regions may overlap, allowing memory sharing, or they may be isolated. As shown in Figure 2, 10GiB of GFAM is mapped, but the host and accelerator are configured so they see only 6GiB each, with a 2GiB shared region. Again, because each endpoint has few ranges, they will be large. Memory groups8 (Sec 126.96.36.199) can punch holes in these ranges and hide specific blocks, but those blocks are at least 64MB8 (Table 7-67, min. block size).
Both kinds of memory provide protection through nonexhaustive translation: Endpoints request addresses in their HPA, which get translated to local device addresses, and that translation may fail (that is, the endpoint may not have memory mapped at that address). These mechanisms, similar to an MMU, provide inflexible coarse-grained protection. At most, each endpoint can access 16 memory ranges per device. The only way to change the mappings and transfer access rights is to convince the Fabric Manager, which has no defined interface for this8 (Sec. 7.6.1).
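A minimal sketch of nonexhaustive translation follows, assuming a per-endpoint table of at most eight ranges as described for GFAM. The `Range` and `TranslationTable` names and the specific address layout are hypothetical; the layout is chosen only so that two endpoints each see 6GiB of a 10GiB device with a 2GiB overlap, matching the numbers in the text.

```python
from dataclasses import dataclass

GIB = 1 << 30
MAX_RANGES = 8  # per-endpoint limit quoted in the text for GFAM


@dataclass(frozen=True)
class Range:
    hpa_base: int  # where the range starts in the endpoint's HPA
    size: int      # length of the range in bytes
    dpa_base: int  # where it starts in the device's own address space


class TranslationTable:
    """One endpoint's view of the device (hypothetical structure)."""

    def __init__(self, ranges):
        if len(ranges) > MAX_RANGES:
            raise ValueError("too many ranges for one endpoint")
        self.ranges = list(ranges)

    def translate(self, hpa):
        """Return the device address for `hpa`, or None if unmapped."""
        for r in self.ranges:
            if r.hpa_base <= hpa < r.hpa_base + r.size:
                return r.dpa_base + (hpa - r.hpa_base)
        return None  # translation fails: the access is rejected


# Assumed layout: host sees device bytes 0..6GiB, accelerator sees 4..10GiB,
# so device bytes 4..6GiB are shared between them.
host = TranslationTable([Range(0, 6 * GIB, 0)])
accel = TranslationTable([Range(0, 6 * GIB, 4 * GIB)])

shared = host.translate(5 * GIB) == accel.translate(1 * GIB)
print(shared, host.translate(7 * GIB))
```

Protection here is exactly the set of ranges in the table: sharing a single small buffer would cost one of the eight entries, which is why the ranges end up large.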
CXL 3.0 also introduced Unordered I/O requests, which allow accelerators to access other devices' memory, but there is no standardized way to protect those accesses. It may be possible to prevent specific devices from interacting at all (for example, through PCIe Access Control Services) or add MMU-like protection (for example, through PCIe Address Translation Services) but these, like CXL's other protection models, are inflexible and coarse-grained.
CXL's protection isn't great. Endpoints can be configured to access and share large memory regions, but cannot share many small ones. Endpoints can't grant each other access to memory; they have to go through an intermediary. Device-to-device access has to rely on vendor-defined protection, if any. How does that stack up against real-world threats?
Threats in the datacenter. First, we can understand the datacenter threat model from a whitepaper published in November 2022 by Amazon Web Services (AWS) about their Nitro platform.2 Cloud systems have to run workloads from many clients, who don't trust each other, on the same hardware. Before Nitro, AWS would run all client workloads as virtual machines (VMs) atop a hypervisor, which exposed isolated virtualized resources to each VM. For example, the hypervisor would implement a software model of a network card for each VM, so it could control which networks the VMs could access. The key impact of the Nitro system is moving this virtualization out of the hypervisor and into the hardware.
Each Nitro system is controlled by a custom Nitro Controller PCIe card. This is the hardware root of trust, responsible for configuring the System Main Board (that is, the CPU, motherboard, and RAM) and other peripherals before running client workloads. Networks and storage are accessed through other AWS-designed Nitro PCIe cards, which the Nitro Controller can split into Virtual Functions using PCIe single-root I/O virtualization (SR-IOV)16 to provide isolated resources for each VM.
When running many VMs, a minimal hypervisor is still necessary to configure the MMU and link each VM to its dedicated virtual functions. A Nitro system can also run bare-metal (a single client workload without a hypervisor). Even though the client workload is untrusted, the Nitro cards still virtualize access to networks and storage.
AWS trusts the Nitro controller to bring up the system, the Nitro cards to virtualize networks/storage, and the hypervisor/MMU to enforce isolation between VMs. Client workloads cannot be trusted, and if they're running bare-metal, then any communication from the System Main Board cannot be trusted either. From CXL's perspective, this means a host could be malicious (running bare-metal) or be responsible for many malicious workloads (running VMs). In the latter case, CXL does not have any constructs that can help with virtualization. In fact, CXL does not consider virtualization at all—literally, virtualization and similar terms are not in the specification.
Datacenters have further complications. Accelerator devices, such as GPUs, sometimes rely on directly sharing memory for high performance. Nvidia's Magnum I/O APIs18 allow GPUs to directly access NVMe storage devices (GPUDirect Storage), share memory with other GPUs (NVSHMEM), and expose their memory to other peripherals (GPUDirect RDMA), including InfiniBand adapters (nvidia-peermem).
While some GPUs nominally support virtualization through SR-IOV, AWS does not take advantage of this—client workloads are given whole numbers of GPUs and control them directly (clients even control the GPU drivers3). This expands the threat model. Not only are GFAMs sharing memory across HPAs, but also individual devices (including accelerators) may expose their memory to endpoints controlled by malicious clients.
CXL does not handle this use case. It implicitly assumes that hosts and devices are trustworthy. Hosts may be trustworthy if they have, for example, a hypervisor keeping them in check, and devices may be carefully chosen for trust, but if any device or host is untrustworthy (for example, running bare-metal client workloads) better protection is needed.
Threats in the consumer space. The threat of malicious devices is not exclusive to the datacenter—in fact, it's much worse for consumers! Desktops and laptops have a plethora of external ports for connecting arbitrary hardware, including high-performance accelerators, such as external GPUs. Accelerators take advantage of high-speed Thunderbolt connections that wrap PCIe, giving external hardware access to the internal PCIe memory map. Attacks on PCIe-based systems through Thunderbolt have already been demonstrated,15 showing that malicious hardware can access sensitive memory intended for other devices, even with protections such as IOMMU enabled.
Worse, direct device-to-device memory accesses are making their way to consumer systems as well. Modern game consoles depend on high-speed transfers from storage to GPU-accessible memory, and Microsoft's DirectStorage API brings this closer to reality on PCs. At the time of writing, DirectStorage still copies data through a buffer in system RAM, but it seems inevitable that high-performance rendering systems (for example, games and video editing) will eventually take advantage of direct access, especially because it's already possible in the datacenter.
CXL is coming to the consumer market, so it needs to handle this. In an AMD "Meet the Experts" webinar4 in October 2022, an AMD representative said it might come to consumer devices within five years, initially with a focus on connecting persistent storage and RAM. Loading from persistent storage is currently the big use case for device-to-device transfer, so CXL needs to consider malicious devices sooner rather than later.
Today, CXL's memory protection is inflexible at best. It is capable of isolating endpoints in large memory regions, but not much more than that. It has no capacity for virtualization for workloads running on the same endpoint, and cannot protect devices from each other.
CHERI21 is a capability-based protection system that has proven useful both for flexible, fine-grained (tens of bytes) memory protection and for compartmentalization, by sandboxing programs and libraries from each other.22 This seems to address all of CXL's security issues. Could CXL adopt a capability-based system?
Capabilities are unforgeable tokens that encode the authority to access a resource. Given a capability, an actor can access the resource, derive new capabilities for that resource with reduced permissions, transfer them to other actors, and potentially revoke them if those actors no longer need access. Because access rights are encoded directly in the token, capabilities are very flexible: It's easy to derive new capabilities with extremely specific access rights for new situations. Deriving lots of capabilities does have a downside: Revoking a capability—recursively deleting all derivations—can be more difficult. Let's examine a few examples.
Central-trust systems. Capabilities must be unforgeable. When a capability is used, the system needs some way to verify it has not been forged. The simplest way to enforce this is to store all capabilities and perform all capability modifications in a centralized trusted base, or a central-trust system.
One example is FreeBSD Capsicum,20 which protects files from processes by replacing Unix file descriptors with capabilities. A process can open the files it needs, limit its access with more granular permissions, and then enter capability mode to sandbox itself with those files. Like file descriptors, capabilities are stored in tables in OS memory. Userspace programs have to use syscalls to ask the OS to manipulate them, instead of creating or modifying them directly. The OS trusts itself to correctly modify capabilities (for example, never adding permissions, only taking them away) so capabilities cannot be forged. Although Capsicum doesn't perform revocation, in principle it would simply require searching the tables or even tracking parentage in capability metadata. This provides better security than plain Unix, but syscalls and context-switching to the OS can be slow.
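The central-trust idea can be sketched as a table that only the trusted base may touch, with actors holding opaque handles. This is an illustration under stated assumptions, not Capsicum's API: the `CapabilityTable` class and its methods are hypothetical, and the revocation shown is the parentage-tracking approach the text describes as possible in principle, which Capsicum itself does not perform.

```python
class CapabilityTable:
    """Trusted table of capabilities; actors only ever see integer handles."""

    def __init__(self):
        # handle -> (resource, frozenset of rights, parent handle or None)
        self._entries = {}
        self._next_handle = 0

    def _insert(self, resource, rights, parent):
        handle = self._next_handle
        self._next_handle += 1
        self._entries[handle] = (resource, frozenset(rights), parent)
        return handle

    def create(self, resource, rights):
        """Mint a root capability (done by the trusted base)."""
        return self._insert(resource, rights, None)

    def derive(self, handle, rights):
        """Derive a child capability; rights can be removed, never added."""
        resource, parent_rights, _ = self._entries[handle]
        rights = frozenset(rights)
        if not rights <= parent_rights:
            raise PermissionError("derivation may only remove rights")
        return self._insert(resource, rights, handle)

    def check(self, handle, right):
        entry = self._entries.get(handle)
        return entry is not None and right in entry[1]

    def revoke(self, handle):
        """Delete a capability and, recursively, everything derived from it."""
        doomed = {handle}
        changed = True
        while changed:
            changed = False
            for h, (_, _, parent) in self._entries.items():
                if parent in doomed and h not in doomed:
                    doomed.add(h)
                    changed = True
        for h in doomed:
            self._entries.pop(h, None)


table = CapabilityTable()
root = table.create("logfile", {"read", "write"})
read_only = table.derive(root, {"read"})
print(table.check(read_only, "read"), table.check(read_only, "write"))
```

Every `derive`, `check`, and `revoke` here stands in for a call into the trusted base, which is where the syscall and context-switch cost mentioned above comes from.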
CHERI takes a different approach. Instead of implementing the trusted base in software, CHERI implements it in the hardware and adds machine instructions for fast capability manipulation. CHERI replaces pointers with capabilities—fat pointers that include the range of addresses the pointer may point to. This range ensures pointers stay within their original provenances6,12 and can be limited further (for example, you can allocate an array, derive a capability for one element, and pass that to a function without exposing the rest of the array).
Registers and memory use tag bits to mark valid capabilities, and the hardware controls the tag bits to prevent forgery. Because all pointers have this metadata, including code pointers, even the smallest software components (for example, individual functions) can be sandboxed with just the memory they need. Larger libraries, even ones compiled without CHERI support, can also be sandboxed using compartments.22 The cost of storing capabilities anywhere is that revocation must search everywhere,23 although the overheads are lower than you might expect.10 CHERI ensures logical software components can access only the virtual memory ranges to which they have been explicitly given permission.
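The bounds-narrowing behavior described above can be modeled in a few lines. This is a software sketch only: the `Capability` class and its fields are hypothetical, permissions are omitted, and the `tag` field merely stands in for the hardware tag bit. In particular, Python cannot stop code from constructing a `Capability` directly, so the unforgeability that real CHERI hardware enforces is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    """Fat pointer: an address range plus a validity tag (simplified)."""

    base: int
    length: int
    tag: bool = True  # stands in for the hardware-managed tag bit

    def restrict(self, base, length):
        """Derive a capability whose bounds lie inside this one's bounds."""
        if not self.tag:
            raise PermissionError("cannot derive from an untagged capability")
        if length < 0 or base < self.base or base + length > self.base + self.length:
            raise PermissionError("derived bounds must lie within the parent")
        return Capability(base, length)

    def load(self, memory, address):
        """Read one byte, checking the tag and bounds first."""
        if not self.tag:
            raise PermissionError("untagged capability")
        if not (self.base <= address < self.base + self.length):
            raise PermissionError("address outside capability bounds")
        return memory[address]


memory = bytearray(range(64))
array = Capability(base=16, length=8)       # an 8-byte array at address 16
element = array.restrict(base=18, length=1)  # access to one element only
print(element.load(memory, 18))
```

A function handed `element` can read that one byte but gets a `PermissionError` for any neighboring address, and `restrict` refuses any attempt to widen the bounds back out to the whole array.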
How could this help CXL? Eagle-eyed readers might notice GFAM already uses a system similar to Capsicum—each endpoint (that is, actor) has up to eight translation table entries (that is, capabilities) that grant access to memory. This demonstrates the flaws with a centralized system in this context: The number of capabilities (and implicitly their granularity) can be limited by hardware resources. This is more suitable for protecting host memory from a limited number of devices, for example,14 but GFAM tries to track all capabilities granted to thousands of actors. To alleviate this, one could store the capabilities in the memory exposed over CXL.mem or give each endpoint some dedicated capability memory, such as Capsicum's tables. Both cases would require trust to be distributed among the endpoints.
Distributed-trust systems. Barrelfish5,17 and SemperOS11 are distributed operating systems, implemented as separate instances running on separate cores and communicating with message passing. Barrelfish uses capabilities to protect OS resources, such as message passing and threading primitives, physical memory ranges, etc. SemperOS uses capabilities for an in-memory file system.
The trusted base for capability operations is distributed across the OS cores but aims to provide identical semantics to central-trust systems. Most importantly, any core can derive from a capability in any other core, and thus revocation may need to touch all cores. This requires all actors to trust each other. It is more complicated to reason about than central-trust systems, but it scales better—particularly if cross-actor operations are uncommon.
For CXL, this may be suitable if all endpoints are trustworthy. If, for example, all endpoints in a datacenter use CHERI-like hardware to manipulate capabilities, this could work. At scale, however, revocation may become a bigger issue, and CXL cannot rely solely on this model anyway—the threat of malicious endpoints is too great.
Decentralized systems. Even if calling out to a centralized trusted base to manipulate capabilities is impossible or impractical, and the actors cannot be trusted to manipulate capabilities correctly, there is still hope. Decentralized capabilities, such as macaroons,7 can be passed to untrusted actors and have their validity checked when those actors try to use them.
Macaroons provide access to a resource that is reduced through an append-only list of caveats. A macaroon begins with an identifier, such as "access transaction details," and a signature, made by hashing the identifier with a secret key. When a caveat (such as "for Alice's account," or "until 5PM EST") is added, that caveat is hashed with the current signature to make a new signature. The old signature is thrown away and cannot be reconstructed—the hash cannot be undone. Given a macaroon with a set of caveats, it's impossible to remove a caveat and recalculate the correct signature without the secret key. Therefore, it's impossible for a hostile user to forge a macaroon with fewer caveats (that is, more permissions).
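The signature chain can be sketched with Python's standard `hmac` module. This is a simplified illustration of the chaining idea only: the dictionary layout and the `mint`, `add_caveat`, and `verify` function names are assumptions, it uses HMAC-SHA256 as the keyed hash, and `verify` checks only that the signature matches the caveat list, not whether each caveat actually holds.

```python
import hashlib
import hmac


def _mac(key: bytes, message: bytes) -> bytes:
    """Keyed hash used for every link in the chain."""
    return hmac.new(key, message, hashlib.sha256).digest()


def mint(secret: bytes, identifier: bytes) -> dict:
    """Create a macaroon with no caveats; only the issuer knows `secret`."""
    return {
        "identifier": identifier,
        "caveats": [],
        "signature": _mac(secret, identifier),
    }


def add_caveat(macaroon: dict, caveat: bytes) -> dict:
    """Anyone holding a macaroon can append a caveat without the secret."""
    return {
        "identifier": macaroon["identifier"],
        "caveats": macaroon["caveats"] + [caveat],
        # The old signature becomes the key for the new one and is discarded.
        "signature": _mac(macaroon["signature"], caveat),
    }


def verify(secret: bytes, macaroon: dict) -> bool:
    """Issuer replays the chain from the secret and compares signatures."""
    signature = _mac(secret, macaroon["identifier"])
    for caveat in macaroon["caveats"]:
        signature = _mac(signature, caveat)
    return hmac.compare_digest(signature, macaroon["signature"])


secret = b"issuer-only-secret"
token = add_caveat(mint(secret, b"access transaction details"), b"account = alice")
stripped = {**token, "caveats": []}  # attacker drops the caveat, keeps signature
print(verify(secret, token), verify(secret, stripped))
```

Stripping a caveat fails verification because the attacker would need the previous signature, which was used as a key and never transmitted.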
Decentralized capabilities have not yet been integrated into low-level software or hardware. Macaroons were originally designed for the Web, so they have a text-based wire format and third-party authentication features, which a binary-based interconnect doesn't need. This is fine for the network layer (for example, Michael Dodson combined macaroons with CHERI for fine-grained memory-mapped I/O access over an insecure network9), but domain-optimized representations would be more space-efficient.
Revocation is also interesting. Capabilities could come with timeout caveats and require refreshing, or groups of capabilities (and all their derivations) could be revoked by throwing away their secret key. This would allow CXL endpoints to store and (attempt to) manipulate their capabilities themselves, and let the CXL.mem device revoke them, all without trusting them. Decentralized capabilities are robust to hostile actors, do not require centralized resources, and are ripe for further investigation.
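Both revocation strategies can be sketched together, assuming a device that keeps one secret key per group of capabilities. Everything here is hypothetical: the `MemoryDevice` class, the token layout, and the `|`-separated message encoding are illustrative choices, time is passed in as a plain integer, and a real design would need an unambiguous binary encoding.

```python
import hashlib
import hmac
import os


class MemoryDevice:
    """Issues signed tokens and can revoke them without tracking holders."""

    def __init__(self):
        self._keys = {}  # group name -> secret key, never leaves the device

    @staticmethod
    def _message(group, region, expires_at):
        return f"{group}|{region}|{expires_at}".encode()

    def issue(self, group, region, expires_at):
        """Sign a token for `region` that is valid until `expires_at`."""
        key = self._keys.setdefault(group, os.urandom(32))
        message = self._message(group, region, expires_at)
        signature = hmac.new(key, message, hashlib.sha256).digest()
        return (group, region, expires_at, signature)

    def revoke_group(self, group):
        """Discard the key: every token signed with it becomes unverifiable."""
        self._keys.pop(group, None)

    def check(self, token, now):
        group, region, expires_at, signature = token
        key = self._keys.get(group)
        if key is None or now >= expires_at:
            return False  # group revoked, or the timeout has passed
        message = self._message(group, region, expires_at)
        expected = hmac.new(key, message, hashlib.sha256).digest()
        return hmac.compare_digest(expected, signature)


device = MemoryDevice()
token = device.issue("vm1", "region-A", expires_at=100)
before = device.check(token, now=50)
device.revoke_group("vm1")
print(before, device.check(token, now=50))
```

The device stores one key per group rather than one entry per outstanding capability, so its state does not grow with the number of tokens endpoints hold or derive.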
Physical memory is accessed through many layers of abstraction. Applying protection at different layers, which are aware of different actors and use resources at different granularities, is essential. CHERI and MMUs offer great protection at the software and process level, but CXL's protection model has issues. It allows memory sharing, but only of a few large ranges instead of many small ones. It doesn't give actors a way to share new memory ranges with each other, instead relying on a central, underspecified Fabric Manager. Capabilities are inherently flexible—they can protect large and small memory ranges, and can be transferred directly between actors without a centralized authority—so they should be able to address these problems.
CXL initially targets the datacenter, with many endpoints sharing disaggregated memory. The protection is coarse-grained, and does not consider virtualization. VMs running on the same host have to rely on similarly coarse hypervisor- and MMU-based isolation. Fine-grained capabilities could allow individual VMs to share small memory regions directly. Capabilities for large memory regions could also enforce VM compartmentalization at the CXL layer, similarly to CHERI.
In datacenter and consumer systems, device-to-device memory sharing is becoming essential for high performance. CXL does not try to protect devices from each other at all, which is especially scary considering how powerful malicious devices already can be. Capabilities would provide a consistent interface for securely exposing regions of device memory. Decentralized capabilities are robust against malicious actors and could keep the peace in the Wild West of untrustworthy hardware. In a datacenter with trusted components, distributed-trust systems could even forgo the cryptography associated with decentralized capabilities for lower overheads.
Decentralized and distributed capabilities have a lot of potential, but they have not been used in this context yet and need further investigation. Even so, they could greatly benefit CXL, which is a new interconnect standard that provides the opportunity to build in better security from the start instead of retrofitting it. A domain-optimized decentralized capability system could work wonders, giving CXL fine-grained memory sharing and improving virtualization and device-to-device security. Interconnects must take security more seriously, and we believe capabilities can provide flexible and robust security for CXL and beyond.
We thank the CHERI project team led by Robert Watson for demonstrating the potential of capabilities for memory security, without which this work could not exist. The CHERI team also provided essential feedback while developing this article, for which we are extremely thankful. This work was supported by the University of Cambridge Harding Distinguished Postgraduate Scholars Programme, and by EPSRC grant EP/V000381/1 (CAPcelerate).
1. Advanced Micro Devices, Inc. Offering unmatched performance, leadership energy efficiency, and next-generation architecture, AMD brings 4th gen AMD EPYC processors to the modern data center. (2022); https://bit.ly/3KQU6OU.
2. Amazon Web Services. The security design of the AWS Nitro System. AWS whitepaper. (2023); https://go.aws/3P6UOdm.
3. Amazon Web Services; https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver.html.
4. AMD Meet the Experts Webinars. How AM5, DDR5 memory, and PCIe 5.0 support pave the way for next-gen gaming experiences; https://bit.ly/3QKm42J.
5. Baumann, A. et al. The Multikernel: A new OS architecture for scalable multicore systems. In Proceedings of the ACM SIGOPS 22nd Symp. Operating Systems Principles. (2009), 29–44; 10.1145/1629575.1629579.
6. Beingessner, A. Rust's unsafe pointer types need an overhaul. Faultlore. (2022); https://faultlore.com/blah/fix-rust-pointers/
7. Birgisson, A. et al. Macaroons: cookies with contextual caveats for decentralized authorization in the cloud. In Proceedings of the 2014 Network and Distributed System Security Symp.; https://bit.ly/45EKxe4.
8. CXL Consortium. Compute Express Link (CXL) Specification, revision 3.0, version 1.0. (2022); https://www.computeexpresslink.org/download-the-specification.
10. Filardo, N.W. et al. Cornucopia: Temporal safety for CHERI heaps. In IEEE Symp. on Security and Privacy. (2020), 608–625; https://ieeexplore.ieee.org/document/9152640.
11. Hille, M. et al. SemperOS: A distributed capability system. In Proceedings of the 2019 Usenix Annual Technical Conf. 709–722; https://www.usenix.org/conference/atc19/presentation/hille.
12. Jung, R. Pointers are complicated, or: What's in a byte? Ralf's ramblings. (2018); https://www.ralfj.de/blog/2018/07/24/pointers-and-bytes.html.
13. Kernel Development Community. Virtual memory primer; https://bit.ly/3YSr5Z1.
14. Markettos, A.T. et al. Position paper: Defending direct memory access with CHERI capabilities. In Hardware and Architectural Support for Security and Privacy 7, (2020), 1–9; 10.1145/3458903.3458910
15. Markettos, A.T. et al. Thunderclap: Exploring vulnerabilities in operating system IOMMU protection via DMA from untrustworthy peripherals. In Proceedings of the 2019 Network and Distributed System Security Symp.; 10/gjh62d.
16. Microsoft. An introduction to single root I/O virtualization (SR-IOV). (2021); https://bit.ly/3QRxvWw.
18. Nvidia. Magnum IO; https://www.nvidia.com/en-us/data-center/magnum-io/.
19. Samsung Semiconductor. Samsung Electronics introduces industry's first 512GB CXL memory module. (2022); https://news.samsung.com/global/samsung-electronics-introduces-industrys-first-512gb-cxl-memory-module.
22. Watson, R.N.M. et al. CHERI: A hybrid capability-system architecture for scalable software compartmentalization. In Proceedings of IEEE 2015 Symp. Security and Privacy. 20–37; https://ieeexplore.ieee.org/document/7163016.
23. Xia, H. et al. CHERIvoke: characterising pointer revocation using CHERI capabilities for temporal memory safety. In Proceedings of the 52nd Annual IEEE/ACM Intern. Symp. on Microarchitecture. (2019), 545–557; 10.1145/3352460.3358288.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2023 ACM, Inc.
Surely (especially with Cambridge authors!) there should be some reference to the CAP machine developed by David Walker at the computer lab in the 1970s. Maybe https://dl.acm.org/doi/abs/10.1145/1067625.806541 ?