It is the best of times and it is the worst of times in the world of datacenter memory technology. According to International Data Corporation (IDC), dynamic random-access memory (DRAM) revenues exceeded $100 billion in 2022. Yet, the anticipated growth rate is hugging the zero line, and many producers either reported loss-making quarters or are rumored to do so soon. From the perspective of datacenter customers, by some estimates, the cost of renting memory ranges from $20 to $30 per gigabyte per year, for a resource that costs only $2 to $4 to procure outright. On top of this, software as a service (SaaS) end users, for example, are forced to rent all the memory they will need up front. By some estimates, they end up using less than 25% of that memory more than 75% of the time.10
Compute Express Link (CXL), a new technology emerging from the hardware side,9 is promising to provide far memory. Thus, there will be more memory capacity and perhaps even more bandwidth, but at the expense of greater latency. Optimization will first seek to keep memory in far tiers colder, and, second, minimize the rates of both access into and promotion out of these tiers.1,5 Third, proactive promotion and demotion techniques being developed for far memory move whole objects instead of one cache line at a time, taking advantage of bulk caching and eviction to avoid repeatedly incurring the long latency of far memory. Finally, offloading computations with many dependent accesses to a near-memory processor is already being seen as a way to keep the latency of memory out of the denominator of application throughput.11 With far memory, this will be a required optimization.
Applications that operate over richly connected data in memory engage heavily in pointer-chasing operations either directly (for example, graph processing in deep-learning recommendation models) or indirectly (for example, B+ tree index management in databases). Figure 1 shows examples of pointer chasing in far memory: graph traversal, key lookup in a B+ tree index, and collision resolution under open hashing.
Data from previous work2 suggests that as data structures scale beyond the memory limits of a single host, causing application data to spill into far memory, programmers are forced to make complex decisions about function and data placement, intercommunication, and orchestration.
Performance characteristics of far memory. By default, pointers (like the internode ones in Figure 1) are defined in the virtual address space of the process that created them. Because of this, if left unoptimized, pointer-chasing operations and their dependent accesses can overwhelm the microarchitecture resources that support memory-level parallelism (for example, reorder buffers) even on a single CPU with local memory. With latencies that can range from 150ns to more than 300ns,2 far memory further compounds this problem.
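To make the latency argument concrete, here is a back-of-the-envelope sketch in Python. It assumes that every hop of a pointer chase depends on the previous one, so the accesses cannot overlap and the traversal time is roughly the number of hops multiplied by the access latency. The 150ns and 300ns figures are the far-memory latencies quoted above; the function names and the one-million-hop workload are illustrative assumptions, not measurements.

```python
def chase_time_ms(hops: int, latency_ns: float) -> float:
    """Time in milliseconds for `hops` strictly dependent memory accesses."""
    return hops * latency_ns / 1e6


def hops_per_second(latency_ns: float) -> float:
    """Upper bound on traversal rate when every hop pays the full latency."""
    return 1e9 / latency_ns


if __name__ == "__main__":
    # Hypothetical workload: one million dependent hops.
    for latency_ns in (150, 300):
        print(latency_ns, chase_time_ms(1_000_000, latency_ns),
              hops_per_second(latency_ns))
```

Under these assumptions, latency sits directly in the denominator of throughput: doubling the latency halves the number of hops a single dependent chain can complete per second.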
In a distributed setting, implementing a simple-minded pointer-chasing offload without taking care of virtual-to-physical address translation results in chatty internode coordination with the parent process.15 Effective optimization of pointer-chasing operations entails minimizing communication between the near-memory processor executing the traversal and the server running the parent process.
Developing far memory-ready applications. Evidence from high-performance computing (HPC) and database workloads points to the extreme inefficiency of pointer-rich sparse memory operations on CPUs and GPUs alike,4,14 in some cases hitting less than 1% of peak performance. This leads applications to want to offload such work to near-memory processors. In the case of far memory, that near-memory processor is itself outside the translation context of the parent process of the pointer-rich data. Pointers therefore must make sense everywhere in these new heterogeneous disaggregated systems.
In order to lower infrastructure rent, cloud applications also wish to exploit disaggregated far memory as a fungible memory resource that can grow and shrink with the amount of data. Moreover, they want to independently scale their memory and compute resources. For example, database services want to flex compute up or down in proportion to query load. Pointer-rich data in far memory must be shareable at low overhead between existing and new compute instances.
Pointers in traditional operating systems were valid only in the memory space of the process in which they were created. Sharing pointer-rich data among processes, nodes, and devices therefore required serialization-deserialization. This limitation remained even when prior art was recently extended by taking an approach of tombstoning dangling references to data demoted to far memory using special pointers.7,16 Those pointers could be dereferenced only from the original context of data creation, precluding independent scaling of memory and computation.
Global address spaces, such as partitioned global address space (PGAS), support a limited form of global pointers that persist only for the lifetime of a set of processes across multiple nodes. Nonvolatile memory (NVM) libraries such as Persistent Memory Development Kit (PMDK) support object-based pointers, but their "large" storage-format pointers are more than 64 bits long, and their traversal cannot be offloaded.
Commercial virtualization frameworks such as VMware's Nu proclets13 can only maintain the illusion of global pointers by compromising security (for example, by turning address space layout randomization off).
Microsoft CompuCache14 also supported global pointers, but by using a heavy database runtime atop full VMs even on disaggregated memory devices. All pointers, whether at hosts or in the CompuCache, are VM-local only. Pointer chasing across devices requires repeatedly returning to the host.
Teleport15 supported pointer-chasing offload to remote memory but by directed, on-demand shipping of the virtual-to-physical translation context to the target locale of each function shipped.
Prior work on OS constructs for far memory is therefore missing a foundation of globally invariant pointers that can be shared with and dereferenced by any node or device in a cluster containing far memory.
When organizing data at object granularity, a globally invariant pointer must contain the ID of the object containing the target data, as well as an offset to that data. This object ID must be interpreted anywhere the pointer can be dereferenced. Ideally, invariant pointers should be no larger than 64 bits and permit access to partially resident objects. Existing approaches do not meet the first criterion (for example, PMDK) or the second criterion (for example, application-integrated far memory, AIFM,12 which has a different pointer form for resident and nonresident objects).
Providing truly globally invariant pointers, however, is necessary for offloading "run anywhere" code.
Twizzler3 is an operating system that introduces globally invariant pointers by using a context local to the object in which the pointer is stored, shortening its representation while allowing any CPU that can read the pointer to fully resolve its destination. This is done using a foreign object table (FOT) that is part of each object in the system, ensuring any individual object is self-contained.
An object's FOT contains identifying information for each foreign object that is the destination for a pointer in the object. Since these are stored in an ordered table, stored pointers use the index into the FOT as a stand-in for the full addressing information, a translation process shown in Figure 2. This approach allows pointers to remain small: a 64-bit pointer can, for example, include a 24-bit "local" object ID and 40-bit offset. While this limits the number of foreign objects that can be referenced from a single object to 2^24, different objects have their own FOTs and can reference a different set of objects, so the total number of objects in the system is limited only by the size of an object ID.
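The 24-bit/40-bit split described above can be sketched in a few lines of Python. This is a minimal illustration, not Twizzler's implementation: the class and function names are hypothetical, the FOT is modeled as a plain list of full-length object IDs, and the use of index 0 to mean "the containing object" is an assumption of this sketch.

```python
from __future__ import annotations

from dataclasses import dataclass, field

FOT_INDEX_BITS = 24                      # "local" object ID (index into the FOT)
OFFSET_BITS = 40                         # offset within the target object
OFFSET_MASK = (1 << OFFSET_BITS) - 1
MAX_FOT_INDEX = (1 << FOT_INDEX_BITS) - 1


def encode_pointer(fot_index: int, offset: int) -> int:
    """Pack an FOT index and an offset into one 64-bit value."""
    if not 0 <= fot_index <= MAX_FOT_INDEX:
        raise ValueError("FOT index does not fit in 24 bits")
    if not 0 <= offset <= OFFSET_MASK:
        raise ValueError("offset does not fit in 40 bits")
    return (fot_index << OFFSET_BITS) | offset


def decode_pointer(ptr: int) -> tuple[int, int]:
    """Split a 64-bit pointer into (FOT index, offset)."""
    return ptr >> OFFSET_BITS, ptr & OFFSET_MASK


@dataclass
class MemObject:
    object_id: int                                # full-length global object ID
    fot: list[int] = field(default_factory=list)  # FOT slot -> foreign object ID

    def pointer_to(self, target_id: int, offset: int) -> int:
        """Build a pointer, stored in this object, to target_id at offset."""
        if target_id == self.object_id:
            index = 0  # assumption: index 0 denotes the containing object
        else:
            if target_id not in self.fot:
                self.fot.append(target_id)
            index = self.fot.index(target_id) + 1
        return encode_pointer(index, offset)

    def resolve(self, ptr: int) -> tuple[int, int]:
        """Translate a stored pointer into (global object ID, offset)."""
        index, offset = decode_pointer(ptr)
        target = self.object_id if index == 0 else self.fot[index - 1]
        return target, offset
```

Because the FOT travels with the object, any reader that can see both the pointer and the object's FOT can perform the `resolve` step without consulting the process that created the pointer.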
This approach also allows for a wide range of resolvers that translate identifying information in the FOT into an object ID. For example, the FOT might contain a static object ID or the equivalent of a file-system name to be resolved to an object ID by a name resolver. There is no requirement that a name resolve to the same object ID in different places: for example, an object named /var/log/syslog might resolve to different object IDs on different system nodes. Name resolvers themselves can be pluggable: The FOT need only identify the resolver in a way that any node in the system can run the resolver to return an object ID.
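The pluggable-resolver idea can be illustrated with a small Python sketch. Everything here is hypothetical: an FOT entry is modeled as a (resolver name, identifying information) pair, and each node keeps its own registry of resolvers and its own name table. The sketch only shows that the same FOT entry may legitimately resolve to different object IDs on different nodes.

```python
from typing import Callable, Dict, Tuple

Resolver = Callable[[str], int]
FotEntry = Tuple[str, str]  # (resolver name, identifying information)


def static_id_resolver(info: str) -> int:
    """Interpret the identifying information as a literal hexadecimal object ID."""
    return int(info, 16)


class Node:
    """A system node with its own name table and resolver registry."""

    def __init__(self, names: Dict[str, int]):
        self.names = names
        self.resolvers: Dict[str, Resolver] = {
            "static": static_id_resolver,
            "name": self.resolve_name,
        }

    def resolve_name(self, info: str) -> int:
        """Look up a file-system-like name in this node's own table."""
        return self.names[info]

    def resolve(self, entry: FotEntry) -> int:
        """Run whichever resolver the FOT entry identifies."""
        kind, info = entry
        return self.resolvers[kind](info)


if __name__ == "__main__":
    node_a = Node({"/var/log/syslog": 0x1111})
    node_b = Node({"/var/log/syslog": 0x2222})
    entry = ("name", "/var/log/syslog")
    print(hex(node_a.resolve(entry)), hex(node_b.resolve(entry)))
```

A static entry such as `("static", "beef")` resolves identically on every node, while the name entry follows each node's local binding.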
While the first access to a foreign object may be relatively slow, subsequent accesses are very fast, since the resolution to an object ID is cached. The system maps the object into the node's "guest physical" address space, leveraging memory management unit (MMU) hardware already in use for virtualization. It then maps the guest physical space in which the object resides into the guest virtual space for any processes that reference the object, using extended page tables to remove software from the CPU load/store path and allowing the system to run at memory speed. This is necessary for efficiency; even minimal system software interaction on each load and store will slow the computation significantly.
Preliminary experiments3 show that Twizzler's approach is effective at preserving low-latency pointer dereferencing for both intra-object and inter-object invariant pointers. On an Intel Xeon Gold CPU running at 2.3GHz, intra-object pointer dereferences take about 0.4ns, approximately the same time as "normal" dereferences. Cached inter-object pointer dereferences take 3.2ns, somewhat slower than intra-object dereferences but still sufficiently fast because relatively few such references are expected, given multi-megabyte objects. The first reference to a foreign object is slower, at 28ns, but still reasonable. If name resolution is more complex than interpreting a static full-length (128-bit) object ID, it would be longer still; however, these penalties are paid only once, regardless of how many times pointers from object A to object B are dereferenced in the same process.
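The one-time nature of the first-reference penalty can be checked with simple arithmetic. The Python sketch below uses the 28ns and 3.2ns figures quoted above and assumes the simplest case, in which name resolution is just the interpretation of a static object ID; the function name is illustrative.

```python
FIRST_REF_NS = 28.0   # first reference to a foreign object (figure from the text)
CACHED_REF_NS = 3.2   # cached inter-object dereference (figure from the text)


def amortized_ns(dereferences: int) -> float:
    """Average cost per dereference when only the first one pays for resolution."""
    if dereferences < 1:
        raise ValueError("need at least one dereference")
    total = FIRST_REF_NS + CACHED_REF_NS * (dereferences - 1)
    return total / dereferences


if __name__ == "__main__":
    for n in (1, 10, 100, 10_000):
        print(n, round(amortized_ns(n), 3))
```

Under these assumptions the average cost approaches the cached figure quickly: after 100 dereferences from object A to object B it is already within about 0.25ns of 3.2ns.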
Benchmarks on both microscale (in-memory key/value store) and macroscale (Yahoo! Cloud Serving Benchmark, YCSB, using different back ends) likewise show excellent performance for this approach. The left graph in Figure 3 shows throughput of the YCSB benchmark on SQLite using four back ends: the native SQLite back end; the Lightning Memory-mapped Database (LMDB) back end, which leverages mmap; our implementation of a PMDK back end, which uses a red-black tree under PMDK; and Twizzler, which uses a red-black tree with the invariant pointer approach.
The invariant pointer approach outperforms every other approach while providing the flexibility of "run anywhere" invariant pointers. The graph on the right of Figure 3 similarly shows that these invariant pointers provide lower latency than other approaches because of the simplicity of the programming model and the low overhead for dereferencing pointers. PMDK, in particular, is significantly slower because its pointers are 128 bits long, requiring additional register space and memory operations to read and dereference.
It is important to note the PMDK and Twizzler implementations are running the same back-end code, with changes made only to accommodate the different programming models; this shows the benefit of using 64-bit pointers local to an object context rather than 128-bit pointers, as PMDK does.
Elephance MemOS is a fork of Twizzler being developed to run on CXL far memory devices. It will be ported and optimized for the systems-on-chip (SoCs) used as controllers in CXL-disaggregated memory nodes.
For software developers, what does memory disaggregation mean and how will systems be built around it? The architecture of such systems will aim to hide the details from the majority of programmers, so their code will not need to change to run on these new systems.
There are three ways in which systems can be built to provide disaggregated memory: application libraries, modification to the operating system's memory system, and changes beneath the operating system at the hardware layers, as seen in Figure 4. In the figure, a set of application servers is connected to a set of MemOS nodes over a shared bus. Pointer-rich application data in far memory lives on MemOS nodes. Pointers can be: inter-object and on the same device, inter-object across devices, or intra-object.
It is likely that the first way disaggregated memory will be made available is through application libraries linked directly into the application, seen at the top of Figure 5. The memory shim acts as a specialized memory allocator that knows how to handle remote memory using a memory access protocol (MAP). The MAP may depend on a current technology such as remote direct memory access (RDMA), or may be something newer such as CXL 3.
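The separation between the memory shim and the MAP can be sketched as an allocator layered over an abstract transport. The Python below is purely illustrative: the class names, the bump-allocation policy, and the loopback transport (a local byte array standing in for RDMA or CXL) are all assumptions of this sketch and do not correspond to any real shim API.

```python
from abc import ABC, abstractmethod


class MemoryAccessProtocol(ABC):
    """Hypothetical MAP interface: how bytes move to and from remote memory."""

    @abstractmethod
    def read(self, addr: int, length: int) -> bytes: ...

    @abstractmethod
    def write(self, addr: int, data: bytes) -> None: ...


class LoopbackMAP(MemoryAccessProtocol):
    """Stand-in transport backed by a local bytearray instead of RDMA or CXL."""

    def __init__(self, size: int):
        self.mem = bytearray(size)

    def read(self, addr: int, length: int) -> bytes:
        return bytes(self.mem[addr:addr + length])

    def write(self, addr: int, data: bytes) -> None:
        self.mem[addr:addr + len(data)] = data


class MemoryShim:
    """Simple bump allocator that hides the transport from the application."""

    def __init__(self, transport: MemoryAccessProtocol, capacity: int):
        self.transport = transport
        self.capacity = capacity
        self.next_free = 0

    def alloc(self, size: int) -> int:
        """Reserve `size` bytes of remote memory and return its address."""
        if size <= 0 or self.next_free + size > self.capacity:
            raise MemoryError("remote region exhausted")
        addr = self.next_free
        self.next_free += size
        return addr

    def store(self, addr: int, data: bytes) -> None:
        self.transport.write(addr, data)

    def load(self, addr: int, length: int) -> bytes:
        return self.transport.read(addr, length)
```

Swapping `LoopbackMAP` for a different transport would leave application code that calls `alloc`, `store`, and `load` unchanged, which is the property the shim is meant to provide.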
Many languages, such as Python, which depend on the C library for memory, will be able to use the memory shim to handle memory for objects in the language, freeing the Python programmer from having to know anything about disaggregated memory. For languages such as C and C++, which handle pointers directly, the programmer will have to work with the memory shim APIs in order to manage remote memory. The prevalence of Python and similar managed memory languages in big data and machine-learning applications means that programmers in those fields can use disaggregated memory in a transparent way, no matter where the memory shim is located in the software stack.
Extending the operating system's virtual memory system to integrate with the memory shim is the next logical place to interpose disaggregated memory in the stack, seen in (B) in Figure 5. Again, the specific MAP is not exposed to the kernel developer, only the memory shim APIs. The Linux operating system already has heterogeneous memory management (HMM),8 which is a natural place to slot in the memory shim. Once the shim is integrated into the operating system itself, all applications can use disaggregated memory transparently without modifications to their source code or linking with specialized libraries.
The deepest that far memory can be placed in the stack is in the hardware itself. Memory controllers integrated in CPUs from Intel and AMD are already starting to support early versions of CXL disaggregated memory. In the future, more featureful controllers will present memory to the operating system both locally and remotely in a transparent manner but, like the other two cases, will require a MAP to be interposed between the hardware and the remote memory. The protocol in this instance will be CXL 3. While putting the memory shim into hardware will likely result in the highest bandwidth, lowest latency, and maximum portability, there are reasons to continue to use a memory shim as a library linked into the software. First and foremost is the level of control that linking directly to the memory shim gives to the programmer. Once such functionality is embedded into the operating system or the memory controller, application programmers will lose control of and visibility into the remote memory system. While many will be happy not to have to manage memory on their own, there will remain applications for which such control is a feature. Novel memory architectures for distributed memory must first be tried in software, and some may be too specialized ever to be implemented in hardware.
Consider a memory system where pointers are globally invariant, which will be possible with MemOS but is not yet common in pointer-based systems. Building and debugging such a system in software makes it possible to rapidly iterate on the design—impossible in a memory controller and certainly more difficult to debug in the operating system. Applications that can use globally invariant pointers have distinct advantages because computation can take place on any node without the application having to know where a pointer might reside. Furthermore, it will be possible to move code, rather than data, to achieve computational efficiency—again, because no matter which compute node a pointer resides on, the pointer itself is the global handle that computation depends on, rather than an address in local memory, as things stand today.
Effectively exploiting emerging far-memory technology requires consideration of operating on richly connected data outside the context of the parent process. Operating-system technology in development offers help by exposing abstractions such as memory objects and globally invariant pointers that can be traversed by devices and newly instantiated compute. Such ideas will allow applications running on future heterogeneous distributed systems with disaggregated memory nodes to exploit near-memory processing for higher performance and to independently scale their memory and compute resources for lower cost.
1. Al Maruf, H. et al. TPP: Transparent page placement for CXL-enabled tiered-memory. In Proceedings of the 28th ACM Intern. Conf. Architectural Support for Programming Languages and Operating Systems 3, (2023), 742–755; https://dl.acm.org/doi/10.1145/3582016.3582063.
2. Berger, D.S. et al. Design trade-offs in CXL-based memory pools for public cloud platforms. IEEE Micro 43, 2 (2023), 30–38; https://dl.acm.org/doi/abs/10.1109/MM.2023.3241586.
3. Bittman, D. et al. Twizzler: a data-centric OS for non-volatile memory. In Proceedings of the 2020 Usenix Annual Tech. Conf.; https://dl.acm.org/doi/pdf/10.5555/3489146.3489151.
4. Dongarra, J. A not so simple matter of software. ACM Turing Award Lecture (2021); https://www.youtube.com/watch?v=cSO0Tc2w5Dg.
5. Duraisamy, P. et al. Towards an adaptable systems architecture for memory tiering at warehouse-scale. In Proceedings of the 28th ACM Intern. Conf. Architectural Support for Programming Languages and Operating Systems 3, (2023), 727–741; https://dl.acm.org/doi/10.1145/3582016.3582031.
6. Hsieh, K. et al. Accelerating pointer chasing in 3D-stacked memory: challenges, mechanisms, evaluation. In Proceedings of the IEEE 34th Intern. Conf. Computer Design. (2016), 25–32; https://ieeexplore.ieee.org/document/7753257.
7. Jennings, S. The zswap compressed swap cache. LWN.net. (2013); https://lwn.net/Articles/537422/.
8. Kernel Development Community. Heterogeneous memory management. The Linux Kernel 5.0.0; https://www.kernel.org/doc/html/v5.0/vm/hmm.html.
9. Mehra, P. and Coughlin, T. Taming memory with disaggregation. Computer 55, 9 (2022), 94–98; https://ieeexplore.ieee.org/document/9869614.
10. Michelogiannakis, G. et al. A case for intra-rack resource disaggregation in HPC. ACM Trans Architecture and Code Optimizations 19, 2 (2022), 1–26; https://dl.acm.org/doi/10.1145/3514245.
11. Rodrigues, A. et al. Towards a scatter-gather architecture: hardware and software issues. In Proceedings of the 2019 Intern. Symp. Memory Systems. 261–271; https://dl.acm.org/doi/10.1145/3357526.3357571.
12. Ruan, Z. et al. AIFM: High-performance, application-integrated far memory. In Proceedings of the 14th Usenix Symp. Operating Systems Design and Implementation. (2020); https://dl.acm.org/doi/pdf/10.5555/3488766.3488784.
13. Ruan, Z. et al. Nu: Achieving microsecond-scale resource fungibility with logical processes. In Proceedings of the 20th Usenix Symp. Networked Systems Design and Implementation. (2023); https://www.usenix.org/system/files/nsdi23-ruan.pdf.
14. Zhang, Q. et al. CompuCache: Remote computable caching using spot VMs. In Proceedings of the 12th Annual Conf. Innovative Data Systems Research. (2022); https://www.cidrdb.org/cidr2022/papers/p31-zhang.pdf.
15. Zhang, Q. et al. Optimizing data-intensive systems in disaggregated data centers with Teleport. In Proceedings of the 2022 Intern. Conf. Management of Data; https://dl.acm.org/doi/10.1145/3514221.3517856.
16. Zhou, Y. et al. Carbink: Fault-tolerant far memory. In Proceedings of the 16th Usenix Symp. Operating Systems Design and Implementation. (2022); https://www.usenix.org/system/files/osdi22-zhou-yang.pdf.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2023 ACM, Inc.