Warehouse-scale computers (WSCs) underpin all the cloud computing services we use daily, whether Web search, video streaming, social networks, or emerging AI chatbots and agents. The memory subsystem in these computers poses one of the biggest challenges in their design and operation: across the industry, Big Tech companies such as Amazon, Google, Meta, and Microsoft spend billions of dollars buying memory and consume hundreds of megawatts powering it. Sadly, this problem is only getting worse, exacerbated by the slowing of technology scaling trends (such as Moore's Law) and by exploding demand for more data, and correspondingly more memory, from workloads such as artificial intelligence (AI).
One approach to addressing the cost of memory is to use tiers. Most workloads have a working dataset that includes both hot (more frequently used) and cold (less frequently used) data. Assigning the hot data to fast but expensive memory while moving the colder data to a less expensive, albeit slower, memory tier can make a dramatic difference to total memory spending. There are multiple ways to create this second, less expensive tier: we could use less costly, slower storage media, such as flash-based solid-state devices, or we could keep the data in regular memory but compress it.
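To make the economics concrete, here is a back-of-the-envelope sketch in Python. Every number in it (tier prices, cold fraction) is a hypothetical assumption chosen for illustration, not a figure from the paper; the point is only that moving cold data to a cheaper tier cuts memory spend roughly in proportion to the cold fraction times the per-gigabyte price gap.

```python
# Back-of-the-envelope tiering economics. All numbers are hypothetical
# assumptions for illustration, not figures from the TMO paper.
dram_cost_per_gb = 3.0        # assumed $/GB for the fast tier (DRAM)
slow_tier_cost_per_gb = 1.0   # assumed $/GB for the slow tier (e.g., SSD)
cold_fraction = 0.30          # assumed share of resident data that is cold

# Relative saving on memory spend if all cold data moves to the slow tier.
saving = cold_fraction * (1 - slow_tier_cost_per_gb / dram_cost_per_gb)
print(f"Relative memory-cost saving: {saving:.0%}")  # 20% under these assumptions
```

Of course, this simple arithmetic ignores the performance cost of accessing the slower tier, which is exactly the trade-off the next questions address.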
However, to make this work, we have to answer some important questions. How do we move data between the different memory tiers so that the applications do not slow down too much? This is particularly challenging since, in the same way memory dominates costs, memory also dominates performance in WSCs. Even relatively small increases in average memory latency can lead to large performance slowdowns that wash out any cost savings. There are other challenges too: How do we move data transparently so that we do not need to change the plethora of workloads that run on these large cloud computing systems? And, most importantly, how do we do all this at warehouse scale, across heterogeneous workloads and diverse memory technologies?
The accompanying paper, "TMO: Transparent Memory Offloading in Datacenters," addresses these questions. Focusing on an application-transparent, kernel-driven approach, TMO answers the two underlying questions of memory-tier management needed to meet these objectives: when to offload (and how much), and what memory to offload. The paper shows that its answers to these questions proved highly effective in a real-world, at-scale deployment at Meta.
For the first question, the paper introduces a new metric called pressure stall information (PSI), tracked at the kernel level. Unlike prior approaches that rely on proxy metrics such as page-fault rates, PSI tracks the time a job spends not making progress because of resource constraints. This gives a much better picture of lost work and of the actual impact on application performance. A user-space agent ("Senpai") interfaces with the kernel's reclaim mechanism and dynamically adjusts memory-tier provisioning, using PSI as an accurate way to understand and react to application impact. This is an elegant solution to the otherwise gnarly problem of relying on traditional, fragile, and imperfect proxies for memory pressure and performance impact. For the second question, what memory to offload, TMO is more aggressive about using swap-backed memory, relative to the kernel's traditional proclivity toward the file cache, and about considering the memory usage of both sidecar and application containers.
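To make the PSI-driven control loop concrete, the following is a minimal, simplified sketch of a Senpai-style agent, not Meta's production implementation. It assumes a cgroup v2 host; the cgroup path, PSI target, adjustment ratios, and polling interval are all illustrative assumptions. The open-sourced Senpai is considerably more careful about smoothing, limits, and error handling.

```python
# Simplified, hypothetical sketch of a Senpai-style PSI feedback loop.
# Requires cgroup v2 and sufficient privileges; all constants are assumptions.
import time

CGROUP = "/sys/fs/cgroup/example.slice"  # hypothetical container cgroup
PSI_TARGET = 0.1       # target "full" memory stall (%) over the last 10s
SHRINK_RATIO = 0.98    # tighten memory.high by 2% when pressure is low
BACKOFF_RATIO = 1.05   # relax memory.high by 5% when pressure is high

def read_full_avg10(cgroup: str) -> float:
    # "full" counts time when all non-idle tasks in the cgroup were stalled
    # on memory; "some" would count time when at least one task was stalled.
    with open(f"{cgroup}/memory.pressure") as f:
        for line in f:
            if line.startswith("full"):
                fields = dict(kv.split("=") for kv in line.split()[1:])
                return float(fields["avg10"])
    return 0.0

def read_current(cgroup: str) -> int:
    with open(f"{cgroup}/memory.current") as f:
        return int(f.read())

def write_high(cgroup: str, limit_bytes: int) -> None:
    with open(f"{cgroup}/memory.high", "w") as f:
        f.write(str(limit_bytes))

def control_loop() -> None:
    while True:
        pressure = read_full_avg10(CGROUP)
        current = read_current(CGROUP)
        if pressure < PSI_TARGET:
            # Little lost work observed: probe for savings by tightening the
            # limit, nudging the kernel to reclaim (offload) cold pages.
            write_high(CGROUP, int(current * SHRINK_RATIO))
        else:
            # Stalls exceed the target: back off so the workload recovers.
            write_high(CGROUP, int(current * BACKOFF_RATIO))
        time.sleep(6)

if __name__ == "__main__":
    control_loop()
```

The sketch only conveys the shape of the feedback loop: probe downward while measured stall time stays below a target, back off when it does not. The production agent manages many containers concurrently and reclaims memory through dedicated kernel interfaces rather than a single crude limit.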
Beyond these high-level contributions, the paper is chock-full of nice engineering details about the implementation. And for the really interested reader, both PSI and Senpai have been upstreamed and open sourced, so you can look through the code yourself. The paper also presents a nice evaluation of TMO, deployed across millions of servers in Meta's datacenters. The data shows impressive results, both in memory savings and in seamlessly managing heterogeneous memory tiers, from compressed memory to SSD back ends.
While the paper is a great read for WSC designers and other practitioners in the field, it is also likely to appeal to researchers exploring additional opportunities in this space. In particular, the paper sets up some very interesting questions about how TMO and similar designs can be extended to alternative memories, such as CXL-attached memory devices or new memory and flash technologies. The insights from TMO could also apply to the memory hierarchies of emerging AI accelerators, including the challenges posed by the physical limits of the high-bandwidth memories used in such systems. Similarly, going beyond the paper's emphasis on cost savings, there are rich opportunities to consider more complex objective functions spanning performance, power, reliability, and environmental sustainability. Automation, for example using machine learning techniques to manage such complex optimizations across the multitude of memory options (going beyond tiering to segmenting), looks particularly promising.
One thing is clear: we are entering an era of important innovation in the memory/storage hierarchy of warehouse-scale computers, driven by both opportunity and challenge. The design and evaluation in this paper provide an excellent foundation for more rigorous discussion and research across the broader community.