1. Introduction
The popularity of in-memory databases and in-memory computing catalyzes ever-increasing demands for memory in modern datacenters. However, datacenters today suffer from low memory utilization (< 65%),6,15 which results from imbalanced memory usage across a sea of servers. In response, academia and industry are working toward a new hardware architecture called memory disaggregation, where CPU and memory are physically separated into two network-attached components—compute servers and memory servers.15,18,22 With memory disaggregation, CPU and memory can scale independently and different applications can share a global memory pool efficiently.
Since almost all CPU resources are assembled on compute servers under the memory disaggregation architecture, memory servers have near-zero computation power, which highlights the challenge of how compute servers access disaggregated memory residing on memory servers. Fortunately, remote direct memory access (RDMA), a fast network technique, allows compute servers to directly access disaggregated memory unmediated by memory servers’ computation power with low latency, becoming an essential building block of memory disaggregation architecture.15,18,22
In this work, we explore how to design a high-performance tree index, a key pillar of datacenter systems, on disaggregated memory. We first revisit existing RDMA-based tree indexes and examine their applicability on disaggregated memory. Several RDMA-based tree indexes rely on remote procedure calls (RPCs) to handle write operations;12,14 they are ill-suited for disaggregated memory due to the near-zero computation power of memory servers. Tree indexes that can be deployed on disaggregated memory also have critical limitations. Some indexes use one-sided RDMA verbs for all index operations25 (we call this the one-sided approach); they can deliver high performance for read operations but suffer from low throughput and high latency for write operations, especially in high-contention scenarios. Other indexes bake write operations into SmartNICs or customized hardware,1 which brings high total cost of ownership (TCO) and cannot be deployed in datacenters on a large scale immediately.
Our goal is designing a tree index on disaggregated memory that can deliver high performance for both read and write operations with commodity RDMA NICs. To this end, we further analyze what makes the one-sided approach inefficient in write operations, and identify three major causes. First, due to the limited semantics of one-sided RDMA verbs, modifying an index node (for example, a tree node in a B+Tree) always requires multiple round trips (that is, lock, read, write, and unlock), inducing high latency and further making conflicting requests more likely to be blocked. Second, the locks used for resolving write-write conflicts are slow and experience performance collapse under high-contention scenarios. This is because at the hardware level, NICs adopt expensive concurrency control to ensure atomicity between RDMA atomic commands, where each command needs two PCIe transactions. At the software level, such locks often trigger unnecessary retries, which consume RDMA IOPS, and do not provide fairness, which leads to high tail latency. Third, the layout of the index data structure incurs severe write amplification. Due to coarse-grained consistency check mechanisms (for example, using a checksum to protect a whole tree node), a small piece of modification will result in a large-sized write-back across the network.
Motivated by the above analysis, we propose Sherman, a write-optimized distributed index on disaggregated memory. The key idea of Sherman is combining RDMA hardware features and RDMA-friendly software techniques to reduce round trips, accelerate lock operations, and mitigate write amplification. Sherman spreads nodes across a set of memory servers, and compute servers perform all index operations via RDMA one-sided verbs purely. Sherman uses a classic approach for concurrency control: lock-free search with versions to resolve read-write conflicts and exclusive locks to resolve write-write conflicts.
To reduce round trips, Sherman introduces a command combination technique. Based on the observation that commodity RDMA NICs already provide in-order delivery property, this technique allows client threads to issue dependent RDMA commands (for example, write-back and lock release) simultaneously, letting NICs at memory servers reflect them into disaggregated memory in order.
To accelerate lock operations, we design a hierarchical on-chip lock (HOCL) for Sherman. HOCL is structured into two parts: global lock tables on memory servers and local lock tables on compute servers. Global lock tables and local lock tables coordinate conflicting lock requests between compute servers and within a compute server, respectively. Global lock tables are stored in the on-chip memory of RDMA NICs, thus eliminating PCIe transactions of memory servers and further delivering extremely high throughput for RDMA atomic commands (110Mops). Within a compute server, before trying to acquire a global lock on memory servers, a thread must acquire the associated local lock to avoid a large amount of unnecessary remote retries. Moreover, by adopting wait queues, local lock tables improve fairness between conflicting lock requests. Based on local lock tables, a thread can hand its acquired lock over to another thread directly, reducing at least one round trip for acquiring remote global locks.
To mitigate write amplification, Sherman tailors the leaf node layout of its B+Tree. First, entries in leaf nodes are unsorted to eschew shift operations upon insertion/deletion. Second, to support lock-free search while avoiding write amplification, we introduce a two-level version mechanism. In addition to using a pair of node-level versions to detect the inconsistency of the whole leaf node, we embed a pair of entry-level versions into each entry, which ensures entry-level integrity. For insertion/deletion operations without split/merging events, only entry-sized data is written back, thus saving network bandwidth and making the most of the extremely high IOPS of small RDMA messages.
To demonstrate its efficacy, we evaluate Sherman using a set of benchmarks. Under write-intensive workloads, Sherman achieves much better performance than FG,25 a state-of-the-art distributed B+Tree supporting disaggregated memory. Specifically, in common skewed workloads, Sherman delivers one order of magnitude performance improvement in terms of both throughput and 99th percentile latency. For read-intensive workloads (that is, 95% read operations), Sherman exhibits slightly higher throughput with 25% lower 99th percentile latency.
This work makes three main contributions. First, we analyze existing RDMA-based tree indexes on disaggregated memory (§3). Second, we design and implement Sherman, a write-optimized index on disaggregated memory (§4). Finally, we use a set of evaluations to demonstrate the high performance of Sherman (§5).
2. Background
In this section, we briefly provide background on memory disaggregation and RDMA networks.
2.1 Memory disaggregation.
Traditional datacenters pack CPU and memory into the same hardware units (that is, monolithic servers), leading to low memory utilization (< 65%)6,15 and further increasing the TCO of datacenters. To attack this problem, academia and industry are exploring a new hardware architecture called memory disaggregation.15,18,22 In such an architecture, CPU and memory are physically separated into two different hardware units: compute servers (CSs) and memory servers (MSs). Compute servers own a mass of CPU cores (tens to hundreds), but MSs host high-volume memory (hundreds to thousands of gigabytes) with near-zero computation power. CPUs in CSs can directly access the disaggregated memory in MSs via high-speed networks (for example, RDMA). In this article, we focus on RDMA-enabled memory disaggregation, since RDMA is currently widely deployed and more mature than other memory-semantic networks such as CXL. With memory disaggregation, CPU and memory can scale independently and applications can pack resources in a more flexible manner, boosting resource utilization significantly.
To reduce remote accesses from CSs to MSs, CSs are always equipped with a small piece of memory as the local cache (one to ten gigabytes). Moreover, MSs own a small set of wimpy CPU cores (one to two) to support lightweight management tasks, such as network connection management.
2.2 RDMA network.
RDMA network is the key enabler of the memory disaggregation architecture. It provides two types of verbs, namely two-sided verbs and one-sided verbs, to applications. Two-sided verbs—RDMA_SEND and RDMA_RECV—are the same as the traditional Linux socket interface. One-sided verbs, that is, RDMA_WRITE, RDMA_READ, and RDMA_ATOMIC (RDMA_FAA and RDMA_CAS), operate directly on remote memory without involving the CPUs of receivers. The direct-access feature of one-sided verbs makes memory servers with near-zero computation power possible.
RDMA hosts communicate via queue pairs (QPs): A QP consists of a send queue and a receive queue. A sender performs an RDMA command by posting the request to the send queue. RDMA supports three transport types: reliable connected (RC), unreliable connected (UC), and unreliable datagram (UD). Sherman uses RC, since it supports all one-sided verbs and is reliable.
3. Motivation
Designing a fast distributed tree index on disaggregated memory poses unique challenges. In this section, we first revisit two existing approaches—using one-sided verbs purely and extending RDMA interfaces—and reveal their respective issues. Then, we analyze why using one-sided verbs is slow, which motivates the design of Sherman.
3.1 Existing approaches.
With near-zero computation power at the MS side, we cannot delegate index operations to CPUs of MSs via remote procedure calls (RPCs), which is the main difference between memory disaggregation and traditional architectures. Existing work enables efficient lookup operations via lock-free search and caching mechanisms,4,12 leading to a single RDMA_READ in the ideal situation (that is, cache hit). However, write operations are in a quandary due to their complex semantics; specifically, there are two avenues to design write operations, each of which has its own issues.
Approach #1: Using one-sided verbs purely. FG,25 an RDMA-based B+Tree, completely leverages one-sided verbs to perform index write operations (so it can be deployed on disaggregated memory). For write operations, it adopts a lock-based approach, where tree node modification is protected by RDMA-based spinning locks (RDMA_CAS for acquisition and RDMA_FAA for release). For read operations, FG follows a lock-free search scheme, where threads fetch tree nodes via RDMA_READ without holding locks and then check the consistency using checksums. We conduct an experiment to evaluate FG's performance. Since FG is not open source, we implement it from scratch; we also cache internal tree nodes to reduce remote accesses. Table 1 shows the result, and we make two observations. First, in the read-intensive workload (95% lookup and 5% insert/update), FG can achieve high throughput (> 30Mops) and low 99th percentile latency (< 16μs). Second, in the write-intensive workload (50% lookup and 50% insert/update), FG delivers 18Mops with 19μs tail latency under the uniform setting; yet, its performance collapses under the skewed setting, where key popularity follows a Zipfian distribution with skewness 0.99: throughput is only 0.34Mops and tail latency is close to 20ms.
Table 1. Performance of FG under read-intensive and write-intensive workloads.

|                   |      | Read-intensive |      | Write-intensive |       |
|-------------------|------|----------------|------|-----------------|-------|
|                   |      | Uniform        | Skew | Uniform         | Skew  |
| Throughput (Mops) |      | 31.8           | 32.9 | 18.7            | 0.34  |
| Latency (μs)      | 50th | 4.9            | 4.7  | 9.5             | 10    |
|                   | 90th | 6.4            | 6.2  | 14.3            | 68.7  |
|                   | 99th | 14.9           | 15.3 | 19              | 19890 |
Approach #2: Extending RDMA interfaces. Another approach to designing indexes on disaggregated memory is extending RDMA interfaces. This approach offloads index write operations into memory servers' NICs via SmartNICs or other customized hardware.1,16 For example, by exploiting interface extensions (that is, indirect addressing and notification), researchers propose HT-Tree (without implementation),1 a hybrid index combining the hash table and B-Tree. Yet, compared with commodity RDMA NICs, SmartNICs come at the price of higher TCO and lower performance. More specifically, SmartNICs have a much higher market price (5×) than commodity RDMA NICs at present.18 As for performance, in a 100Gbps network environment, StRoM,16 a state-of-the-art RDMA extension using FPGA, has 2× higher round-trip latency (4μs) than commodity RDMA NICs (2μs).
3.2 Why is the one-sided approach slow?
The inefficiency of write operations using one-sided verbs stems from excessive round trips, slow synchronization primitives, and write amplification.
3.2.1 Excessive round trips.
The most obvious cause of slow write operations is excessive round trips. For example, when modifying a tree node (not considering node split/merging), a client thread needs four round trips: acquiring the associated exclusive lock, reading the tree node, writing back the modified tree node, and finally releasing the lock. Excessive round trips negatively impact write performance in two aspects. First, the latency of a single write operation is proportional to the number of round trips. Second, more round trips lead to a longer critical path, so conflicting write operations (that is, requests targeting the same tree nodes) are more likely to be blocked, degrading concurrency performance.
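As a concrete illustration, the sketch below shows this naive write path in C++; the rdma_cas/rdma_read/rdma_write helpers are hypothetical blocking wrappers (not any real library's API), each standing for one one-sided verb and one network round trip.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical blocking helpers: each issues one one-sided verb on an RC QP
// and waits for its completion, that is, one network round trip per call.
bool rdma_cas(uint64_t raddr, uint64_t expect, uint64_t desired);
void rdma_read(uint64_t raddr, void *buf, size_t len);
void rdma_write(uint64_t raddr, const void *buf, size_t len);

struct TreeNode { /* keys, values, metadata ... */ };

// Naive lock-based update of one tree node: four dependent round trips.
void naive_node_update(uint64_t lock_addr, uint64_t node_addr,
                       void (*modify)(TreeNode &)) {
  while (!rdma_cas(lock_addr, /*expect=*/0, /*desired=*/1)) {
  }                                                    // 1: acquire exclusive lock (may retry)
  TreeNode node;
  rdma_read(node_addr, &node, sizeof(node));           // 2: fetch the node
  modify(node);                                        //    local modification
  rdma_write(node_addr, &node, sizeof(node));          // 3: write back the whole node
  const uint64_t unlocked = 0;
  rdma_write(lock_addr, &unlocked, sizeof(unlocked));  // 4: release the lock
}
```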
3.2.2 Slow synchronization primitives.
RDMA-based locks used in these indexes cannot provide sustainable performance under a variety of workloads. We conduct an experiment to demonstrate it. In the experiment, 154 threads across seven CSs acquire/release 10,240 locks residing in an MS; we use the RDMA verbs chosen by FG,25 that is, RDMA_CAS for lock acquisition and RDMA_FAA for release. The access pattern follows a Zipfian distribution, and we vary the Zipfian parameter to control the contention degree (0 is the uniform setting and 0.99 is the most common real-world scenario). Figure 2 shows the result: the system experiences performance collapse under high-contention settings.
The main reason behind the performance collapse is expensive in-NIC concurrency control. To guarantee the correct atomicity semantics between RDMA_ATOMIC commands targeting the same addresses, RDMA NICs adopt an internal locking scheme.8 More specifically, a NIC maintains a certain number of buckets (for example, 4,096) and puts RDMA_ATOMIC commands whose destination addresses share certain bits (for example, the 12 LSBs) into the same bucket. Commands in the same bucket are considered conflicting. An RDMA_ATOMIC cannot be executed until the previous conflicting commands have finished. Unfortunately, a single RDMA_ATOMIC needs two PCIe transactions: fetching data from CPU memory into the NIC and writing it back after modification (the write-back can be omitted for a failed RDMA_CAS). These PCIe transactions stretch the queueing time of conflicting RDMA_ATOMIC commands and thus degrade concurrency performance, especially in high-contention workloads.
Moreover, in the software design, existing locks in RDMA-based indexes blindly retry when conflicts appear, squandering the limited throughput of NICs in both CSs and MSs. In addition, they do not guarantee fairness among conflicting lock acquisitions, thus causing starvation and high tail latency.
3.2.3 Write amplification.
Existing RDMA-based indexes always trigger large-sized RDMA_WRITE commands, suffering from write amplification. This issue stems from two causes. First, a B+Tree keeps entries in each tree node sorted. When an entry is inserted into or deleted from a node, all the entries to the right of the insertion/deletion position need to be shifted. The shift operations cause extra data to be written via RDMA_WRITE.

Second, the two existing consistency check mechanisms, which are proposed to detect concurrent writes for lock-free lookup, induce write amplification. In the first mechanism, each tree node includes a checksum covering the whole node area (except the checksum itself);25 the checksum is recalculated when modifying the associated node and is verified when reading the node. The other mechanism, namely version-based consistency check, stores a version number at the start and end of each node;12 when modifying a node via RDMA_WRITE, the two associated versions are incremented; a node's content obtained via RDMA_READ is consistent only when the two versions are the same. Since the granularity of the above two mechanisms is the tree node, any modification to part of the node area requires writing back the whole node (including the metadata, for example, checksum and versions), leading to severe write amplification.
4. Design
Motivated by our observations about root causes of inefficient write operations (§3.2), we design Sherman. In this section, we begin by presenting our design principles (§4.1), provide an overview of Sherman (§4.2), and then describe our key techniques (§4.3-§4.5).
4.1 Design principles.
1) Seeking solutions from hardware first. There are unexploited features of RDMA NICs that pose opportunities for index design. In Sherman, we use on-chip memory exposed by NICs to accelerate lock operations, and in-order delivery guarantee of RC QP to reduce round trips.
2) Applying CS-side optimization when possible. In the current multicore era, each compute server (CS) usually launches a mass of client threads (10s–100s) to manipulate an index concurrently, leaving a lot of design space for CS-side optimization. Within a CS, we leverage local lock tables to coordinate conflicting lock requests between threads, and locks can be handed over from one thread to another quickly. Besides, Sherman maintains a CS-side index cache to reduce network accesses during tree traversal.
3) Tailoring data-structure layout to improve RDMA friendliness. Disaggregated memory has a unique profile: It is accessed via a network (that is, RDMA) rather than a cache-coherent memory bus. Thus, it is necessary to tailor data-structure layout of indexes to reap RDMA's full potential. To this end, Sherman separates locks from the tree structure to put them into NICs' on-chip memory, and adopts entry-level versions as well as unsorted leaf nodes to reduce the IO size of RDMA_WRITE commands.
4.2 Overview.
Figure 3 shows the overall architecture and interactions of Sherman, which consists of a set of MSs and CSs. MSs are equipped with massive memory where the Sherman tree resides. CSs run client threads that manipulate Sherman via specific interfaces (that is, lookup, range query, insert, and delete). Since the number of CPU cores continues to increase, we assume there are a host of client threads running within a CS (dozens or even hundreds). These client threads cooperatively execute system services (for example, RDMA-based transaction processing), which need Sherman for data indexing.

B+Tree structure. Sherman is a B+Tree, where values are stored in leaf nodes. We record a sibling pointer for every leaf node and internal node, as in the B-link tree. Client threads can always reach a targeted node by following these sibling pointers in the presence of node split/merging, thus supporting concurrent operations efficiently. Every pointer in Sherman (that is, child/sibling pointers) is 64bit, which includes two parts: a 16bit MS unique identifier and a 48bit memory address within the corresponding MS.
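The pointer encoding can be sketched as follows; the exact bit placement (MS identifier in the upper 16 bits, address in the lower 48 bits) is an assumption for illustration.

```cpp
#include <cstdint>

// Global pointer: 16bit memory-server (MS) identifier plus 48bit address
// within that MS. Which half holds which field is assumed for illustration.
struct GlobalPtr {
  uint64_t raw;

  static GlobalPtr make(uint16_t ms_id, uint64_t addr) {
    return GlobalPtr{(static_cast<uint64_t>(ms_id) << 48) |
                     (addr & ((1ULL << 48) - 1))};
  }
  uint16_t ms_id() const { return static_cast<uint16_t>(raw >> 48); }
  uint64_t addr() const { return raw & ((1ULL << 48) - 1); }
};
```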
Concurrency control. Sherman adopts lock-based write operations and lock-free read operations.
Write-write conflicts. Sherman uses node-grained exclusive locks to resolve write-write conflicts: Before modifying a tree node, the client thread must acquire the associated exclusive lock. These exclusive locks are hierarchical: local lock tables at CS-side and global lock tables in on-chip memory of MSs’ NICs (§4.3). Such a hierarchical structure provides high concurrency performance.
Read-write conflicts. Sherman supports lock-free search, which leverages RDMA_READ to fetch data residing in MSs without holding any lock. Moreover, Sherman uses versions to detect inconsistent data caused by concurrent writes. However, different from traditional mechanisms that use node-level versions, Sherman proposes a two-level version mechanism, which combines entry-level and node-level versions, to mitigate write amplification (§4.4).
Cache mechanism. To reduce remote accesses in the tree traversal, Sherman adopts a cache mechanism. Each CS maintains an index cache, which only holds copies of two types of internal nodes: nodes directly above the leaf nodes (level 1 in Figure 3), and the highest two levels of nodes (including the root). A client thread first searches the cache of the first type. On a hit, it fetches the targeted leaf node directly from MSs; otherwise, it searches the cache of the second type and then traverses to the leaf nodes via remote accesses. The index cache never induces data inconsistency issues; detailed reasons are available in our original SIGMOD paper.20
Memory management. In each MS, we reserve a dedicated memory thread to manage disaggregated memory. The memory thread divides the memory residing in the corresponding MS into fixed-length chunks (that is, 8MB). Client threads use a two-stage memory allocation scheme to obtain memory from MSs. A client thread first chooses an MS in a round-robin manner, and requests a free chunk from the MS's memory thread via RPCs. Then, it allocates memory space for tree nodes locally within the chunk.
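A minimal sketch of this two-stage scheme is shown below; it reuses the GlobalPtr sketch above, and rpc_alloc_chunk as well as the constants are hypothetical illustrations rather than Sherman's actual interfaces.

```cpp
#include <cstdint>
#include <vector>

constexpr uint64_t kChunkSize = 8ULL << 20;  // 8MB chunks handed out by MSs
constexpr uint64_t kNodeSize = 1024;         // 1KB tree nodes

struct Chunk {
  uint16_t ms_id;
  uint64_t base;  // start address of the chunk within the MS
  uint64_t used;  // bytes already carved out locally
};

class NodeAllocator {
 public:
  GlobalPtr alloc_node() {
    if (chunks_.empty() || chunks_.back().used + kNodeSize > kChunkSize) {
      uint16_t ms = static_cast<uint16_t>(next_ms_++ % num_ms_);  // stage 1: pick an MS round-robin
      chunks_.push_back({ms, rpc_alloc_chunk(ms), 0});            //          obtain a chunk via RPC
    }
    Chunk &c = chunks_.back();                                    // stage 2: local bump allocation
    GlobalPtr p = GlobalPtr::make(c.ms_id, c.base + c.used);
    c.used += kNodeSize;
    return p;
  }

 private:
  uint64_t rpc_alloc_chunk(uint16_t ms_id);  // hypothetical RPC to the MS's memory thread
  std::vector<Chunk> chunks_;
  uint32_t next_ms_ = 0;
  uint32_t num_ms_ = 8;  // number of MSs (illustrative)
};
```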
4.3 Hierarchical on-chip lock.
Sherman proposes the hierarchical on-chip lock (HOCL) to improve concurrency performance. HOCL leverages the on-chip memory of NICs to avoid PCIe transactions at the MS side; it also maintains local locks at the CS side to form a hierarchical structure, reduce retries, and improve fairness. Moreover, locks can be handed over between client threads within the same CS, thus saving at least one round trip.
On-chip lock table. Current RDMA verbs support device memory programming. Specifically, an RDMA NIC can expose a piece of its on-chip memory (SRAM) to upper applications, which can be allocated and read/written by RDMA commands. The on-chip memory eliminates PCIe transactions at the receiver side, thus providing extremely high throughput (110Mops for RDMA_CAS). Sherman separates locks from tree nodes and stores the locks in on-chip memory at the MS side; each tree node is in the same MS as the lock protecting it. The locks in each MS are structured as an array, namely the global lock table (GLT). When locking a tree node, the client thread first hashes the address of the tree node into a position number in the associated GLT and then issues an RDMA_CAS command to the lock, which tries to change it from 0 to the 16bit CS identifier atomically. For lock release, the thread clears the lock via an RDMA_WRITE.
An important consideration of the GLT is the on-chip memory size. In the NIC we use (ConnectX-5), 256KB of on-chip memory is available. To accommodate more locks, we make the granularity of RDMA_CAS finer (16bit rather than 64bit) by applying an infrequently used RDMA verb called masked compare-and-swap, which allows us to select a portion of the 64bit word for RDMA_CAS. Thus, an MS can maintain 131,072 locks in its GLT, enabling extremely high concurrency, particularly considering that we lock at most one tree node at a time for a single write operation. To the best of our knowledge, Sherman is the first RDMA-based system that leverages the on-chip memory of RDMA NICs.
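The sketch below illustrates the GLT protocol described above; rdma_masked_cas16 and rdma_write16 are hypothetical wrappers (the former would map to the masked compare-and-swap verb), and the hash function is an assumed choice.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical one-sided wrappers targeting the GLT in the NIC's on-chip
// memory of a given MS: a 16bit masked compare-and-swap and a small write.
bool rdma_masked_cas16(uint16_t ms_id, uint64_t offset, uint16_t expect, uint16_t desired);
void rdma_write16(uint16_t ms_id, uint64_t offset, uint16_t value);

constexpr uint64_t kGltEntries = 131072;  // 256KB on-chip memory / 2B per lock

// Map a tree node (by its address within its MS) to a GLT slot.
uint64_t glt_slot(uint64_t node_addr) {
  return std::hash<uint64_t>{}(node_addr) % kGltEntries;
}

// Acquire the global lock: CAS the 16bit slot from 0 to this CS's identifier.
void lock_global(uint16_t ms_id, uint64_t node_addr, uint16_t cs_id) {
  uint64_t offset = glt_slot(node_addr) * sizeof(uint16_t);
  while (!rdma_masked_cas16(ms_id, offset, /*expect=*/0, /*desired=*/cs_id)) {
    // another CS holds the lock; retry
  }
}

// Release the global lock by clearing the slot with a small RDMA_WRITE.
void unlock_global(uint16_t ms_id, uint64_t node_addr) {
  rdma_write16(ms_id, glt_slot(node_addr) * sizeof(uint16_t), 0);
}
```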
Hierarchical structure. Sherman maintains a local lock table (LLT) in each CS to coordinate conflicting lock requests within the same CS. The LLT stores a local lock for each lock of all GLTs. When a thread needs to lock a tree node, it first acquires the associated local lock in the LLT, and then acquires the associated lock in the GLT; thus, conflicting lock requests from the same CS are queued on the LLT at the CS side, avoiding unnecessary remote retries and thus saving RDMA IOPS. Moreover, each local lock in the LLT is associated with a wait queue. A thread that cannot acquire a local lock in the LLT pushes itself into the corresponding queue; the thread can learn whether its turn has arrived by checking whether it is at the head of the queue. The queue provides first-come first-served fairness among threads within the same CS. For lock release, the thread first releases the lock in the GLT and then the local lock in the LLT.
Handover mechanism. The hierarchical structure of HOCL enables a handover mechanism: handing over a lock from one client thread to another. When releasing a lock, if a thread finds that the lock's wait queue is not empty, it hands the lock over to the thread at the head of the wait queue. To avoid starving threads at other CSs, we limit the maximum number of consecutive handovers to four. A thread that receives a lock via handover no longer needs remote accesses to acquire it, thus saving at least one round trip.
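The following sketch combines the local lock table, wait queue, and handover logic; it is a simplified single-lock view (busy-waiting instead of condition variables, and races between queue checks are ignored) and reuses lock_global/unlock_global from the previous sketch.

```cpp
#include <cstdint>
#include <deque>
#include <mutex>

// Hypothetical LLT entry: one per GLT slot, shared by threads of the same CS.
struct LocalLock {
  std::mutex mu;
  std::deque<uint64_t> waiters;  // waiting thread ids, FIFO for fairness
  bool global_held = false;      // whether this CS currently holds the GLT lock
  int handovers = 0;             // consecutive handovers, bounded to 4
};

void lock_node(LocalLock &l, uint16_t ms_id, uint64_t node_addr,
               uint16_t cs_id, uint64_t tid) {
  { std::lock_guard<std::mutex> g(l.mu); l.waiters.push_back(tid); }
  for (;;) {                             // wait until we reach the queue head
    std::lock_guard<std::mutex> g(l.mu);
    if (l.waiters.front() == tid) break;
  }
  if (!l.global_held) {                  // not handed over: acquire remotely
    lock_global(ms_id, node_addr, cs_id);
    l.global_held = true;
  }
}

void unlock_node(LocalLock &l, uint16_t ms_id, uint64_t node_addr) {
  std::lock_guard<std::mutex> g(l.mu);
  l.waiters.pop_front();
  if (!l.waiters.empty() && l.handovers < 4) {
    l.handovers++;                       // keep the GLT lock; the head waiter inherits it
  } else {
    unlock_global(ms_id, node_addr);     // let other CSs compete (avoid starvation)
    l.global_held = false;
    l.handovers = 0;
  }
}
```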
4.4 Two-level version.
To address the write amplification issue, Sherman incorporates a two-level version mechanism. First, Sherman uses unsorted leaf nodes so that shift operations upon insertion/deletion can be avoided. Unsorted leaf nodes complicate index operations in two aspects: (i) when looking up a key, the client thread needs to traverse the entire targeted leaf node; (ii) before splitting a leaf node, the client thread must sort the entries in it. Given the microsecond-level network latency, the added overhead is slight.
Second, Sherman introduces entry-level versions to enable fine-grained consistency checks, as shown in Figure 4. Specifically, in leaf nodes, each entry is surrounded by a pair of 4bit entry-level versions (that is, FEV and REV). In case of insertion without splitting, the associated entry-level versions are incremented and only the modified entry (including its FEV and REV) is written back via RDMA_WRITE, thus evading write amplification. Also, a pair of 4bit node-level versions (that is, FNV and RNV) is stored at the beginning and end of each leaf node, protecting consistency at the node granularity. When splitting a leaf node, the client thread increments the associated FNV and RNV, and writes back the whole node via RDMA_WRITE. Since internal nodes have a much lower modification frequency than leaf nodes, their format is standard: two node-level versions with a sorted layout. The extra memory space occupied by entry-level versions is acceptable: considering a B+Tree storing 8byte keys and 8byte values, each key-value pair at leaf nodes needs an extra 1byte of memory for entry-level versions, consuming about 6% of the memory of all leaf nodes.
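An illustrative leaf-node layout is sketched below; the field widths follow the text, but the entry capacity, packing, and padding are assumptions (a real layout would pack the 4bit versions more tightly than bit-fields do here).

```cpp
#include <cstdint>

constexpr int kLeafEntries = 32;  // illustrative capacity for a 1KB node

// Each entry carries a front/rear entry-level version (FEV/REV); the node
// carries a front/rear node-level version (FNV/RNV). Entries are unsorted.
struct LeafEntry {
  uint8_t fev : 4;  // front entry-level version
  uint64_t key;     // 8byte key
  uint64_t value;   // 8byte value
  uint8_t rev : 4;  // rear entry-level version
};

struct LeafNode {
  uint8_t fnv : 4;                  // front node-level version
  LeafEntry entries[kLeafEntries];  // unsorted slots; insert into any free one
  uint64_t sibling_ptr;             // B-link-style sibling pointer
  uint8_t rnv : 4;                  // rear node-level version
};
```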
Lookup operation. The two-level version mechanism makes the lookup operation different from the standard one. After reading a leaf node from MSs, a client thread first compares the two node-level versions; mismatched versions indicate the read must be retried. Then, the client thread locates the targeted entry and compares the two associated entry-level versions; if the comparison fails, the client thread re-reads the leaf node via RDMA_READ. Wraparounds of these 4bit versions may cause inconsistency to go undetected: the two versions match but one has wrapped around. To handle wraparounds, we measure the time of each RDMA_READ command: if it takes more than 8μs (that is, 2⁴ × the time of a single write operation), a retry is needed.
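A sketch of the resulting lookup check, using the leaf layout above; read_leaf is a hypothetical wrapper around one RDMA_READ, and the 8μs retry threshold follows the text.

```cpp
#include <chrono>
#include <cstdint>
#include <optional>

LeafNode read_leaf(GlobalPtr leaf);  // hypothetical: one RDMA_READ of the node

std::optional<uint64_t> lookup(GlobalPtr leaf, uint64_t key) {
  for (;;) {
    auto start = std::chrono::steady_clock::now();
    LeafNode node = read_leaf(leaf);
    auto elapsed = std::chrono::steady_clock::now() - start;

    // 4bit versions wrap after 16 increments; a read that takes too long
    // could miss a full wraparound, so retry it conservatively.
    if (elapsed > std::chrono::microseconds(8)) continue;
    if (node.fnv != node.rnv) continue;  // node-level check: split/merge in flight

    bool retry = false;
    for (const LeafEntry &e : node.entries) {
      if (e.key != key) continue;
      if (e.fev != e.rev) { retry = true; break; }  // entry-level check: concurrent update
      return e.value;
    }
    if (!retry) return std::nullopt;     // key not present in this leaf
  }
}
```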
Range query. For a range query, the client thread issues multiple RDMA_READ commands in parallel to fetch the targeted leaf nodes, and then checks the leaf nodes' consistency in the same way as the lookup operation. Of note, Sherman does not guarantee that a range query is atomic with respect to concurrent write operations. If upper applications (for example, a transaction engine) need snapshot semantics for range queries, they must use other mechanisms to avoid phantom problems.
Discussion. Our two-level version mechanism assumes that an RDMA_READ is performed in increasing address order. However, a recent work24 shows that this assumption does not always hold: for 512B data blocks, 0.396% of RDMA_READ operations are executed in an out-of-order manner. This means that the two-level version mechanism may induce data inconsistency in some cases of concurrent access. We have fixed this issue in our recent work.19
4.5 Command combination.
To enforce ordering between dependent RDMA commands (for example, writing back a node and then releasing the lock), existing RDMA-based indexes use an expensive approach: issuing the following RDMA command only after receiving the acknowledgement of the preceding one.25 Yet, we observe that RDMA already provides a strong ordering property at the hardware level: In a reliable connected (RC) queue pair, RDMA_WRITE commands are transmitted in the order they are posted, and the NIC at the receiver side executes these commands in order.7,21 By leveraging this ordering property, Sherman combines multiple RDMA_WRITE commands in a write operation to reduce round trips.
There are two cases where Sherman combines multiple RDMA_WRITE commands. First, since a tree node and its associated lock co-locate at the same MS, the write-back of the tree node and the lock release can be combined through a QP, as opposed to issuing the unlock request only after receiving the acknowledgement of the write-back. Thus, one round trip is saved and the critical path shortens. Second, when a node (we call it A here) splits, we check whether the newly allocated sibling node belongs to the same MS as A. If so, three RDMA_WRITE commands can be combined together: the write-back of the sibling node, the write-back of A, and the release of A's lock.
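The first case can be sketched as follows, reusing the LeafNode layout from earlier; post_write and wait_last_completion are hypothetical wrappers (posting unsignaled work requests and waiting only for the final, signaled one), and RC's in-order execution guarantees the write-back lands before the unlock.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical wrappers over one RC QP to the target MS: post_write() only
// enqueues an RDMA_WRITE work request; wait_last_completion() waits for the
// completion of the last posted (signaled) request.
void post_write(uint16_t ms_id, uint64_t raddr, const void *buf, size_t len);
void wait_last_completion(uint16_t ms_id);

void write_back_and_unlock(uint16_t ms_id, uint64_t node_addr,
                           const LeafNode &node, uint64_t lock_offset) {
  const uint16_t unlocked = 0;
  // Both writes travel on the same QP; the MS-side NIC applies them in
  // posting order, so one completion (one round trip) covers write-back
  // plus lock release. A node split would combine three writes the same way.
  post_write(ms_id, node_addr, &node, sizeof(node));
  post_write(ms_id, lock_offset, &unlocked, sizeof(unlocked));
  wait_last_completion(ms_id);
}
```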
5. Evaluation
In this section, we evaluate Sherman using a set of benchmarks to demonstrate its efficacy.
5.1 Setup.
Hardware Platform. Since memory disaggregation hardware is unavailable, we use a cluster of commodity, off-the-shelf servers to emulate MSs and CSs by limiting their usages of CPUs and memory.15 Our cluster consists of eight servers, each of which is equipped with 128GB DRAM, two 2.2GHz Intel Xeon E5-2650 v4 CPUs (24 cores in total), and one 100Gbps Mellanox ConnectX-5 NIC, installed with CentOS 7.7.1908. All these servers are connected with a Mellanox MSB7790-ES2F switch. For Mellanox ConnectX-5 NICs, the versions of driver and firmware are OFED 4.7-3.2.9.0 and 16.26.4012, respectively. Due to the limited size of our cluster, we emulate each server as one MS and one CS. Each MS owns 64GB DRAM and two CPU cores, and each CS owns 1GB DRAM and 22 CPU cores.
Compared systems. FG25 is a B+Tree that supports disaggregated memory. Since FG is not open source, we implement it from scratch. For a fair comparison, we add necessary optimizations to it: (i) an index cache for reducing remote accesses, and (ii) using RDMA_WRITE to release locks rather than the expensive atomic verb RDMA_FAA. To distinguish our modified version of FG from the original one, we call it FG+ in the evaluation. The performance of FG+ is higher than that reported in the FG paper.25
Workloads. We explore different aspects of the systems by using YCSB workloads.3 We use five types of read-write ratios, as shown in Table 2. Note that insert operations include updating existing keys (about 2/3 of all insert operations). There are two types of key popularity: uniform and skewed. In uniform workloads, all keys have the same probability of being accessed. Skewed workloads follow a Zipfian access distribution (the Zipfian parameter, that is, skewness, is 0.99 by default), which is common in production environments.3 Unless otherwise stated, all the experiments are conducted with eight MSs and eight CSs. Each CS owns a 500MB index cache, and launches 22 client threads (176 in total in our cluster). For each experiment, we bulkload the tree with one billion entries (8byte key, 8byte value), with nodes 80% full, and then perform the specified workloads. The size of a tree node (that is, internal node and leaf node) is 1KB.
Table 2. Read-write ratios of the five workload types.

| Workload        | Insert | Lookup | Range Query |
|-----------------|--------|--------|-------------|
| write-only      | 100%   |        |             |
| write-intensive | 50%    | 50%    |             |
| read-intensive  | 5%     | 95%    |             |
| range-only      |        |        | 100%        |
| range-write     | 50%    |        | 50%         |
5.2 Overall performance.
To analyze Sherman’s performance, we break down the performance gap between FG+ and Sherman through applying each technique one by one. Figures 5 and 6 show the results under skewed and uniform workloads, respectively. In these two figures, +Combine stands for command combination technique. +On-Chip and +Hierarchical are two design parts of HOCL: leveraging on-chip memory of NICs to store locks, and hierarchical structure with handover mechanism, respectively. +2-Level Ver represents two-level version mechanism, and shows the final performance of Sherman.
5.2.1 Skewed workloads.
We make the following observations from Figure 5. First, in both write-only and write-intensive workloads, Sherman has much higher throughput and lower latency than FG+. In write-only workloads, Sherman achieves 24.7× higher throughput with 1.2× lower median latency (50p latency) and 35.8× lower 99th percentile latency (99p latency). In write-intensive workloads, Sherman achieves 23.6× higher throughput with 1.4×/30.2× lower 50p/99p latency.
Second, all techniques contribute to the high write efficiency of Sherman. Here we analyze each technique for write-intensive workloads (write-only workloads lead to the same conclusions): Command combination improves throughput by 3.37× and reduces 50p/99p latency by 1.14×/3.18×, since it saves at least one round trip for each insert operation and further shortens the critical path, decreasing the probability that conflicting requests are blocked. By putting locks into the on-chip memory of NICs, Sherman gains 3.06× and 3.48× improvements in throughput and 99p latency, respectively. This is because on-chip memory avoids PCIe transactions of lock operations at the MS side, which provides extremely high throughput for RDMA atomic verbs and thus can absorb more failed RDMA_CAS retries. Introducing the hierarchical lock structure brings 2.22× higher throughput, 1.12× lower 50p latency, and 2.68× lower 99p latency. This is because unnecessary RDMA_CAS retries from the same CS are avoided and the handover mechanism opportunistically saves one round trip. The two-level version mechanism does not bring considerable throughput improvement (only 3%), since the major bottleneck is concurrent conflicts rather than RDMA IOPS at this point; the 50p latency is reduced by 400ns (from 7.3μs to 6.9μs), since a smaller RDMA_WRITE IO size has shorter PCIe DMA time at both the CS side and the MS side.
Third, in read-intensive workloads (Figure 5(c)), Sherman does not present considerable performance improvement, as expected, since all the techniques we propose aim to boost the performance of write operations. Yet, two points are still worth noting: By saving round trips for the 5% insert operations, command combination reduces 99p latency from 15.3μs to 12.9μs; and Sherman increases the 50p latency by 100ns (2%), which we attribute to the unsorted leaf node layout, which causes traversal of the entire leaf node even for non-existent keys.
5.2.2 Uniform workloads.
As shown in Figure 6, compared with FG+, Sherman delivers 1.24× and 1.15× higher throughput in write-only and write-intensive workloads, respectively. These improvements mainly come from command combination and the two-level version mechanism. Command combination saves round trips, so each client thread can execute more insert operations per second; the two-level version mechanism reduces the IO size of RDMA_WRITE from node size to entry size, thus giving full play to RDMA's extremely high rate for small IOs. HOCL is designed for high-contention scenarios (that is, skewed workloads), so it does not increase throughput in uniform workloads. As for latency, Sherman reduces 50p/99p latency by 1.24×/2.01× and 1.19×/1.27× in write-only and write-intensive workloads, respectively, which is mainly attributed to command combination and HOCL: command combination saves round trips, and HOCL saves PCIe transaction time at the MS side.
5.3 Range query performance.
In this experiment, we evaluate the performance of range query by using range-only and range-write workloads. The targeted ranges follow the skewed access pattern. Table 3 shows the results, from which we make three observations. First, in range-only workloads, when the range size equals 100, FG+ outperforms Sherman by 2%. This is because the unsorted leaf layout in Sherman leads to unnecessary scans when targeted leaf nodes are partially occupied. Second, in range-only workloads, as the range size grows (that is, to 1,000), the throughput of Sherman and FG+ drops and is almost the same, since network bandwidth becomes the bottleneck. Third, in range-write workloads, Sherman outperforms FG+ by up to 1.82×. This is because Sherman's write operations save a considerable quantity of network resources for range query operations.
Table 3. Throughput of range queries.

|            | Range-only |           | Range-write |           |
|------------|------------|-----------|-------------|-----------|
| Range size | 100        | 1,000     | 100         | 1,000     |
| FG+        | 26.09Mops  | 5.77Mops  | 7.74Mops    | 2.49Mops  |
| Sherman    | 25.55Mops  | 5.76Mops  | 14.08Mops   | 3.12Mops  |
6. Related Work
Memory disaggregation is not a new idea: Lim et al.11 proposed it more than a decade ago to attack the problem of a growing imbalance in the compute-to-memory-capacity ratio. Recently, memory disaggregation has regained attention for two reasons. First, large-scale companies report low memory utilization in datacenters.6,15,17 Second, high-speed networks, such as RDMA and CXL, make the performance of remote accesses close to that of local accesses. Gao et al.5 find that, under a memory disaggregation architecture, 40–100Gbps network bandwidth and 3–5μs network latency can maintain application-level performance.
Many recent academic efforts have been devoted to making memory disaggregation practical. LegoOS15 designs a distributed operating system to manage disaggregated resources. pDPM18 explores how to deploy disaggregated persistent memory (PM) in an efficient manner. LegoBase23 co-designs database and memory disaggregation, achieving faster failure recovery speed than traditional architectures. Sherman focuses on building a fast index on disaggregated memory.
Industry is proposing memory disaggregation hardware. HPE’s “The Machine”9 connects a set of SoCs to a shared memory pool via a photonic network. IBM’s ThymesisFlow13 leverages OpenCAPI to enable CPUs to directly access remote disaggregated memory without software involvement. VMware researchers use cache coherence mechanisms to track applications’ memory accesses, reducing data amplification of disaggregated memory.2 Microsoft builds the CXL-based disaggregated memory pool for cloud deployment.10
7. Conclusion
We proposed and evaluated Sherman, an RDMA-based index on disaggregated memory. Sherman introduces a set of techniques to boost write performance and outperforms existing solutions. We believe Sherman demonstrates that combining RDMA hardware features and RDMA-friendly software designs can enable a high-performance index on disaggregated memory.
Acknowledgements
This work is supported by the National Natural Science Foundation of China (U22B2023, 62472242, 62332011), the Young Elite Scientists Sponsorship Program by CAST (2023QNRC001), and the Postdoctoral Fellowship Program of CPSF (GZC20231296).