1 Introduction
High-performance storage servers at Meta come in two flavors. The first, the 2P server, has two sockets of compute and a large DRAM capacity, as shown in Figure 1(a), and provides excellent performance at the expense of high power and cost. In contrast, the 1P server (Figure 1(b)) has one socket of compute and half the DRAM-to-compute ratio of the 2P server. The advantages of the 1P server are reduced cost and power and increased rack density [23]. For services with a small DRAM footprint, the 1P server is the obvious choice, and a large number of services at Meta fit in this category.
However, flash-based key-value stores are a class of workloads at Meta that may not perform adequately on a reduced-DRAM server and therefore may not take advantage of the cost and power benefits of the 1P server. Many of these workloads use RocksDB [26] as their underlying storage engine. RocksDB uses DRAM to cache frequently referenced data for faster access. A low DRAM-to-storage capacity ratio for these workloads leads to high DRAM cache miss rates, resulting in increased flash IO pressure, longer data access latency, and reduced overall application throughput. Flash-based key-value stores at Meta are organized into shards. One approach to improve the performance of each shard on DRAM-constrained servers is to reduce the number of shards per server. However, this approach can increase the total number of servers required, lower storage utilization per server, and dilute the total cost of ownership (TCO) benefits of the 1P server. This leaves us with a difficult choice between the 1P server, which is cost-effective but sacrifices performance, and the 2P server, which offers outstanding performance at high cost and power. An alternative solution that we explore in this article is using the recent Intel Optane PMem 100 Series SCMs (DCPMM) [45] to efficiently expand the volatile memory capacity of the 1P server platform. We use SCM to build new variants of the 1P server platform, as shown in Figure 1(c). In these variants, the memory capacity of the 1P server is extended by adding large SCM DIMMs alongside DRAM on the same DDR bus attached to the CPU memory controller.
Storage Class Memory (SCM) is a technology with the properties of both DRAM and storage. SCMs in DIMM form factor have been studied extensively because of their attractive benefits, including byte-addressability, data persistence, lower cost per GB than DRAM, high density, and relatively low power consumption. This has led to abundant research on the use of SCM as memory and as persistent storage, ranging from optimizations with varying memory hierarchy configurations [32, 48, 49, 72, 89], to novel programming models and libraries [16, 79, 87], to file system designs [17, 20, 65] that adopt this emerging technology. Past research focused primarily on theoretical or simulated systems, but the recent release of DCPMM-enabled platforms from Intel motivates studies on production-ready hardware [6, 30, 47, 55, 68, 69, 80, 81, 85, 86]. The memory characteristics of DRAM, DCPMM, and flash are shown in Table 1. Even though DCPMM has higher access latency and lower bandwidth than DRAM, it offers much higher density, lower cost, and lower power consumption, and its access latency is two orders of magnitude lower than flash. Currently, DCPMM modules come in 128 GB, 256 GB, and 512 GB capacities, much larger than DRAM DIMMs, which typically range from 4 to 32 GB in a data-center environment. Hence, DCPMM provides tremendously higher density. Beyond attaching SCM to the DDR bus, recent high-bandwidth, low-latency IO interconnects such as Compute Express Link (CXL) [1, 50] allow us to expand server memory capacity with SCM without the limitations of the DDR bus. If we use this memory efficiently (in cost, power, and performance) as an extension to DRAM, then we can build dense, flexible servers with large memory and storage capacity, while using fewer DIMMs and lowering the TCO.
Although recent works have characterized SCM [47, 85], the performance gain achievable in large commercial data centers by utilizing SCM remains an open question. It is unclear how to configure DRAM and SCM efficiently to benefit large-scale service deployments in terms of cost, power, and performance, and discovering the use cases within a large-scale deployment that profit from SCM has also been challenging. To address these challenges for RocksDB, we first profiled all flash-based KV store deployments at Meta to identify where SCM fits in our environment. These studies revealed an abundance of read-dominated workloads, which focused our design efforts on improving read performance. This is consistent with previous work [8, 11, 76], where faster reads improved overall performance for workloads serving billions of reads every second. We then identified the largest memory-consuming component of RocksDB, the block cache used for serving read requests, and redesigned it as a hybrid tiered cache that leverages the latency difference between DRAM and SCM. In the hybrid cache, DRAM serves as the first-tier cache holding frequently accessed data for the fastest read access, while SCM serves as a large second-tier cache storing less frequently accessed data. We then implemented cache admission and memory allocation policies that manage data transfer between DRAM and SCM. To evaluate the tiered cache implementations, we characterized three large production RocksDB use cases at Meta using the methods described in Reference [12] and distilled the data into new benchmark profiles for db_bench [25]. Our results show an 80% improvement in throughput, a 20% improvement in P95 latency, and a 43–48% reduction in cost for these workloads when we add SCM to existing server configurations. In summary, we make the following contributions:
• We characterized real production workloads, identified the SCM use case that benefits most in our environment, and developed new db_bench profiles for accurately benchmarking RocksDB performance improvements.
• We designed and implemented a new hybrid tiered cache module in RocksDB that manages DRAM- and SCM-based caches hierarchically, based on the characteristics of these memories. We implemented three admission policies for handling data transfer between the DRAM and SCM caches to efficiently utilize both memories. This implementation enables any application that uses RocksDB as its KV store back-end to easily use DCPMM.
• We evaluated our cache implementations on a newly released DCPMM platform using commercial data-center workloads. We compared server configurations with different DRAM/SCM sizes and determined the cost, power, and performance of each configuration relative to existing production platforms.
• We matched the performance of large-DRAM-footprint servers using small DRAM plus additional SCM while decreasing the TCO of read-dominated services in a production environment.
The rest of the article proceeds as follows. In Section 2, we provide background on RocksDB and the DCPMM hardware platform, and briefly describe our workloads. Sections 3 and 4 explain the design and implementation of the hybrid cache we developed. In Section 5, we describe our system configurations and experimental setup. Our experimental evaluations and results are provided in Section 6. We then discuss future directions and related work in Sections 7 and 8, respectively, and conclude in Section 9.
4 DRAM-SCM Hybrid-cache Module
In our RocksDB deployment, we place the memtables, index blocks, and filter blocks in DRAM. We then designed a new hybrid-cache module that allocates the block cache across DRAM and SCM. The database SST files and logs reside on flash. The overall placement of RocksDB components in the memory system is shown in Figure 8(a). Our goal in designing the new hybrid-cache module is to utilize DRAM and SCM hierarchically based on their read access latency and bandwidth characteristics. In our design, hot blocks are placed in DRAM for the lowest-latency data access, and colder blocks are placed in SCM as a second tier. The dense SCM-backed hybrid block cache provides a larger effective capacity than is practical with DRAM alone, leading to higher cache hit rates. This dramatically decreases the IO bandwidth demand on the SST files stored on slower underlying flash media.
The block cache is an integral data structure that is completely managed by RocksDB. Likewise, in our implementation, the new hybrid-cache module is fully managed by RocksDB. The module acts as an interface between RocksDB and the DRAM and SCM block caches, and fully manages the caches' operations. The overall architecture of the hybrid cache is shown in Figure 8(c). The details of its internal components are as follows:
4.1 Block-cache Lists
The hybrid cache is a new top-level module in RocksDB that maintains a list of underlying block caches in different tiers. The caches in this list extend the existing RocksDB block cache with its LRU replacement policy. Note that our implementation uses one DRAM cache and one SCM cache, but the module can manage more than two caches, such as multiple DRAM and SCM caches arranged in a more complex hierarchy.
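The sketch below illustrates this tier-list structure; it is a minimal C++ outline with illustrative names (TierBlockCache, HybridCache), not the actual RocksDB classes.

#include <cstddef>
#include <memory>
#include <vector>

// Minimal stand-in for one tier (DRAM or SCM); in the real module each tier
// extends the existing RocksDB LRU block cache.
struct TierBlockCache {
  size_t capacity_bytes = 0;  // size limit of this tier
  int tier_level = 0;         // 0 = DRAM (fastest), 1 = SCM, ...
};

// Sketch of the top-level hybrid cache: an ordered list of block-cache tiers,
// fastest first. The module is agnostic to how many tiers are configured.
class HybridCache {
 public:
  explicit HybridCache(std::vector<std::shared_ptr<TierBlockCache>> tiers)
      : tiers_(std::move(tiers)) {}

  std::shared_ptr<TierBlockCache> tier(size_t i) const { return tiers_.at(i); }
  size_t num_tiers() const { return tiers_.size(); }

 private:
  std::vector<std::shared_ptr<TierBlockCache>> tiers_;  // ordered fast -> slow
};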
4.1.1 Block-cache Architecture and Components.
The internal structures of the DRAM and SCM caches, which are both derived from the block cache, are shown in Figure 8(c). The block-cache storage is divided into cache entries and tracked in a hashtable. Each cache entry holds a key, a data block, metadata such as the key size, hash, and current cache usage, and a reference count of the cache entry outside the block cache. The data block is composed of multiple key-value pairs, as shown in Figure 8(b), and a binary search is performed to find a key-value pair within a data block. The data block size is configurable in RocksDB; in our case, the optimal size was 16 KB. Larger data blocks reduce the number of index blocks, so with 16 KB blocks we were able to shrink the index blocks, making room for data blocks within our limited DRAM capacity. Every block cache has configs that are set externally, including its size, a threshold for moving data, pointers to the other caches for data movement, and the memory allocator for the cache. The cache maintains an LRU list that tracks cache entries in order from most to least recently used. The helper functions are used for incrementing references, checking against the reference threshold, transferring blocks from one cache to another, checking size limits, and so on. For the components listed above, we extended and modified RocksDB to support the tiered structure and different admission policies, and we designed new methodologies to enable data movement between caches and to support memory allocation from different memory types.
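The following is a minimal sketch of the per-entry and per-tier structures described above; the field names (CacheEntry, BlockCacheConfig, and their members) are illustrative, not the actual RocksDB identifiers.

#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Sketch of one block-cache entry.
struct CacheEntry {
  std::string key;                 // block key (e.g., SST file number + block offset)
  std::vector<char> block;         // the cached data block (e.g., 16 KB of KV pairs)
  uint32_t hash = 0;               // hash of the key, used by the hashtable
  size_t charge = 0;               // bytes charged against the cache size
  uint32_t refs = 0;               // external references; 0 means eligible for the LRU list
  CacheEntry* lru_prev = nullptr;  // links into the per-tier LRU list
  CacheEntry* lru_next = nullptr;
};

// Per-tier configuration, set externally when the hybrid cache is created.
struct BlockCacheConfig {
  size_t capacity_bytes;        // size limit of this tier
  uint32_t move_ref_threshold;  // reference threshold used by some admission policies
  int next_tier;                // tier that receives evicted/demoted blocks (-1: none)
  // A memory allocator targeting DRAM or SCM would also be referenced here.
};

In stock RocksDB, the data block size mentioned above corresponds to the block_size field of BlockBasedTableOptions.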
4.1.2 Data Access in the Block Cache.
A block in the cache is accessed by components external to the block cache, such as the multiple reader clients of the RocksDB database. The number of external referencers is tracked by the reference count: a mapping to a block is created when it is referenced externally, which increments the reference count, and when a referencer no longer needs the block, the mapping is released and the reference count is decremented. If a block has zero external references, then it remains in the hashtable and is tracked by the LRU list; if the block is referenced again, it is removed from the LRU list. Newly released blocks with no external references are placed at the top of the LRU list as the most recently used blocks, and when blocks are evicted, the least recently used blocks at the bottom are evicted first. The block cache holds read-only data, so there is no dirty data to manage; therefore, when transferring data between DRAM and SCM, we never have to write back dirty blocks.
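A minimal sketch of this interaction between reference counts and the LRU list is shown below; Entry, Tier, and the function names are illustrative, using standard containers in place of the real hashtable and intrusive LRU list.

#include <cstdint>
#include <list>
#include <string>
#include <unordered_map>

struct Entry {
  std::string key;
  uint32_t refs = 0;                    // external references to this block
  std::list<Entry*>::iterator lru_pos;  // position in the LRU list when refs == 0
  bool in_lru = false;
};

struct Tier {
  std::unordered_map<std::string, Entry> table;  // hashtable of all cached blocks
  std::list<Entry*> lru;                         // front = most recently used
};

// Lookup: a hit increments the reference count and removes the entry from the
// LRU list, because referenced blocks must not be evicted.
Entry* Lookup(Tier& t, const std::string& key) {
  auto it = t.table.find(key);
  if (it == t.table.end()) return nullptr;  // miss in this tier
  Entry& e = it->second;
  if (e.refs == 0 && e.in_lru) { t.lru.erase(e.lru_pos); e.in_lru = false; }
  e.refs++;
  return &e;
}

// Release: when the last external reference is dropped, the block re-enters
// the LRU list at the front (most recently used). Eviction takes from the back.
void Release(Tier& t, Entry& e) {
  if (--e.refs == 0) { t.lru.push_front(&e); e.lru_pos = t.lru.begin(); e.in_lru = true; }
}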
4.2 Cache Admission Policies
Identifying and retaining blocks in DRAM/SCM based on their access frequencies requires proactive management of data transfer between DRAM, SCM, and flash. Hence, we developed the following block-cache admission policies.
4.2.1 DRAM First Admission Policy.
In this admission policy, new blocks read from flash are first inserted into the hashtable of the DRAM cache. The block-cache data structures are size-limited; hence, when the size of the blocks allocated in the DRAM cache exceeds its limit, the oldest entries tracked by the DRAM LRU list are moved to the next-tier cache (the SCM cache) by the data mover function of the DRAM cache, using the SCM cache's memory allocator. On lookups, both the DRAM and SCM caches are searched until the cache block is found; if it is not found, a flash read is initiated. Similar to the DRAM cache, when the capacity of the SCM cache exceeds its limit, the oldest entries in the SCM LRU list are freed to accommodate new cache blocks evicted from the DRAM cache.
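The sketch below captures the DRAM-first insertion and eviction cascade in a simplified standalone model that ignores reference counting; all names (SimpleTier, Block, Insert, EvictOne) are illustrative, not the actual implementation.

#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

struct Block { std::string key; std::vector<char> data; };

struct SimpleTier {
  size_t capacity = 0, usage = 0;
  std::unordered_map<std::string, Block> table;  // hashtable of cached blocks
  std::list<std::string> lru;                    // front = MRU, back = LRU
};

static void Insert(SimpleTier& t, Block b) {
  const std::string key = b.key;   // keep the key before moving the block
  t.usage += b.data.size();
  t.lru.push_front(key);
  t.table.emplace(key, std::move(b));
}

// Demote or drop the least recently used block of `from`; `to == nullptr`
// models the last tier, whose evicted blocks are simply freed.
static void EvictOne(SimpleTier& from, SimpleTier* to) {
  const std::string key = from.lru.back();
  from.lru.pop_back();
  auto it = from.table.find(key);
  from.usage -= it->second.data.size();
  if (to != nullptr) Insert(*to, std::move(it->second));  // move block into next tier
  from.table.erase(it);
}

// DRAM-first: new blocks from flash enter DRAM; DRAM overflow cascades into
// SCM; SCM overflow is freed.
void InsertDramFirst(SimpleTier& dram, SimpleTier& scm, Block block) {
  Insert(dram, std::move(block));
  while (dram.usage > dram.capacity && !dram.lru.empty()) EvictOne(dram, &scm);
  while (scm.usage > scm.capacity && !scm.lru.empty()) EvictOne(scm, nullptr);
}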
4.2.2 SCM First Admission Policy.
In this admission policy, new blocks read from flash are first inserted into the hashtable of the SCM cache. Unlike the DRAM-first admission policy, this policy has a configurable threshold for moving data from the SCM cache to the DRAM cache. When the external reference count of a cache entry in the SCM cache surpasses the reference threshold, the block is considered hot and is migrated to the DRAM cache for faster access. The data movement in this case is handled by the data mover function of the SCM cache. When the DRAM and SCM caches reach their capacity limits, the oldest LRU blocks are evicted from both caches: in the DRAM cache, LRU entries are moved back to the SCM cache, whereas in the SCM cache, LRU entries are freed to accommodate new block insertions. On lookup, both the DRAM and SCM caches are searched until the cache block is found.
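The following sketch reuses the simplified SimpleTier/Block model from the previous sketch and approximates the external reference count with a hit counter; the threshold value and all names are illustrative, not values or identifiers from the actual implementation.

#include <cstdint>

static constexpr uint32_t kPromoteThreshold = 2;  // example value only

// New blocks from flash land in the SCM tier; SCM overflow is freed.
void InsertScmFirst(SimpleTier& dram, SimpleTier& scm, Block block) {
  (void)dram;  // DRAM is only populated by promotions in this policy
  Insert(scm, std::move(block));
  while (scm.usage > scm.capacity && !scm.lru.empty()) EvictOne(scm, nullptr);
}

// Lookup searches DRAM, then SCM. Blocks referenced at least kPromoteThreshold
// times while in SCM are considered hot and are moved into the DRAM cache.
Block* LookupScmFirst(SimpleTier& dram, SimpleTier& scm,
                      std::unordered_map<std::string, uint32_t>& hits,
                      const std::string& key) {
  if (auto d = dram.table.find(key); d != dram.table.end()) return &d->second;
  auto s = scm.table.find(key);
  if (s == scm.table.end()) return nullptr;  // miss in both tiers: read from flash
  if (++hits[key] < kPromoteThreshold) return &s->second;
  // Promote: remove from SCM, make room in DRAM (demotions go back to SCM), insert.
  Block b = std::move(s->second);
  scm.usage -= b.data.size();
  scm.lru.remove(key);
  scm.table.erase(s);
  while (dram.usage + b.data.size() > dram.capacity && !dram.lru.empty())
    EvictOne(dram, &scm);
  Insert(dram, std::move(b));
  while (scm.usage > scm.capacity && !scm.lru.empty()) EvictOne(scm, nullptr);
  return &dram.table.find(key)->second;
}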
4.2.3 Bidirectional Admission Policy.
In the Bidirectional admission policy, as in the DRAM-first policy, new data blocks are inserted into the DRAM cache. As the DRAM and SCM caches reach their capacity limits, the oldest LRU entries are evicted from the DRAM cache into the SCM cache and are freed in the case of the SCM cache. The difference from the DRAM-first policy is that, after the oldest LRU entries are evicted from the DRAM cache to the SCM cache, an entry whose external reference count surpasses a preset threshold is transferred back to the DRAM cache. This property allows us to re-capture fast access performance for blocks with inconsistent temporal access patterns.
In the hybrid cache, we can select any of the three admission policies, and a new policy can easily be added by configuring how data is inserted, looked up, and moved across the list of block caches. These configs are global parameters in the top-level hybrid cache and are used by the block-cache operations manager and the list of block caches. Optionally, the thresholds for moving data in the SCM-first and Bidirectional policies can adapt to the current usage of the caches, but in our experiments we did not see a benefit from changing the values. We also analyzed different threshold values and show the optimal thresholds for the SCM-first and Bidirectional policies in our evaluations.
4.3 Hybrid-cache Configs
The hybrid-cache configurations are set outside of the module by RocksDB and include pointers to the configs of all block caches, the number of block caches, their IDs and tier numbers, and the admission policy to use. Configs are used during instantiation and at run time to manage database operations.
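A minimal sketch of such a top-level configuration is shown below; the enum and field names are illustrative, not the actual RocksDB identifiers.

#include <cstddef>
#include <cstdint>
#include <vector>

enum class AdmissionPolicy { kDramFirst, kScmFirst, kBidirectional };

struct BlockCacheConfig;  // per-tier config sketched in Section 4.1.1

// Global parameters of the top-level hybrid cache, set by RocksDB.
struct HybridCacheConfig {
  std::vector<BlockCacheConfig*> cache_configs;  // pointers to per-cache configs
  size_t num_caches = 2;                         // e.g., one DRAM and one SCM cache
  std::vector<int> cache_ids;                    // IDs of the block caches
  std::vector<int> tier_numbers;                 // tier of each cache (0 = DRAM)
  AdmissionPolicy policy = AdmissionPolicy::kDramFirst;
  uint32_t promote_ref_threshold = 2;            // used by SCM-first / Bidirectional
};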
4.4 Block-cache Operation Management
This unit redirects external RocksDB operations such as insert, lookup, and update to the target block cache based on the admission policy. For example, it decides whether an incoming insert request should go to the DRAM or the SCM cache.
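A minimal sketch of this dispatch logic is shown below, building on the illustrative HybridCacheConfig and policy routines sketched earlier; it is not the actual RocksDB code.

// Route an insert to the DRAM or SCM tier based on the configured admission policy.
void DispatchInsert(const HybridCacheConfig& cfg, SimpleTier& dram,
                    SimpleTier& scm, Block block) {
  switch (cfg.policy) {
    case AdmissionPolicy::kScmFirst:
      InsertScmFirst(dram, scm, std::move(block));  // new blocks land in SCM
      break;
    case AdmissionPolicy::kDramFirst:
    case AdmissionPolicy::kBidirectional:           // both insert into DRAM first
      InsertDramFirst(dram, scm, std::move(block));
      break;
  }
}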
8 Related Work
Performance analysis and characterization of DCPMM: Recent studies have demonstrated the potential of commercially available Intel Optane memory for various application domains. References [30, 70] determined the performance improvement of DCPMM in Memory and App Direct modes for graph applications. References [68, 69] evaluated the performance of the DCPMM platform as volatile memory compared to DRAM for HPC applications. DCPMM performance for database systems was shown in References [71, 75, 78, 83, 86], both as a memory extension and for persisting data. Works such as References [39, 47, 80, 85] also characterize and evaluate DCPMM when working alone or alongside DRAM. In particular, Reference [85] identified how DCPMM characteristics deviate from earlier simulation assumptions. While these works shed light on the usage of DCPMM for data-intensive applications, our work builds on their memory characterization findings to analyze the performance of DCPMM for large-scale data-center production workloads. We focus on utilizing the DCPMM platform to the best of its capability and studying its possible usage as a cost-effective memory extension for future data-center system designs.
Hybrid DRAM-SCM systems: Previous works studied hybrid DRAM-SCM systems to understand how these memories with different characteristics can be used together and how they influence existing system and software designs. References [9, 35, 66, 84] have shown the need to redesign existing key-value stores and database systems to account for the access latency differences between DRAM and SCM. Similarly, guided by the latency differences between these memories, our implementation carefully places hot blocks in DRAM and colder blocks in SCM. When deploying hybrid memory, another question that arises is how to manage data placement between DRAM and SCM. In this regard, Reference [73] demonstrated efficient page migration between DRAM and SCM based on the memory access pattern observed in the memory controller. In addition, References [10, 13, 19, 34, 52, 53, 54, 82, 88] perform data/page transfer by profiling and tracking information such as memory access patterns, read/write intensity to a page or data item, resource utilization by workloads, and memory features of DRAM and SCM, at the hardware, OS, and application levels. These works aim to generalize the usage of DRAM and SCM to various workloads without involving the application developer, hence requiring hardware and software monitoring that is transparent to application developers. In our case, however, the application-level structure of RocksDB exposes separate read and write paths and the access frequency of data blocks. This motivated us to implement our designs in software without requiring any additional overhead in the OS or hardware.
RocksDB performance improvements: Reference [21] demonstrated how to decrease the memory footprint of MyRocks, which is built on top of RocksDB, by implementing a secondary block cache on block-access-based non-volatile memory (NVM). While their method also decreases the DRAM size required in the system, the block-based nature of the NVM increases read amplification, because the key-value size in RocksDB is significantly smaller than the block size. In our methods, using byte-addressable SCM avoids such issues.