3.1 Overview
Figure 6 shows the overall architecture of LAB-DB, which builds on RocksDB. The key architecture change is that LAB-DB replaces \(L_0\) of the LSM tree with a B\(^{+}\) tree and then places it on NVM instead of the storage device. Since \(L_0\) is at the highest level, reducing the compaction frequency can significantly reduce not only the \(L_0\) compaction overhead but also the cascading lower-level compaction overhead from \(L_0\) compaction.
LSM Tree Augmented with B\(^{+}\) Tree (LAB). A B\(^{+}\) tree is fundamentally different from the SSTable files in the LSM tree in that it supports in-place updates. Rather than simply appending new put requests to SSTable files and merging them later, the B\(^{+}\) tree immediately checks whether the tree already has an entry for the target key. If so, its value is updated in place. If not, a node for the new key-value pair is inserted into the tree. The primary advantage of the B\(^{+}\) tree is that it maintains a concise representation of the current set of key-value pairs. In contrast, SSTable files potentially contain multiple value instances for the same key, and thus those files need to be merged later. The in-place update of the B\(^{+}\) tree greatly reduces the compaction overhead for two reasons. First, the in-place update utilizes space more efficiently, so \(L_0\) compaction is invoked less frequently. Second, even when compaction happens, its overhead is greatly reduced: LAB-DB does not need to merge multiple SSTable files in \(L_0\) since the B\(^{+}\) tree stores only a single value for each key.
Despite these advantages, replacing the \(L_0\) of the LSM tree with a B\(^{+}\) tree does not work well on a conventional system. This is because both reads and writes to a B\(^{+}\) tree generate random data accesses, which can result in significant performance degradation on block storage devices. To avoid this degradation, LAB-DB places the B\(^{+}\) tree on byte-addressable NVM. By doing so, it is possible to exploit all the advantages of a B\(^{+}\) tree-based \(L_0\) while avoiding the performance degradation that conventional block storage would suffer.
3.2 LAB-DB Operations
Write (Put) Operation. First, as in the conventional RocksDB, a key-value pair is recorded in the WAL. Originally, this log is directly written to the storage device to guarantee persistence. In LAB-DB, it is instead first stored in a 2 MB, double-buffered WAL buffer located in NVM. Once a buffer becomes full, the WAL in the buffer is flushed to the storage device as usual. This design choice substantially reduces write latency since persistence is guaranteed as soon as the data is written to the NVM device. It also improves SSD bandwidth utilization because our design issues more coarse-grained write requests (e.g., 1 MB) to the storage device. Storage devices that use a block interface, such as HDDs or SSDs, handle large I/O requests more efficiently than small ones. RocksDB's group logging mechanism attempts to consolidate small I/O requests for logging, but it is not always successful, so some small write operations are still logged individually, which lowers I/O bandwidth utilization. To address this, we utilize a small amount of NVM (e.g., 2 MB) to eliminate small-sized logging.
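As a concrete illustration of this buffering scheme, the following is a minimal sketch of a double-buffered NVM WAL buffer; the class and member names are ours rather than LAB-DB's, and a real implementation would additionally flush CPU caches (e.g., clwb followed by a fence) before acknowledging persistence.

```cpp
#include <cstddef>
#include <cstring>
#include <fstream>
#include <string>

class NvmWalBuffer {
 public:
  // `nvm_region` points to a 2 MB byte-addressable NVM allocation (two 1 MB halves).
  NvmWalBuffer(char* nvm_region, const std::string& wal_path)
      : nvm_(nvm_region), wal_file_(wal_path, std::ios::binary | std::ios::app) {}

  // Append one log record (assumed smaller than a half buffer). Persistence
  // holds once the copy into NVM completes; a real implementation would also
  // flush CPU caches before acknowledging the write.
  void Append(const char* record, std::size_t len) {
    if (offset_ + len > kHalfSize) {
      FlushHalf();  // active half is full: issue one coarse-grained (~1 MB) write
    }
    std::memcpy(ActiveHalf() + offset_, record, len);
    offset_ += len;
  }

 private:
  static constexpr std::size_t kHalfSize = 1 << 20;  // 1 MB per half, 2 MB total

  char* ActiveHalf() { return nvm_ + active_ * kHalfSize; }

  void FlushHalf() {
    // Write the filled half to the storage device as a single large request,
    // then switch halves so appends can continue immediately.
    wal_file_.write(ActiveHalf(), static_cast<std::streamsize>(offset_));
    wal_file_.flush();
    active_ ^= 1;
    offset_ = 0;
  }

  char* nvm_;
  std::ofstream wal_file_;
  int active_ = 0;
  std::size_t offset_ = 0;
};
```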
Moreover, LAB-DB optimizes the WAL structure. Specifically, the original WAL stores the replay log and key-value pairs in a contiguous region. Instead, we separate the region for values (Value Array in Figure 6) from the one for the replay log and keys (Key Array in Figure 6). After write-ahead logging, the key-value pair needs to be written to the MemTable. Here, LAB-DB extends the MemTable to also record the logical address of the value in the storage device. This design choice is made to efficiently utilize the values in the storage device as data records for the B\(^{+}\) tree (more details in the following paragraph). When the MemTable is full, it becomes an immutable one, and a flush operation is scheduled.
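To illustrate the extended MemTable entry, the sketch below shows the per-key information implied by this design; the names are ours, RocksDB's actual MemTable is a skiplist, and the exact address encoding in LAB-DB may differ.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Logical address of a value inside the on-storage Value Array.
struct ValueLocation {
  uint64_t file_number;  // which value-array (WAL) file holds the value
  uint64_t offset;       // byte offset of the value within that file
  uint32_t size;         // value length in bytes
};

// Extended MemTable entry: the in-memory value plus where the same value was
// logged, so a B+ tree leaf can later point to it directly.
struct MemTableEntry {
  std::string value;
  ValueLocation loc;
};

// Ordered map used here only to keep the sketch short; RocksDB's MemTable is a skiplist.
using MemTable = std::map<std::string, MemTableEntry>;
```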
Flush Operation. Originally, a flush operation is a simple operation that copies the list of key-value pairs in an immutable MemTable to the storage device and stores it as an SSTable file. With LAB-DB, the top level (\(L_0\)) of the LSM tree is replaced with a B\(^{+}\) tree resident on NVM, and thus the flush operation now needs to update the B\(^{+}\) tree with the list of key-value pairs passed from the MemTable. The B\(^{+}\) tree of LAB-DB consists of internal nodes and leaf nodes. LAB-DB proposes a cost-effective structure by utilizing the property of the B\(^{+}\) tree that the entire database can be recovered as long as the data in the leaf nodes is recoverable. For this reason, internal nodes are stored in DRAM, and leaf nodes are stored in NVM. An internal node points to another internal node or a leaf node, and each leaf node has a key and a pointer to the corresponding value in the value array resident on the storage device. During a flush operation, when a key that already exists in the B\(^{+}\) tree is inserted, an in-place update operation is performed: the old value offset in the B\(^{+}\) tree is updated to the newly inserted value offset.
Figure 7 illustrates the flush operation in LAB-DB step by step. Before the B\(^{+}\) tree is updated, all data in the WAL buffer is first flushed to the storage device ➊. Then, the list of key-value pairs is iteratively inserted (potentially in parallel) into the B\(^{+}\) tree ➋. During the insertion, if a node with the same key already exists, its pointer to the value is updated ➌. Once the B\(^{+}\) tree update completes, the key array portion of the WAL is no longer necessary and is thus cleared ➍. Note that LAB-DB only utilizes NVM to store the nodes of the B\(^{+}\) tree, not the data records (values) themselves. This effectively allows the B\(^{+}\) tree to track a larger number of key-value pairs in a space-efficient manner. To guarantee the correctness of the scan semantics, which require reading the values at a specific point in time, even with in-place \(L_0\) updates, LAB-DB schedules flushes such that scan and flush operations are mutually exclusive. This mutual exclusion incurs negligible performance overhead because a flush operation, leveraging fast NVM, takes very little time compared to much slower scan operations, as demonstrated by our evaluation with a scan workload (YCSB-E) in Section 4.2.
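The following sketch condenses the four steps of Figure 7, using std::map as a stand-in for the immutable MemTable and the B\(^{+}\) tree and placeholder functions for the WAL operations; it is a simplification rather than LAB-DB's actual flush code.

```cpp
#include <cstdint>
#include <map>
#include <string>

// std::map stands in for both structures; LAB-DB uses a skiplist MemTable and
// a FAST-FAIR B+ tree whose leaf nodes live in NVM.
using ImmutableMemTable = std::map<std::string, uint64_t>;  // key -> value offset on storage
using BPlusTreeL0       = std::map<std::string, uint64_t>;  // leaf entries: key -> value offset

void FlushWalBufferToStorage() { /* step 1: persist the remaining NVM WAL buffer */ }
void ClearKeyArrayRegion()     { /* step 4: drop the key-array part of the WAL   */ }

void FlushToL0(const ImmutableMemTable& imm, BPlusTreeL0& l0) {
  FlushWalBufferToStorage();                     // step 1: WAL buffer -> storage device
  for (const auto& [key, value_offset] : imm) {  // step 2: insert each key-value pair
    auto it = l0.find(key);
    if (it != l0.end()) {
      it->second = value_offset;                 // step 3: key exists -> update its
    } else {                                     //         value pointer in place
      l0.emplace(key, value_offset);
    }
  }
  ClearKeyArrayRegion();                         // step 4: key array no longer needed
}
```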
Compaction Operation. The above flush operation updates the B\(^{+}\) tree by inserting key-value pairs passed from the MemTable, which increases the B\(^{+}\) tree size. Once the total space utilized for B\(^{+}\) tree leaf nodes exceeds a specific threshold (e.g., 75% of the NVM size pre-allocated for \(L_0\)), \(L_0\) compaction happens. Specifically, if there is no existing immutable B\(^{+}\) tree, the current B\(^{+}\) tree is declared immutable, and the following flush operation will construct a new mutable B\(^{+}\) tree. Figure 8 illustrates the rest of the steps for a compaction operation in LAB-DB. A compaction operation for a portion of the immutable B\(^{+}\) tree is scheduled ➊. To select this portion, the leaf nodes of the immutable B\(^{+}\) tree are logically divided into multiple similarly sized chunks such that each chunk is responsible for a contiguous range of keys. Then, the chunk responsible for the leftmost range among the remaining chunks is scheduled for compaction. During this compaction, the values for the target chunk are read from the WAL stored on the storage device and are used to update the SSTable files in \(L_1\) whose range overlaps with this chunk ➋. Then, the leaf nodes responsible for the chunk are removed from NVM, and the relevant internal nodes are updated or deleted as well. Finally, if a particular \(L_0\) compaction processes the last chunk of the immutable B\(^{+}\) tree, the whole value array region responsible for the immutable B\(^{+}\) tree is cleared ➌.
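One incremental compaction step of Figure 8 can be sketched as follows, again with std::map standing in for the immutable B\(^{+}\) tree and placeholders for the storage-side operations; the 64 MB chunk target matches the implementation described in Section 3.4, while all names are illustrative.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

using ImmutableL0 = std::map<std::string, uint64_t>;  // leaf entries: key -> value offset

constexpr std::size_t kChunkBytes = 64u << 20;  // ~one MemTable worth of values (64 MB)

// Placeholders for the storage-side operations.
std::string ReadValueFromWal(uint64_t /*offset*/) { return std::string(128, 'v'); }
void MergeIntoL1(const std::vector<std::pair<std::string, std::string>>& /*kvs*/) {}
void ClearValueArrayRegion() {}

void CompactOneChunk(ImmutableL0& l0) {
  // Step 1: select the leftmost remaining key range whose values total ~64 MB.
  std::vector<std::string> keys;
  std::vector<std::pair<std::string, std::string>> kvs;
  std::size_t bytes = 0;
  for (auto it = l0.begin(); it != l0.end() && bytes < kChunkBytes; ++it) {
    std::string value = ReadValueFromWal(it->second);  // step 2: fetch values from the
    bytes += value.size();                             //         WAL on the storage device
    keys.push_back(it->first);
    kvs.emplace_back(it->first, std::move(value));
  }

  MergeIntoL1(kvs);  // step 2: merge with the L1 SSTables whose range overlaps this chunk

  for (const std::string& k : keys) l0.erase(k);  // remove compacted leaf entries from NVM

  if (l0.empty()) ClearValueArrayRegion();  // step 3: last chunk done -> free the value array
}
```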
Note that LAB-DB does not perform compaction on the whole B\(^{+}\) tree at once. Rather, LAB-DB makes it immutable and incrementally compacts portions of it with \(L_1\). This approach has two advantages. First, it helps maintain a high occupancy of the \(L_0\), so that the \(L_0\) can effectively be utilized as a cache for read operations that miss in the MemTable. Second, the finer-grained compaction leads to a tighter tail latency. LAB-DB does not affect the \(L_1\) or lower-level compaction mechanisms.
Read Operation. Overall, the LAB-DB read operation is similar to that of the RocksDB baseline. It first checks the MemTable, and if it does not find the key there, it searches the immutable MemTable. If it misses again, it then inspects \(L_0\) of the LSM tree, which is now managed with B\(^{+}\) tree structures. First, it checks the mutable B\(^{+}\) tree: starting from the root and internal nodes stored in DRAM, it traverses the tree and reaches a leaf node if the tree has a matching key. Then, it uses the pointer at the leaf node to retrieve the value from the storage device. If it does not find the key in the mutable B\(^{+}\) tree, it searches the immutable B\(^{+}\) tree and retrieves the value in a similar manner. If the key is not found in \(L_0\), it continues its search in the lower levels as in the conventional LSM tree.
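The lookup order can be summarized by the sketch below; std::map stands in for the MemTables and B\(^{+}\) trees, ReadValueFromStorage is a placeholder, and returning no result corresponds to falling through to \(L_1\) and below.

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <string>

using Table  = std::map<std::string, std::string>;  // (immutable) MemTable: key -> value
using L0Tree = std::map<std::string, uint64_t>;      // B+ tree leaves: key -> value offset

// Placeholder for reading a value from the on-storage value array.
std::string ReadValueFromStorage(uint64_t /*offset*/) { return "value-from-storage"; }

std::optional<std::string> Get(const std::string& key,
                               const Table& memtable, const Table& imm_memtable,
                               const L0Tree& mutable_l0, const L0Tree& immutable_l0) {
  if (auto it = memtable.find(key); it != memtable.end()) return it->second;
  if (auto it = imm_memtable.find(key); it != imm_memtable.end()) return it->second;

  // L0: traverse the mutable B+ tree first (internal nodes in DRAM, leaves in
  // NVM); the leaf stores a pointer into the value array on storage.
  if (auto it = mutable_l0.find(key); it != mutable_l0.end())
    return ReadValueFromStorage(it->second);
  if (auto it = immutable_l0.find(key); it != immutable_l0.end())
    return ReadValueFromStorage(it->second);

  return std::nullopt;  // miss in L0: continue with L1 and below as in RocksDB
}
```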
Recovery. The recovery process after a failure is mostly similar to RocksDB, but LAB-DB recovery requires slightly more work. Specifically, it needs to reconstruct the internal nodes of the B\(^{+}\) tree that were stored in DRAM. The recovery process of LAB-DB consists of two additional steps: (i) reading the key and offset information of the leaf nodes; (ii) replaying that information to reconstruct the B\(^{+}\) tree. Once these additional steps are completed, data recovery is performed through the WAL replay operation, which is the recovery process of RocksDB. Figure 9 shows the recovery time breakdown for LAB-DB using a 512 MB B\(^{+}\) tree and RocksDB-NVM, which is RocksDB with an 8 GB NVM \(L_0\), when using a 90 GB dataset. LAB-DB spends 0.164 seconds reading all leaf nodes and 1.177 seconds reconstructing the B\(^{+}\) tree, resulting in a total of 1.341 seconds for B\(^{+}\) tree recovery. Notably, the reconstruction process does not require much time because the B\(^{+}\) tree resides solely in DRAM and NVM, so there is no need to access storage devices. The WAL replay time for LAB-DB and RocksDB-NVM is 6.5 seconds and 11.3 seconds, respectively. The WAL replay time of LAB-DB is shorter because it can process requests faster than RocksDB-NVM through the B\(^{+}\) tree. The total recovery time of LAB-DB is 3.5 seconds shorter than that of the baseline RocksDB-NVM, indicating that LAB-DB has no negative impact on recovery time.
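The two additional recovery steps can be sketched as follows, with std::map standing in for the rebuilt B\(^{+}\) tree and placeholder functions for the NVM leaf scan and RocksDB's existing WAL replay.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct LeafEntry {
  std::string key;
  uint64_t value_offset;  // pointer into the on-storage value array
};

// Step (i): sequentially read the leaf entries that survived in NVM (placeholder).
std::vector<LeafEntry> ReadLeafEntriesFromNvm() { return {}; }
// Unchanged RocksDB recovery path (placeholder).
void ReplayRocksDbWal() {}

std::map<std::string, uint64_t> RecoverL0() {
  std::map<std::string, uint64_t> l0;
  for (const LeafEntry& e : ReadLeafEntriesFromNvm()) {
    l0.emplace(e.key, e.value_offset);  // step (ii): reinsertion rebuilds the
  }                                     // DRAM-resident internal nodes
  ReplayRocksDbWal();                   // then data recovery via WAL replay
  return l0;
}
```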
3.3 Log File Management
LAB-DB efficiently manages the \(L_0\) by applying the in-place update method of the B\(^{+}\) tree. However, as in-place updates occur, the log size continues to grow until compaction happens. To address this, we propose two techniques to reduce the log size.
First, we introduce two methods of constructing chunks, one of which is selected adaptively according to the phase of compaction. The first method scans consecutive leaf nodes (leaf scan), as mentioned in the previous section, and the second method collects the leaf nodes whose keys belong to the oldest log file (file scan) so as to remove that log file as quickly as possible. Figure 10 shows the latency of \(L_0\) compaction and the change in log size for the two methods with varying degrees of compaction progress. As shown in the figure, we observe that, as the compaction ratio increases, the file scan method shows latency similar to the leaf scan method but decreases the log size by 11%. As the compaction process makes progress, both the key range and the number of leaf nodes that need to be read through the file scan are reduced, thereby reducing the compaction time and the \(L_0\) data load time, respectively. Based on this observation, the following process is added to LAB-DB's \(L_0\) compaction operation: when the B\(^{+}\) tree is scheduled to become immutable, the utilization of compacted leaf nodes is calculated, and the leaf scan method is applied until 80% is reached; after that, the file scan method is applied to delete the log.
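The resulting selection policy can be sketched as follows; the function and enum names are ours, while the 80% switch point follows the description above.

```cpp
#include <cstddef>

enum class ChunkMethod { kLeafScan, kFileScan };

// Leaf scan: take consecutive leaf nodes (cheap, contiguous key range).
// File scan: take the leaf nodes whose keys belong to the oldest log file,
// so that file can be deleted as soon as possible.
ChunkMethod SelectChunkMethod(std::size_t compacted_leaf_nodes,
                              std::size_t total_leaf_nodes) {
  const double ratio =
      total_leaf_nodes == 0
          ? 0.0
          : static_cast<double>(compacted_leaf_nodes) / total_leaf_nodes;
  return ratio < 0.8 ? ChunkMethod::kLeafScan   // below 80% compacted
                     : ChunkMethod::kFileScan;  // past 80%: shrink the log faster
}
```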
Second, LAB-DB utilizes NVM to write the log for hot keys. As in-place updates occur more frequently, the performance of LAB-DB increases, but the log size grows accordingly. Buffering frequently accessed keys in NVM reduces the log size significantly by leveraging DRAM-like in-place writes. We add a component that manages an access count for each key in the B\(^{+}\) tree; for keys whose count exceeds a specific threshold, the log is recorded in NVM. Before write-ahead logging, LAB-DB is modified to check the location of the log through a B\(^{+}\) tree search.
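The decision logic can be sketched as follows; the per-entry counter field, the function name, and the threshold value of 16 are assumptions for illustration (the threshold is not specified here).

```cpp
#include <cstdint>
#include <map>
#include <string>

struct L0Entry {
  uint64_t value_offset = 0;  // pointer into the value array
  uint32_t access_count = 0;  // bumped on every put to this key
};

enum class LogTarget { kStorageWal, kNvmLog };

constexpr uint32_t kHotKeyThreshold = 16;  // assumed value, not taken from the text

// Performed before write-ahead logging: a B+ tree search decides where the
// log record for this key goes.
LogTarget ChooseLogTarget(std::map<std::string, L0Entry>& l0, const std::string& key) {
  auto it = l0.find(key);
  if (it == l0.end()) return LogTarget::kStorageWal;  // new/cold key: normal WAL path
  if (++it->second.access_count >= kHotKeyThreshold)
    return LogTarget::kNvmLog;                        // hot key: log in NVM in place
  return LogTarget::kStorageWal;
}
```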
3.4 Implementation
We implement LAB-DB by extending RocksDB [22] from Facebook (version 5.7). We utilize the memkind library [8] to allocate and access the byte-addressable NVM (i.e., Intel Optane DCPMM) with load/store instructions. Overall, we modify about 5,000 lines of code in RocksDB for the write, read, flush, and \(L_0\) compaction operations as follows.
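For reference, the snippet below shows how a persistent-memory pool can be obtained through memkind's PMEM kind and accessed with ordinary loads and stores; the mount point, pool size, and 256 MB leaf-node region are illustrative, and LAB-DB's actual allocation code may differ.

```cpp
#include <memkind.h>

#include <cstdio>
#include <cstdlib>

int main() {
  struct memkind* pmem_kind = nullptr;
  // Create a file-backed PMEM kind on an NVM filesystem mount; the path and
  // the 1 GB pool size are illustrative.
  if (memkind_create_pmem("/mnt/pmem", 1ULL << 30, &pmem_kind) != 0) {
    std::fprintf(stderr, "failed to create PMEM kind\n");
    return EXIT_FAILURE;
  }

  // Memory returned here is accessed with ordinary load/store instructions,
  // e.g. for the WAL buffer or the 256 MB B+ tree leaf-node area.
  void* leaf_area = memkind_malloc(pmem_kind, 256ULL << 20);
  if (leaf_area != nullptr) {
    // ... place B+ tree leaf nodes in this region ...
    memkind_free(pmem_kind, leaf_area);
  }

  memkind_destroy_kind(pmem_kind);
  return EXIT_SUCCESS;
}
```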
Put Operation. We modify the write_batch.cc module. We extend its WAL mechanism to implement the buffered I/O described in Section 3.2. We allocate 2 MB in NVM for this buffering and, as it is double-buffered, flush the buffer when the buffered data hits 1 MB. In addition, we also increase the size of the storage region allocated for WAL buffering to 20 GB. We also extend the MemTable to store the logical address of the current value in the storage device. Since we do not increase the MemTable size, the number of MemTable entries slightly decreases due to this overhead.
Flush. We heavily modify the flush_job.cc module so that the flush operation works as described in Section 3.2. For a high-performance implementation of the B\(^{+}\) tree, we utilize FAST-FAIR [26, 43], an open-source B\(^{+}\) tree. For the B\(^{+}\) tree leaf nodes, we allocate 256 MB of space. We do not pre-allocate the DRAM region for the internal nodes. Note that the size of the DRAM region for internal nodes is effectively bounded by the space allocated for the leaf nodes (e.g., 256 MB).
Compaction. We modify version_set.cc and db_impl_compaction_flush.cc to implement the modified \(L_0\) compaction scheme. For efficient compaction of a chunk from the B\(^{+}\) tree with the SSTable files in \(L_1\), LAB-DB first fetches the set of values for the keys belonging to the target chunk and stores them in memory. Then, the merge is performed in memory, and the outcome is stored in the storage device. We set the total size of the values in a single chunk to be similar to a single MemTable size (64 MB). Finally, we extend the compaction scheduler of RocksDB so that it takes the occupancy of the assigned \(L_0\) NVM area into account when determining the priority of compaction operations; that is, the new scheduler prioritizes \(L_0\) compaction when the size of the B\(^{+}\) tree in \(L_0\) approaches the pre-allocated limit (e.g., 75%).
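The occupancy check used to prioritize \(L_0\) compaction can be sketched as follows; the function and enum names are ours, while the 75% threshold follows the text.

```cpp
#include <cstddef>

enum class CompactionPriority { kNormal, kL0First };

// Prioritize L0 compaction once the B+ tree leaf area approaches its NVM budget.
CompactionPriority PickPriority(std::size_t l0_leaf_bytes,
                                std::size_t l0_nvm_budget_bytes) {
  // 0.75 mirrors the 75% threshold used to trigger/prioritize L0 compaction.
  return static_cast<double>(l0_leaf_bytes) >=
                 0.75 * static_cast<double>(l0_nvm_budget_bytes)
             ? CompactionPriority::kL0First
             : CompactionPriority::kNormal;
}
```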