1 Introduction
With the exponential growth of data, modern web-scale systems and applications place higher demands on the capacity and performance of their storage systems. Meanwhile, modern data centers widely deploy legacy block flash-based SSDs accompanied by over-provisioned capacity and a cumbersome Flash Translation Layer (FTL) [31, 58], which hinders meeting these performance and capacity demands as well as service-level agreements (SLAs) [36]. For example, Log-Structured Merge Tree (LSM-tree) based key-value stores such as RocksDB [10], LevelDB [34], and BigTable [13], which have become some of the most significant services in data centers, are inevitably affected by the internal activities of legacy block SSDs [5, 6, 52, 53, 55, 56]. In the worst case, the internal activities of legacy block SSDs may result in an order of magnitude higher latency and lower throughput [31, 58] compared with block SSDs without internal activities.
The emerging storage device, the Zoned Namespace (ZNS) SSD [5, 16], affords the LSM-tree new opportunities. ZNS effectively bridges the semantic gap between flash memory and applications by exposing the internal flash blocks as zones. Unlike the legacy block interface, ZNS divides the logical address space into a collection of fixed-sized zones. Each zone must be written sequentially and reset before being rewritten. This characteristic tallies perfectly with the sequential I/O pattern of the LSM-tree. RocksDB, one of the most popular LSM-tree based key-value stores, delivers remarkable throughput and an order of magnitude lower tail latency when integrated with ZNS SSDs [5], drawing both academic and industrial interest.
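To make this zone contract concrete, the following minimal C++ sketch (illustrative only, not the ZenFS or kernel ZNS API; the names Zone, Append, and Reset are hypothetical) models a fixed-capacity zone that accepts only sequential appends at its write pointer and must be reset before being rewritten.

#include <cstdint>
#include <stdexcept>

// Minimal model of a ZNS zone: sequential writes only, explicit reset.
struct Zone {
  uint64_t capacity;       // fixed zone capacity in bytes
  uint64_t write_pointer;  // next writable offset inside the zone

  explicit Zone(uint64_t cap) : capacity(cap), write_pointer(0) {}

  // Writes land exactly at the write pointer: sequential-only.
  void Append(uint64_t bytes) {
    if (write_pointer + bytes > capacity)
      throw std::runtime_error("zone full: reset before rewriting");
    write_pointer += bytes;
  }

  // Reset discards all data in the zone and rewinds the write pointer.
  void Reset() { write_pointer = 0; }
};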
However, from our perspective, the LSM-tree on ZNS SSDs still faces a formidable challenge. Modern ZNS SSDs typically maintain a gigantic zone size [41, 50], whereas Sorted String Tables (SSTables), which constitute most of the files of an LSM-tree, tend to have a significantly smaller size [26, 34]. To remove duplicate key-value pairs, the LSM-tree performs compactions, which merge-sort multiple SSTables, delete them, and generate new ones. Each zone usually stores multiple SSTables and thus becomes fragmented due to compactions, necessitating Garbage Collection (GC) for defragmentation. GC migrates valid data and resets fully invalidated zones to guarantee sufficient free space for subsequent writes. Because data migration entails excessive I/O, GC is seen as a performance killer [29, 33, 59, 61].
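The GC procedure described above can be summarized by the following hedged C++ sketch (a simplification, not the actual RocksDB/ZenFS code; SSTable, Zone, and CollectZone are illustrative names): valid SSTables are copied out of a fragmented victim zone before the zone is reset, and every copied byte is extra I/O.

#include <cstdint>
#include <vector>

struct SSTable { uint64_t bytes; bool valid; };
struct Zone { std::vector<SSTable> tables; };

// Migrate the still-valid SSTables out of a fragmented victim zone so that
// the zone can be reset; the return value is the migration I/O that GC adds.
uint64_t CollectZone(Zone& victim, Zone& active) {
  uint64_t migrated = 0;
  for (const SSTable& t : victim.tables) {
    if (t.valid) {                 // live data must be rewritten elsewhere
      active.tables.push_back(t);
      migrated += t.bytes;
    }
  }
  victim.tables.clear();           // zone reset: space becomes free again
  return migrated;
}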
To tame the GC challenge, there have been many efforts, yet the currently available approaches have their own limitations. Lifetime-based data placement [5] places SSTables with similar lifetimes in the same zone to alleviate zone fragmentation. However, it is challenging to accurately predict the lifetimes of SSTables in the high levels of the LSM-tree owing to their broad lifetime spans. To ensure that a zone can be reset without data migration, GearDB [59, 61] and Lifetime Leveling Compaction (LLCompaction) [29] present their respective zone-specific compaction policies. Unfortunately, both compaction policies aggravate compaction overhead, which in turn impairs overall performance. Besides, since both of them mandate specific compaction policies, incorporating existing compaction optimizations such as LDC [11], dCompaction [47], priority-driven compaction policies [8, 19], and Seek Compaction [34] into them is challenging.
With extensive experiments, we arrive at the following insights, which present new opportunities for taming the GC challenge. ① Lifetime-based data placement is not a panacea. The ideal case, where the SSTables stored in the same zone can be reset simultaneously, is impractical. ② Novel compaction policies do reduce data migration, yet they introduce more compaction overhead, diminishing performance, particularly under write-intensive workloads. ③ The root cause of GC is the mismatch between the gigantic zone size and the small SSTable size. Fortunately, the internal architecture of ZNS SSDs enables a smaller zone size, which is a potential solution. ④ While a smaller zone size is promising, the suboptimal performance due to under-utilized parallelism is its Achilles' heel. Our experiments demonstrate that simply employing small zones is insufficient. To tackle the GC challenge, we must propose a novel design to efficiently expose, leverage, and accelerate zones of various sizes.
Hence, we propose SplitZNS, a hardware/software co-designed approach that tackles the GC challenge by combining the inherent parallel architecture of ZNS SSDs with the multi-level peculiarity of the LSM-tree. The keys to SplitZNS are enabling zones of varying sizes, employing them appropriately, and maximizing parallelism utilization for small zones. First, SplitZNS introduces splitzones and subzones. Each splitzone is converted from a widezone by tweaking its zone-to-chip mapping. A splitzone is composed of multiple much smaller subzones managed by independent chips, which guarantees that the subzones within the same splitzone never interfere with each other. Second, SplitZNS employs splitzones only for high-level SSTables to alleviate the performance impact of the under-utilized parallelism of subzones. Third, following the peculiarity of the LSM-tree, SplitZNS proposes SubZone Ring, Read Scheduler, and Read Prefetcher to accelerate subzones. SubZone Ring employs a per-chip FIFO buffer to imitate a large-zone writing style; Read Prefetcher enhances parallelism utilization by prefetching data concurrently through multiple chips during compactions; Read Scheduler assigns query requests the highest priority to prevent compactions from blocking queries on subzones. To demonstrate the efficacy and efficiency of SplitZNS, we build an LSM-tree prototype based on RocksDB [10] with ZenFS [5] and evaluate it using synthetic and real-world workloads. The results show that SplitZNS achieves a significant performance advantage and substantially reduces I/O stress compared to its competitors [5, 29, 59]. In conclusion, we make the following contributions:
– We analyze the impact of gigantic zones for the LSM-tree on ZNS SSDs and the limitations of existing works. We reveal that a novel design that is both compatible with other optimizations and GC-efficient is necessary. On the basis of this analysis, we offer our own distinct insights (Section 3).
– We design and implement SplitZNS based on our insights. First, we introduce smaller zones via the concepts of splitzone and subzone. Second, we employ them appropriately to mitigate the performance impact of low parallelism. Third, we leverage the peculiarity of the LSM-tree and the architecture of ZNS SSDs to accelerate subzones (Section 4).
– We conduct extensive experiments that demonstrate the performance advantages of SplitZNS under different workloads. Specifically, SplitZNS achieves up to 2.77× the performance under write-only workloads and 1.79× under write-intensive workloads compared to lifetime-based data placement (Section 5).
The rest of this article is structured as follows. In Section 2, we give background on ZNS and the LSM-tree. In Section 3, we analyze existing works and present our insights. Section 4 elaborates on the design and implementation of SplitZNS. In Section 5, we evaluate and analyze the efficiency and efficacy of SplitZNS. Finally, we list related works and summarize our work in Section 6 and Section 8, respectively.
3 Motivation and Contributions
Designing a high-performance LSM-tree on ZNS SSDs poses unique challenges. In this section, we first analyze GC for the LSM-tree on ZNS SSDs with various zone sizes, which motivates SplitZNS's design. Then, we revisit two kinds of existing approaches, lifetime-based data placement and compaction schemes tailored for zones, and reveal their respective issues.
3.1 Widezones
Modern ZNS SSDs typically feature gigantic zones. The zone capacities of the INSPUR NS8600 and the ZN540, for instance, are 1440 MiB and 1077 MiB [41, 50], respectively. The reason is that SSDs are equipped with multi-level parallelism [25], which accounts for their superior performance. As shown in Figure 2, an SSD usually consists of multiple channels, each of which consists of multiple chips. Each chip consists of a number of dies, and each die is comprised of multiple planes. Each plane comprises numerous erase blocks. Read, write, and erase operations can be executed independently across chips and dies. To achieve optimal performance by simultaneously accessing multiple chips, commercial ZNS SSDs form a zone by combining multiple erase blocks from each chip [6, 7, 22, 23, 35].
Specifically, the SSD controller picks an erase block from each plane (i.e., a few erase blocks from each chip) to form a zone, as the widezone illustrated in Figure 2. Once an I/O request arrives, it can be divided into multiple sub-requests at the granularity of a flash page. These sub-requests can be dispatched to separate chips so that they are processed concurrently. With the advancement of flash memory technologies, such as Multi-Level Cell (MLC) [1] and 3D stacking [42], a flash cell can store more bits, making the erase block size grow. Consequently, the zone size is expanding as well.
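The following C++ sketch illustrates this striping (parameters such as kChips and kFlashPage are placeholders chosen for the example; real device geometries differ): a request is split into flash-page sub-requests that are spread round-robin over the chips backing the widezone.

#include <algorithm>
#include <cstdint>
#include <vector>

constexpr uint32_t kChips = 8;              // chips striped into one widezone
constexpr uint32_t kFlashPage = 16 * 1024;  // sub-request granularity (bytes)

struct SubRequest { uint32_t chip; uint64_t offset; uint32_t bytes; };

// Split a request into page-sized sub-requests and assign them to chips
// round-robin so that up to kChips chips serve the request concurrently.
std::vector<SubRequest> Stripe(uint64_t offset, uint64_t length) {
  std::vector<SubRequest> subs;
  for (uint64_t done = 0; done < length; done += kFlashPage) {
    uint64_t len = std::min<uint64_t>(kFlashPage, length - done);
    uint32_t chip = static_cast<uint32_t>(((offset + done) / kFlashPage) % kChips);
    subs.push_back({chip, offset + done, static_cast<uint32_t>(len)});
  }
  return subs;
}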
Unfortunately, gigantic zones have a significant negative effect on the performance of the LSM-tree. Larger zones result in more severe fragmentation, necessitating more data migration to reclaim zones. In addition to wasting I/O bandwidth, data migration might stall foreground writes, since free space is not available until the migration finishes. We conduct a quantitative analysis of the performance and the migrated data size for various zone sizes. For a fair comparison, zones of various sizes share identical I/O performance, which is achieved by adjusting the erase block size. To begin, we use FEMU to emulate an 80 GiB ZNS SSD and load 50 GiB of data into the database. Then, the overwrite workload (uniform and write-only) from RocksDB's db_bench is used to write another 10 million key-value pairs (16 B keys and 1 KiB values) into the database. The overwrite performance of various zone sizes is evaluated. Figures 3(a) and 3(b) show the results normalized to the 128 MiB zone size. As the zone size increases from 128 MiB to 1,024 MiB, the write amplification from data migration increases by up to 1.9× and the throughput declines by 73%, which confirms the negative effect of a gigantic zone size.
Although manually increasing the SSTable size can help alleviate the impact of a gigantic zone size, the SSTable size is not fixed and is usually small under many compaction policies [48, 51]. Simply adjusting the SSTable size is not feasible for these compaction policies. Moreover, to the best of our knowledge, oversized SSTables usually result in high compaction overhead, as is also noted in earlier studies (GearDB [59], LLCompaction [29]) and in practical production experience. Specifically, TiKV can use SSTables of up to 128 MiB only when compaction-guard (one of its features to reduce compaction overhead based on the characteristics of its upper application) is enabled [51].
The performance decline associated with larger zones can be attributed to the increased amount of data that must be migrated before resetting. Figure 3(c) displays the average size of migrated data for different zone sizes. When the zone size increases from 128 MiB to 1,024 MiB, the average size of migrated data increases from 43 MiB to 386 MiB, while the GC count decreases from 298 to 144. This ultimately results in more data migration and slower GC. Data migration competes for I/O bandwidth with foreground operations, while slow GC can cause foreground operations to stall, thus resulting in a significant decrease in performance.
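A back-of-the-envelope calculation with these numbers makes the gap explicit (approximate, using the averages reported above):
128 MiB zones: 298 GCs × 43 MiB ≈ 12.5 GiB migrated in total;
1,024 MiB zones: 144 GCs × 386 MiB ≈ 54.3 GiB migrated in total.
That is, roughly 4.3× more data is migrated overall despite the lower GC count.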
3.2 Existing Approaches
There have been several attempts to address the problem, yet existing approaches have their own limitations. Simply reducing the zone size cannot tame the challenge. Unlike widezones, smaller zones are mapped to fewer erase blocks [2]. Consequently, the number of chips that process I/O requests is limited, causing a severe performance degradation. Our experimental results (Section 5.4) further confirm this fact. Note that, in this experiment, we also achieve minimal internal mutual interference, as Reference [2] does, and from our perspective even more precisely, since we expose the internal interference of zones and directly allocate chips (Section 4.3) instead of using profiling-based allocation. Lifetime-based data placement [5] places SSTables with analogous lifetimes in the same zone. However, the SSTable's lifetime is predicted by its level in the LSM-tree, which, from our perspective, is inaccurate, particularly for high-level SSTables. As seen in Figure 4, we track the lifetimes of SSTables from various levels under the uniform and skewed (skewness = 0.99) write-only workloads. The experimental setting is the same as above. The lifetimes of most L4 SSTables range from approximately 10 s to 800 s for the uniform workload and from approximately 10 s to 4,000 s for the skewed workload. We observe that (1) with a higher skewness, the lifetime differences are larger, since the SSTables with hot key ranges are involved in compactions frequently; (2) however, for both workloads, the lifetimes of high-level SSTables can be rather short or rather long, confirming that, with or without skewness, lifetime prediction based on the level of an SSTable is difficult to achieve with high accuracy. We also evaluate workloads with other skewness values (0.8, 0.9, 0.95), and their experimental results show similar trends.
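For reference, level-based lifetime prediction boils down to a static level-to-hint mapping along the lines of the C++ sketch below (a simplified, hypothetical mapping, not the actual ZenFS code); Figure 4 shows why such a mapping is too coarse for high levels.

// Hypothetical level-to-lifetime mapping used only for illustration: deeper
// levels are assumed to live longer, and SSTables sharing a hint share a zone.
enum class LifetimeHint { Short, Medium, Long, Extreme };

LifetimeHint HintForLevel(int level) {
  if (level <= 1) return LifetimeHint::Short;   // L0/L1: rewritten quickly
  if (level == 2) return LifetimeHint::Medium;
  if (level == 3) return LifetimeHint::Long;
  return LifetimeHint::Extreme;                 // L4+: assumed long-lived, yet
                                                // Figure 4 shows lifetimes
                                                // spanning ~10 s to ~4,000 s
}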
GearDB [59, 61] and Lifetime-Leveling Compaction (LLC) [29] use zone-specific compaction policies to eliminate GC. Nevertheless, their compaction policies result in greater write amplification and longer compaction latency (Section 5.3), which hurts performance. Moreover, both of them prohibit compaction optimizations when selecting input SSTables for compactions, since they mandate a certain order of invalidating SSTables. Examples are LDC [11], which takes compaction granularity into account; LevelDB [34], which takes into account the number of times an SSTable's seek operation has failed; and RocksDB [8, 10, 19], which takes into account the SSTable's age and overlapping ratio. These optimizations cannot co-exist with GearDB or LLC.
3.3 Our Insights
The aforementioned analysis yields the following insights, which motivate our design. First, a smaller zone size can alleviate GC. By mapping a zone to a small number of erase blocks managed by a single chip, the zone size can be drastically decreased. However, this leads to under-utilized parallelism, since I/O requests to these small zones are executed one at a time by a single chip, resulting in a significant decline in I/O performance. Consequently, the second insight is that, by exploiting the multi-level structure of the LSM-tree, we can efficiently mitigate the negative effect of the reduced zone size. Moreover, we can accelerate these small zones by enhancing parallelism utilization. Based on these insights, we propose SplitZNS, which incorporates different zone sizes into the LSM-tree on ZNS SSDs.
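These insights can be summarized by the contrast in the C++ sketch below (illustrative only, with placeholder parameters; the actual design is given in Section 4): a widezone stripes erase blocks across all chips, whereas each small zone is pinned to a single chip, trading per-zone bandwidth for a far smaller zone size and interference-free resets.

#include <cstdint>
#include <vector>

constexpr uint32_t kChips = 8;  // placeholder chip count

// Widezone: erase blocks from every chip -> gigantic zone, full parallelism.
std::vector<uint32_t> WidezoneChips() {
  std::vector<uint32_t> chips;
  for (uint32_t c = 0; c < kChips; ++c) chips.push_back(c);
  return chips;
}

// Small zone: erase blocks from exactly one chip -> roughly 1/kChips the size,
// single-chip bandwidth, but no interference with zones on other chips.
std::vector<uint32_t> SmallZoneChips(uint32_t zone_index) {
  return {zone_index % kChips};
}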