6.7 Scalability Experiments
Task size in parallel. We analyze the impact of the parallel compaction task size on the throughput of gLSM. Table 6 lists the six SSTable sizes configured in the experiments. The size of an SSTable affects the throughput: at a given parallel task granularity, a large SSTable puts a very large amount of data in each task, in turn creating more data-transfer and write-back operations, which are time-consuming. Furthermore, as the SSTable size grows, the number of SSTables per level shrinks, leading to frequent compaction on the CPU coupled with a decreased overall throughput. The results in Table 6 show the impact of parallel task size on throughput for the six block sizes: the effect of block size on throughput is within 4%, a range attributable to systematic error.
Impact of thread-grid size on performance. The thread grid is a three-dimensional structure that divides tasks in gLSM. We set the lengths of the x- and y-dimensions to one each and varied the z-dimension over lengths of 4, 8, 16, 32, 64, and 128 to study the effect of task division on throughput. Figure 16(a) shows that the thread-grid size has a limited effect on throughput: although each task is divided according to a different grid size, the overall parallel granularity remains unchanged. The logical partitioning merely assigns threads to each task rather than affecting the occupancy of the physical SMs. As a result, the grid size has a limited impact on the throughput.
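A minimal CUDA sketch of such a launch follows; the kernel name and body are illustrative placeholders, not gLSM's actual code, and the 256-thread block size is an assumed value:

    // The x- and y-dimensions are fixed at 1; only the z-dimension varies.
    // The same taskBytes of work is covered for every z, so z only changes
    // how the task is logically partitioned, not the work per task.
    __global__ void compactionTask(const char *in, char *out, size_t taskBytes) {
        size_t slice = taskBytes / gridDim.z;          // bytes per z-slice
        size_t base  = (size_t)blockIdx.z * slice;
        for (size_t i = threadIdx.x; i < slice; i += blockDim.x)
            out[base + i] = in[base + i];              // placeholder work
    }

    void launchTask(const char *dIn, char *dOut, size_t taskBytes, int z) {
        dim3 grid(1, 1, z);   // z set to 4, 8, 16, 32, 64, or 128
        compactionTask<<<grid, dim3(256)>>>(dIn, dOut, taskBytes);
    }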
Impact of Offloading Thresholds. The offloading threshold determines the number of tasks to be executed in parallel on the GPGPU. We set six thresholds to investigate their effect on throughput. Figure 16(b) shows that a large offloading threshold boosts throughput because the number of parallel tasks grows, improving compaction efficiency. A large threshold also reduces the number of data-transfer operations for a given number of tasks, thereby lowering the total transfer time.
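A host-side sketch of the decision, assuming a hypothetical task queue and helper names (gLSM's actual interfaces are not shown in the paper):

    #include <vector>

    struct CompactionTask { /* input SSTables, output buffers, ... */ };

    // Hypothetical helpers standing in for gLSM's CPU path and GPGPU pipeline.
    void compactOnCpu(const CompactionTask &t);
    void batchCompactOnGpu(const std::vector<CompactionTask> &batch);

    // Offload only when enough tasks have accumulated: a larger threshold
    // means more tasks per launch and fewer host-to-device transfers per task.
    void maybeOffload(std::vector<CompactionTask> &pending, size_t threshold) {
        if (pending.size() >= threshold) {
            batchCompactOnGpu(pending);   // one batched transfer, parallel kernels
        } else {
            for (const auto &t : pending) compactOnCpu(t);
        }
        pending.clear();
    }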
Impact of KV Separation and GPGPU-side Sort Modules on Performance. We investigate a baseline version of gLSM in which the GPGPU has neither the key-value separation module nor the GPGPU sort module, denoted gLSM-. Figure 16(c) shows that the throughput of gLSM- declines, an expected outcome given that its value size is around 128-256 B. The GPGPU sort module fully utilizes GPGPU computing resources. To shorten access latency during sorting operations, the key-value separation module reduces the overhead of data-movement and copy operations. The findings presented in Section 6.3 suggest that without these two modules, LUDA's computing resource consumption in the GPGPU is inflated by a factor of 2.1\(\times\) under the 4-KB-value workloads (see Figures 12 and 13). Figure 16(c) shows that gLSM improves throughput by up to 82.6% over gLSM-: the two modules play a crucial role in optimizing compaction performance in GPGPU-enabled KV stores.
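A condensed sketch of the key-value separation idea using Thrust's device-side sort, which dispatches to a GPU radix sort for integer keys; the array names are illustrative, not gLSM's actual interfaces:

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <cstdint>

    // Sort compact (key-prefix, value-index) records on the GPGPU instead of
    // moving full KV pairs: the 128-256 B values never travel through the
    // sorting kernels, and values are gathered only after the sort.
    void sortSeparatedKeys(thrust::device_vector<uint64_t> &keyPrefixes,
                           thrust::device_vector<uint32_t> &valueIdx) {
        thrust::sort_by_key(keyPrefixes.begin(), keyPrefixes.end(),
                            valueIdx.begin());
    }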
Impact of GPGPU Resources on Performance. We incorporate a K80 GPGPU with fewer resources into gLSM to delve into the impact of GPGPU resources on performance. The K80 differs from our P100 GPGPU in the number of CUDA cores and memory capacity. gLSM optimizes the utilization of GPGPU resources and boosts compaction performance even though the K80 has weaker resources; the K80-based gLSM performs similarly to the P100-based gLSM. The results plotted in Figure 16(d) show that the K80-based gLSM offers stable performance: gLSM remains remarkably stable across GPUs with diverse computing resources. The key-value separation module in gLSM not only conserves resource costs but also fully exploits the GPGPU to speed up performance at the expense of some GPGPU resources. Overall, gLSM delivers better performance on the P100 than on its K80 counterpart. gLSM utilizes the key-value separation module to cut down the number of resource-consuming access operations. In addition, the parallel encoder-and-decoder controller enables parallel encoding at the block level, averting degradation in GPGPU performance under heavy access and control tasks. Consequently, gLSM vastly reduces the resource cost needed to expedite parallel compaction, and it can flexibly run on various GPGPUs to achieve considerable acceleration of KV-store performance under low resource usage.
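As a rough illustration of the block-level parallel encoding idea, a sketch in which one CUDA thread block encodes one data block; the per-block transform is reduced to a placeholder copy, and the one-block-per-data-block mapping is our assumption, not gLSM's documented layout:

    // One thread block per data block keeps control flow coarse-grained, so
    // per-block access and control overhead stays bounded under heavy loads.
    __global__ void encodeBlocks(const char *raw, char *encoded,
                                 size_t blockBytes) {
        const char *src = raw     + (size_t)blockIdx.x * blockBytes;
        char       *dst = encoded + (size_t)blockIdx.x * blockBytes;
        for (size_t i = threadIdx.x; i < blockBytes; i += blockDim.x)
            dst[i] = src[i];   // placeholder for the real per-block encoding
    }
    // Launch with one thread block per 4-KB data block:
    // encodeBlocks<<<numDataBlocks, 128>>>(dRaw, dEncoded, 4096);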
We conducted an experiment based on the K80 and P100 GPGPUs to validate the robustness of gLSM. We find that the performance on the K80 is similar to that on the P100 because the GPGPU-enabled collaborative compaction procedure does not fully occupy GPGPU resources; see Figures 12 and 13. In addition, the computational capability (TFLOPS) of the P100 is higher than that of the K80 [24, 25]. The GPGPU-side compaction subsystem of gLSM costs few computational resources, whereas the throughput is determined by the computational capability and all modules of gLSM. With the same data volume, neither the P100 nor the K80 is fully loaded; hence, the P100-based gLSM has a throughput similar to that of the K80-based gLSM.
Comparison of Speedup Ratios (SR) of gLSM and LevelDB-FACE. LevelDB-FACE [33] is a heterogeneous system that accelerates KV-store performance using an FPGA. We configure a 512-B value size to mimic that of LevelDB-FACE, and Figure 16(e) plots the speedup ratios of gLSM and LevelDB-FACE. We report the baseline system throughput when testing the speedup ratios, and gLSM has a higher baseline system throughput than its counterpart. gLSM exhibits stable speedup ratios as the data volume increases, whereas this metric of LevelDB-FACE is unstable under workloads with small-sized data owing to the setting of the memory-table size. For large data volumes, LevelDB-FACE's speedup ratio declines significantly. In addition, we show the variance (\(\sigma^{2}\)) of the speedup ratios of the two KV stores: the variance for gLSM is smaller than that for LevelDB-FACE. This result shows that gLSM is more reliable than LevelDB-FACE in production environments where system stability is critical.
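For reference, the variance is computed over the per-data-volume speedup ratios \(SR_i\); the population form is shown below (whether the sample form was used instead is not stated in the paper):

\[ \overline{SR} = \frac{1}{n}\sum_{i=1}^{n} SR_i, \qquad \sigma^{2} = \frac{1}{n}\sum_{i=1}^{n}\left(SR_i - \overline{SR}\right)^{2}. \]

A smaller \(\sigma^{2}\) therefore means the speedup ratio fluctuates less across data volumes.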
Impact of variable-sized keys on the performance of gLSM. We conducted two types of experiments based on YCSB-C to validate gLSM under workloads with variable-sized keys. First, we set the key size to 8 B, 16 B, 24 B, 32 B, 48 B, and 64 B with a 128-B value size and a 5-GB data volume. Second, since no existing tool generates variable-sized keys, we created variable-sized keys with sizes ranging between 6 B and 64 B. We again employ 5 GB of data, containing about 32,936,866 KV pairs. We measure the throughput of the CPU and GPGPU, the sorting time, and the ratio of sorting time to compaction time under YCSB-C.
As shown in Table 7, the comparison count is two when the key size is 8 B; that is, the number of comparison operations exceeds one. The reason is that keys are padded to the maximum key length, rounded up to an integer multiple of 8 B, to enhance sorting performance: every 8-B chunk of a key is converted to an unsigned 64-bit integer, so 8-B keys are sorted at most once and 32-B keys at most four times. gLSM then performs one more comparison, on the sequence number in the internal key, to ensure that identical keys are ordered correctly by sequence. As shown in Table 7, when the key size increases from 16 B to 32 B, the throughput of GPGPU compaction increases slightly because the larger keys enlarge the amount of data; the number of sorting operations rises from 3 to 5, with only a slight difference in sorting time. The GPGPU-side compaction throughput declines when the key size grows to 48 B or larger because the number of comparisons and the sorting time increase significantly, which in turn worsens compaction time and throughput.
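The chunked comparison can be sketched as a device-side comparator; this is an illustration rather than gLSM's actual code, and it assumes keys are zero-padded to a multiple of 8 B and stored big-endian per chunk so that unsigned comparison preserves lexicographic order. The sequence tie-break follows LevelDB's internal-key convention (newer versions sort first):

    #include <cstdint>

    // Compare two internal keys chunk by chunk as unsigned 64-bit integers,
    // then break ties with the sequence number, as described above.
    __host__ __device__
    int compareInternalKeys(const uint64_t *a, const uint64_t *b, int chunks,
                            uint64_t seqA, uint64_t seqB) {
        for (int i = 0; i < chunks; ++i) {  // up to ceil(keyBytes / 8) compares
            if (a[i] < b[i]) return -1;
            if (a[i] > b[i]) return 1;
        }
        // One extra comparison: identical user keys are ordered by sequence.
        if (seqA == seqB) return 0;
        return seqA > seqB ? -1 : 1;        // newer version sorts first
    }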
As shown in Figure 17(a), the x-axis represents the key sizes (ranging between 6 B and 64 B) and the y-axis denotes the number of keys of each size. We counted the number of keys of each size: the minimum count, 556,704, occurs for 28-B keys, and the maximum count, 560,354, for 30-B keys. Since the counts of these variable-sized keys are essentially uniform, we take 35 B as the average size of the variable-sized keys; sorting such a key then requires six comparison operations. As listed in Table 7, the GPGPU throughput of gLSM is 112.3 MB/s, which lies between that of gLSM under YCSB-C with 32-B keys and with 48-B keys. For these two fixed key sizes, the numbers of comparison operations are five and seven, respectively, while the average number of comparisons for the variable-sized keys is six, which likewise lies between five and seven. When the number of comparisons grows, the sorting time is enlarged. This impact is marginal for small keys: as the key size increases, the sorting time and throughput change only slightly. For large keys, such as 48 B and above, system performance drops significantly because the total sorting time grows, downgrading the throughput.
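The comparison counts in Table 7 follow directly from the chunked scheme described earlier: one comparison per 8-B chunk plus one for the sequence number,

\[ N_{\mathrm{cmp}} = \left\lceil \frac{\text{key size}}{8\ \text{B}} \right\rceil + 1, \]

which yields 2 for 8-B keys, 3 for 16-B keys, 5 for 32-B keys, 7 for 48-B keys, and 6 for the 35-B average of the variable-sized keys.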
In addition, we conducted experiments with fixed- and variable-sized keys to validate the advantages of gLSM over RocksDB, LevelDB, PCP, and LUDA under these workloads. The other parameters are consistent with those in the gLSM test. Figure 17(b) presents the results. Under YCSB-C with variable-sized keys, the throughput of gLSM matches the case of a key size between 32 B and 48 B in the fixed-size test. With fixed-sized keys, the throughput of LevelDB and PCP grows as the key size increases because larger keys expand the data volume, increasing throughput. When the key size reaches 64 B, the throughput of PCP exceeds that of gLSM, but both the CPU and GPGPU throughput of gLSM are larger than those of PCP under variable-sized keys. When the key size is 48 B or larger, the throughput of RocksDB deteriorates, which may be related to the increase in the number of comparisons. In summary, we validate that gLSM outperforms the other KV stores under workloads with variable-sized keys.
Impact of Disk I/O on system performance. To validate whether disk I/O is the system bottleneck, we select three configurations of the tested platform and conduct experiments: (1) CPU + SATA SSD, (2) CPU + GPGPU + SATA SSD, and (3) CPU + GPGPU + PCIe SSD. We use a 480-GB Intel Optane PCIe SSD 900P Series and a 500-GB Samsung SSD 870 EVO in this test. The workload is configured with a 1-KB value and 5 GB/50 GB of data.
With CPU + SATA SSD, we quantified the CPU compaction throughput of LevelDB under a diversity of workloads. For the other two platform configurations, we measured the CPU and GPGPU compaction throughput of gLSM. Comparing gLSM with LevelDB, the baseline solution, we conclude that CPU + SATA SSD imposes a computational bottleneck on LevelDB, which gLSM significantly alleviates through the heterogeneous GPGPU. In this experiment, the performance of CPU + SATA SSD using gLSM cannot be measured accurately because gLSM performs compaction on the CPU whenever the data volume involved in compaction does not reach the offloading threshold, so the advantage of conducting compaction on the GPGPU is not visible. For large data volumes, we offloaded the compaction to the GPGPU, achieving collaborative processing. In the experiments based on CPU + GPGPU + SATA SSD, we observed performance improvements in the form of increased CPU throughput. This trend is expected because offloading compaction tasks with large data volumes to the GPGPU improves overall performance. We therefore employed gLSM in the second experiment, whose results are listed in Table 8.
Comparing the first and second experiments shows that computation is the performance bottleneck of the first configuration, while comparing the second and third experiments shows that disk I/O is the bottleneck of the second. In the second experiment, some compaction tasks are offloaded to the GPGPU, and disk I/O becomes the primary bottleneck of the compaction procedure. The experimental results unveil that the GPGPU reshapes compaction performance thanks to its abundant parallel resources, after which disk I/O becomes the main compaction bottleneck. In particular, the PCIe SSD-based system achieves high overall system performance and has a clear performance edge over the SATA SSD-based system.
We believe there remain opportunities to optimize the computation process on the CPU and GPGPU sides. For example, we may bypass the CPU and read data from the SSD directly by virtue of GPGPU direct storage technology. Although the computational resources of the CPU can then no longer be used to manipulate the data, the data transmission path is shortened, lowering latency. This GPGPU direct storage technology has not yet been widely adopted.
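As a sketch of what such a direct path could look like with NVIDIA's cuFile API (GPUDirect Storage); this is not part of gLSM, the file path and sizes are illustrative, and error handling is omitted:

    #define _GNU_SOURCE              // for O_DIRECT on glibc
    #include <cufile.h>
    #include <fcntl.h>
    #include <unistd.h>

    // Read an SSTable region from the SSD straight into GPGPU memory,
    // bypassing a CPU bounce buffer. A sketch of the direct-path idea only.
    void directReadToGpu(const char *path, void *devBuf, size_t bytes, off_t off) {
        cuFileDriverOpen();
        int fd = open(path, O_RDONLY | O_DIRECT);
        CUfileDescr_t descr = {};
        descr.handle.fd = fd;
        descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
        CUfileHandle_t fh;
        cuFileHandleRegister(&fh, &descr);
        cuFileBufRegister(devBuf, bytes, 0);      // register the device buffer
        cuFileRead(fh, devBuf, bytes, off, 0);    // DMA: SSD -> GPU memory
        cuFileBufDeregister(devBuf);
        cuFileHandleDeregister(fh);
        close(fd);
        cuFileDriverClose();
    }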
Impact of memory usage on the performance of gLSM. To investigate the maximum memory usage of gLSM, we configured different value sizes and measured the maximum GPGPU and CPU memory usage of gLSM. We employed a 16-B key and 5 GB of data in this test.
As listed in Table 9, the maximum GPGPU memory usage of gLSM declines as the value size increases because larger values reduce the number of KV pairs in an SSTable, shrinking the amount of data involved in GPGPU computation. The maximum CPU memory usage of gLSM likewise decreases as the value size grows from 32 B to 512 B, for a reason similar to the GPGPU case. When the value size is 1024 B, however, the maximum CPU memory usage rises because the size of the data blocks involved in computation and transmission is enlarged. Notably, the data block size is generally configured as 4096 B, and the minimum data block size in gLSM is 3687 B. For 512-B values, the data block size is (16+8+512) \(\times\) 7 = 3752 B, where (16+8) B is the internal key size; this block size is larger than 3687 B. For 1024-B values, packing 3 KV pairs gives a data block size of (16+8+1024) \(\times\) 3 = 3144 B, which is smaller than the 3687-B minimum, so we must insert a fourth KV pair, making the data block 4192 B, which is larger than the block size with 512-B values. Hence, the maximum CPU memory usage surges when the value size is 1024 B.
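The packing rule in this paragraph can be reproduced in a few lines; a sketch, where 3687 B is the minimum data-block size quoted above and 24 B is the internal key size for a 16-B user key:

    #include <cstdio>

    // Pack KV pairs into a data block until it reaches the 3687-B minimum,
    // reproducing the arithmetic above: 7 pairs (3752 B) for 512-B values,
    // 4 pairs (4192 B) for 1024-B values.
    int pairsPerBlock(int valueBytes, int internalKeyBytes = 24,
                      int minBlock = 3687) {
        int pairs = 0, size = 0;
        while (size < minBlock) {
            size += internalKeyBytes + valueBytes;
            ++pairs;
        }
        return pairs;
    }

    int main() {
        printf("512-B values: %d pairs\n", pairsPerBlock(512));    // prints 7
        printf("1024-B values: %d pairs\n", pairsPerBlock(1024));  // prints 4
    }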
To delve into the impact of memory limitations on system performance, we limited the CPU-side memory usage of the system to 2 GB, 4 GB, 8 GB, and 16 GB, as well as leaving it unlimited, and measured the resulting changes in system performance. In this test, we configured a 16-B key, 1-KB value, and 35 GB of data. We draw the following observations from Table 9:
First, when the memory size increases from 2 GB to 16 GB, the throughput of both the CPU and the GPGPU rises. This result reveals that memory size affects system performance; to achieve better performance, the system should be provisioned with large memory. Second, when memory is unlimited, the CPU's throughput improves, but the GPGPU's throughput shows no obvious change and is similar to that under the 16-GB limit. This result unveils that GPGPU performance is not driven by the available CPU memory: GPGPU memory is never limited in this test, and restricting CPU memory has little impact on GPGPU-side throughput. Third, regardless of the memory limitation, the GPGPU-side throughput is always higher than that of the CPU: the GPGPU may be more effective for processing large-scale data in KV stores.
Studying the GPGPU utilization of gLSM. When the number of compaction tasks exceeds the predefined threshold, the compaction tasks are offloaded to the GPGPU in order to fully utilize its parallel computing resources. Large values consume few GPGPU resources and thus cannot fully exploit the advantage of parallel computing; meanwhile, the performance bottleneck lies in disk I/O in this case. In addition, the GPGPU has an advantage in processing large-scale data, whereas for small-scale data the time spent triggering the GPGPU and allocating threads may exceed the computation time. In that case, we perform compaction on the CPU, and the GPGPU utilization remains low. We study the GPGPU utilization of gLSM by recording it every 0.5 seconds. Figure 18 illustrates the GPGPU utilization over time under DB_Bench configured with a 16-B key, 128-B value, and 5-GB data volume.
We draw the following observations from the results. First, gLSM has not yet hit its performance plateau with the available GPGPU resources. This is expected because a design goal of gLSM is to reserve a portion of GPGPU resources, through its optimization methods, for urgent applications that depend on GPUs. Analyzing the instantaneous GPGPU utilization, we discover that its highest value is 100%, which implies that the compaction tasks can occupy all the GPGPU cores. Utilization during some periods, such as around 98 seconds, does not reach the peak: because we sample utilization at 0.5-second intervals and the GPGPU computes quickly, the GPGPU may still be computing or may have just finished at a sample point, so the utilization peak may fall just before or after the recorded instant. Second, gLSM thoroughly utilizes the GPGPU resources only periodically and cannot keep the GPGPU busy consistently under the workload. The GPGPU utilization is 0% whenever the CPU does not trigger an offloading task: the GPGPU sits idle while gLSM conducts disk I/O, CPU computation, and other processes. Thanks to fast GPGPU computation, most of the computation completes on the GPGPU within one second; therefore, the peak GPGPU utilization does not last long, and compaction, disk I/O, and CPU computation account for most of the time. Our calculation method excludes the periods without GPGPU-side compaction tasks and considers only the time during which GPGPU-side compaction tasks are invoked. If we instead counted the duration in which the GPGPU performs no compaction, the GPGPU utilization would drop to 10% or less.
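The sampling loop can be reproduced with NVML; a minimal sketch, where device index 0 and the 5-minute sample count are assumed values and error checks are omitted:

    #include <nvml.h>
    #include <cstdio>
    #include <unistd.h>

    // Sample GPGPU utilization every 0.5 s, as in the measurement above.
    // A sample reflects the instant it is taken, which is why a short
    // compaction burst can fall between two consecutive samples.
    int main() {
        nvmlInit();
        nvmlDevice_t dev;
        nvmlDeviceGetHandleByIndex(0, &dev);
        for (int i = 0; i < 600; ++i) {            // 600 samples = 5 minutes
            nvmlUtilization_t util;
            nvmlDeviceGetUtilizationRates(dev, &util);
            printf("%.1f s  gpu util: %u%%\n", i * 0.5, util.gpu);
            usleep(500000);                        // 0.5-second sampling period
        }
        nvmlShutdown();
        return 0;
    }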
We also measured the average GPGPU utilization of gLSM under workloads with different value sizes; see Table 10.
As shown in Figure 16(d), when the value size is smaller than 512 B, there are a large number of KV pairs in an SSTable, and GPGPU computing resources are used to perform encoding and sorting operations for these KV pairs. A powerful GPGPU then achieves higher throughput than a low-power one. Table 10 shows that the average GPGPU utilization of gLSM is low. In Figure 18, the GPGPU is idle during some periods, but the maximum GPGPU utilization of gLSM reaches 100%. The peak GPGPU utilization is much higher than the average because gLSM experiences an array of computation-intensive phases. A less powerful GPGPU, however, may not provide high performance for the data-intensive computational tasks in our system. Therefore, gLSM benefits from a powerful GPGPU that delivers high performance during computation-intensive phases.
However, when the value size is large (e.g., larger than 512 B), the throughput of the less powerful GPGPU exceeds that of the more powerful one. This is because the amount of data offloaded to the GPGPU becomes small under workloads with large values, so the parallel computing resources of the GPGPU cannot be fully utilized; the time the GPGPU spends on resource allocation exceeds the gain brought by GPGPU computation. It is therefore impossible to fully utilize the parallel resources of the more powerful GPGPU, whereas the less powerful GPGPU suffices. In Figure 16(d), we observe that gLSM, powered by a variety of optimization methods, keeps GPGPU utilization low while still carrying out highly efficient compaction without impairing system performance, even on a low-performance GPGPU.
Impact of Large Data Volumes on the Performance of gLSM. In this experiment, we study the impact of large data volumes on the compaction throughput of the tested KV stores. We employed the Load workload (100% write) in YCSB-C with a 100-GB data volume, 1-KB value, and the default key size (ranging between 22 B and 24 B). As listed in Table 11, in CPU compaction throughput gLSM (126.8 MB/s) significantly outperforms LUDA (83.9 MB/s), PCP (74.4 MB/s), RocksDB (102.9 MB/s), and LevelDB (32.1 MB/s), showing that gLSM achieves an efficient implementation on the CPU. When compaction tasks are offloaded to the GPGPU, gLSM (1004.2 MB/s) outperforms all other KV stores under the workload. The results reveal that gLSM fully exploits the parallel computing resources of the GPGPU and achieves significant performance improvement. gLSM offers superior performance for data processing on both the CPU and the GPGPU, which makes it apt for diverse computation-intensive applications.
Performance of gLSM under update workloads. As shown in Figure 19, the experimental results show that gLSM achieves the best throughput under all workloads. The performance of all tested KV stores decreases as the data size grows from 5 GB to 50 GB; however, gLSM exhibits a smaller decrement than the alternative KV stores under the workload with 90% writes and 10% updates. This is because our mechanism of offloading compaction to the GPGPU improves write performance and enables gLSM to maintain stable performance under the workload. Under the 50-GB workload, the performance of gLSM trends upward under the 90%-write/10%-update workload compared with the 50%-write/50%-update workload; see Figure 19(b). The results demonstrate the superior performance of gLSM. LUDA also offloads compaction tasks to the GPGPU, but its performance is lower than that of gLSM because, when triggering compaction, LUDA still performs merge operations on the CPU, the most time-consuming operation in compaction. Therefore, the compaction time of LUDA increases when workloads carry a large number of writes, and the merge operations ultimately affect its performance. For example, the performance of LUDA under the 90%-write/10%-update workload is lower than under the 50%-write/50%-update workload, indicating that slow compaction degrades LUDA's performance. GPGPU-empowered gLSM also outperforms the CPU-multithreaded PCP under the 5-/50-GB workloads. This is because GPGPUs provide sufficient computing resources, and our techniques in gLSM meet the performance requirements of KV stores when dealing with large-scale data.
Performance of gLSM under YCSB-C. In this section, we study the performance of gLSM, LUDA, PCP, RocksDB, and LevelDB under YCSB-C with a 24-B key, 1-KB value, and 50-GB data volume. RocksDB and LevelDB run with their default configuration parameters. There are six types of workloads in YCSB-C, whose characteristics are listed in Table 12. We use the 480-GB Intel Optane 900P Series PCIe SSD as the storage device.
As shown in Figure 20, gLSM delivers higher performance than LUDA, which is also designed around a GPGPU. Compared with LUDA, gLSM improves throughput by 47.7%, 22.8%, 91.1%, 79.6%, 56.6%, 200%, and 86% under workloads Load, A, B, C, D, E, and F, respectively. This is because the latency of LUDA's CPU-based merge operations is higher than that of gLSM's GPGPU-empowered radix sort, which increases compaction time and disk I/O. Read I/Os must wait for compaction to complete, so their waiting time increases. gLSM, in contrast, shortens the waiting time for reads from the disk and lessens the impact of compaction on read performance. The impact of the different sorting algorithms on performance also manifests under write- and read-intensive workloads.
Under read-intensive workloads (e.g., B and C), gLSM and RocksDB exhibit similar performance, with a difference of only 0.3 KTPS between the two KV stores. This is because gLSM primarily improves compaction and write performance, whereas RocksDB offers techniques (e.g., the block cache) to optimize read performance, allowing RocksDB to read data from memory instead of the disk. Nevertheless, gLSM still achieves better performance under read-intensive workloads; for example, it outperforms RocksDB by 3.6%, 3.2%, and 12.2% under workloads B, C, and D, respectively. The reason is that gLSM accelerates compaction by offloading it to the GPGPU, and high-performance compaction reduces disk I/O latency, lowering I/O costs and improving read performance. Meanwhile, the high-speed data transmission of the PCIe SSD alleviates the I/O bottleneck, which lowers the compaction time of gLSM and reduces I/O competition with reads.
For mixed read-write workloads (e.g., A and F), the compaction mechanism of gLSM improves compaction performance and reduces high-latency disk I/Os, so the mixed read and write operations complete quickly, lowering data-processing latency. Highly concurrent applications issue read and write requests simultaneously; in this case, gLSM can process data efficiently and maintain stable performance, avoiding the degradation caused by compaction. For example, under workload A, gLSM improves throughput by 22.8%, 7.7%, and 25% over LUDA, RocksDB, and LevelDB, respectively. Under F, gLSM achieves higher throughput than LUDA, RocksDB, and LevelDB by 86%, 33.3%, and 60%, respectively, and outperforms PCP by 12.3\(\times\) in terms of throughput.
Under workload E, gLSM maintains excellent write, read, and range-query performance. For range queries, gLSM outperforms LUDA, RocksDB, and LevelDB by 200%, 200%, and 22.2%, respectively; in particular, it achieves 7.2\(\times\) higher throughput than PCP. This is because gLSM enhances compaction performance through the GPGPU's parallel computing resources. The fast compaction procedure quickly merges duplicate KV pairs on the disk and keeps the data ordered, which is favorable for range queries; read performance is high when the data is ordered.