CODAG: Characterizing and Optimizing Decompression Algorithms for GPUs
arXiv preprint arXiv:2307.03760, 2023
Data compression and decompression have become vital components of big-data applications to manage the exponential growth in the amount of data collected and stored. Furthermore, big-data applications have increasingly adopted GPUs due to their high compute throughput and memory bandwidth. Prior works presume that decompression is memory-bound; they dedicate most of the GPU's threads to data movement and adopt complex software techniques to hide the memory latency of reading compressed data and writing uncompressed data. This paper shows that these techniques lead to poor GPU resource utilization, as most threads end up waiting for the few decoding threads, exposing compute and synchronization latencies. Based on this observation, we propose CODAG, a novel and simple kernel architecture for high-throughput decompression on GPUs. CODAG eliminates the use of specialized groups of threads, frees up compute resources to increase the number of parallel decompression streams, and leverages the ample compute activity and the GPU's hardware scheduler to tolerate synchronization, compute, and memory latencies. Furthermore, CODAG provides a framework for users to easily incorporate new decompression algorithms without being burdened with implementing complex optimizations to hide memory latency. We validate our proposed architecture with three different encoding techniques (RLE v1, RLE v2, and Deflate) and a wide range of large datasets from different domains. We show that CODAG provides speedups of 13.46x, 5.69x, and 1.18x for RLE v1, RLE v2, and Deflate, respectively, when compared to the state-of-the-art decompressors from NVIDIA RAPIDS.
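The idea of many independent decompression streams can be illustrated with a minimal Python sketch. It assumes a simplified run-length format of (run_length, value) pairs, which is not the RLE v1 or RLE v2 encoding evaluated in the paper, and it is not CODAG's kernel; the names `rle_decode_chunk` and `decode_all` are hypothetical. The point it shows is only that each chunk decodes without reference to any other, which is the property that lets a GPU run many streams side by side.

```python
from typing import List, Sequence, Tuple

# Hypothetical simplified format: each run is (run_length, value).
Run = Tuple[int, int]


def rle_decode_chunk(runs: Sequence[Run]) -> List[int]:
    """Decode one independently compressed chunk of (run_length, value) pairs."""
    out: List[int] = []
    for length, value in runs:
        if length < 0:
            raise ValueError("run length must be non-negative")
        # Expand the run: repeat `value` exactly `length` times.
        out.extend([value] * length)
    return out


def decode_all(chunks: Sequence[Sequence[Run]]) -> List[List[int]]:
    """Decode every chunk on its own; no chunk depends on another.

    Assumption: on a GPU each chunk would map to its own decompression
    stream. A plain sequential loop stands in for that here.
    """
    return [rle_decode_chunk(chunk) for chunk in chunks]


# Three independent chunks, including an empty one.
chunks = [[(3, 7), (2, 1)], [(1, 5)], []]
decoded = decode_all(chunks)
```

Because the chunks share no state, the loop in `decode_all` could be replaced by any parallel map without changing the result.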