1 Introduction
Query compilation is an effective means of improving the execution time of data-intensive SQL queries [Neumann
2011; Mostak
2014; Chen et al.
2018a,
2020]. As query programs are relatively short, however, the cost of compilation itself often exceeds that of execution and may even be detrimental to overall performance. This problem is especially evident in GPU-based approaches that depend on expensive JIT compilation from manufacturer toolchains when lowering to the exact target architecture [Breß et al.
2017; Krolik et al.
2021]. r3d3, for example, a GPU query compiler and database system, uses the HorseIR array-based representation [Chen et al.
2018a] to greedily identify and fuse GPU kernels in queries based on HyPer query plans, translates them to PTX, NVIDIA’s low-level intermediate representation, and generates CUDA binaries using the proprietary closed source assembler [Krolik et al.
2021]. When data are cached in GPU memory and the query pre-compiled, this proves effective, but it falls short on end-to-end performance, as shown in Figure
1, primarily due to the high cost of NVIDIA's assembler. Online query compilation systems targeting GPU hardware must therefore bypass vendor toolchains in favor of domain-specific pipelines that are judicious in their optimizations and focus on reducing overhead. Other recent approaches to GPU database systems focus instead on exploiting tensor units [Hu et al.
2022; He et al.
2022], optimizing parallel algorithms [Shanbhag et al.
2020], and robustness [Lee et al.
2021].
In this work, we extend the existing r3d3 system shown in Figure
2 to form an optimized end-to-end and open source pipeline for compiling and assembling SQL queries for NVIDIA GPUs. Notably, we address a major bottleneck in overall query performance through the design and implementation of a runtime-suitable assembler that generates code comparable to optimizing compilers without the typical overhead. This best-of-both-worlds approach acts as a drop-in replacement for the costly proprietary toolchain and is based on one key principle: profitably trading slowdown in execution for improvements in compilation. While challenging for general scientific computation, this work exploits the comparably narrow patterns found in query compilation and the array-based HorseIR representation (e.g., binary operations and well-defined algorithms like join) and pre-optimized
intermediate representation (IR) generated by r3d3 to eschew traditional compiler passes like loop transformations and redundancy elimination. We employ a balanced and low-overhead approach that extends well-known techniques for instruction scheduling, register allocation, code generation, and assembly to runtime query compilation for GPUs. We thus couple the execution speed and algorithmic optimizations of r3d3 (e.g., kernel fusion) with a correspondingly optimized compilation pipeline. Additionally, support for any program expressible in array-based form naturally extends from our work, although further optimizations may be necessary for best performance as we assume the input programs are optimized algorithmically and only require simple lowering (e.g., no register spills). More recent work has looked at the possible support of user-defined functionality in HorseIR queries [Chen et al.
2021].
We compare our approach against both interpreted and compiled state-of-the-art CPU and GPU databases and evaluate the performance tradeoffs and breakdown of our proposed system [BlazingSQL Inc.
2021; Mostak
2014; Chen et al.
2018a; Idreos et al.
2012]. Results indicate that our pipeline is effective at improving both compilation and end-to-end time against all comparison systems, while maintaining high speedups in cached scenarios. Importantly, in comparison to NVIDIA’s optimizing assembler, we vastly improve compilation time while showing only minor slowdown in kernel performance. Measured against the non-optimizing variant, our proposed pipeline is notably better balanced for query compilation and maintains significant compilation speedup while also improving execution. Additionally, we demonstrate that our approach remains beneficial even as data sizes grow and the relative importance of fixed-cost compilation decreases against that of execution. Contributions of our work include the following:
•
A runtime-optimized and open source assembler for NVIDIA GPUs that minimizes compilation cost while maintaining high performance on SQL queries. This removes a major limitation on the practical use of query compilation for GPUs.
•
Exploration of low-level compilation steps and challenges for NVIDIA GPUs, defining and demystifying the necessary compilation pipeline for query programs.
•
Evaluation on the complete TPC-H benchmark, showing performance improvements on end-to-end scenarios over both CPU and GPU database systems at multiple data sizes.
2 Background
A GPU is a multi-core execution unit optimized for throughput computation [Lindholm et al.
2008]. Abstractly, it follows a hierarchical design for both computation and memory, as shown in Figure
3(a). At the lowest level, a GPU consists of thousands of
processing elements (PE), each of which computes the result of a single thread. PEs are then grouped into
streaming multiprocessors (SM), which in turn form the GPU. Memory is likewise hierarchical, consisting of private registers for PEs, shared memory for each SM, and globally accessible device memory. The GPU is connected to the host system and memory via the PCIe bus. A GPU program defines the execution of a single thread (SIMT), as well as synchronization within a
thread block—inter-block communication is limited. During execution, the collection of thread blocks forms the
grid. At runtime, each thread block is assigned to a multiprocessor and executed on the processing elements. More recently,
thread block clusters give an additional layer of granularity for sharing and synchronization [NVIDIA
2021d].
While the abstract view is sufficient for most high-level applications, to maximize performance and produce machine code, a lower-level view is necessary. Our approach targets two recent NVIDIA architectures, Pascal [NVIDIA
2016] and Ampere [NVIDIA
2020], while remaining extensible to other GPU platforms and families that follow similar principles. We show the layout of the Ampere architecture in Figure
3(b), describe its important features below and note differences with Pascal.
GPU multiprocessors are divided into partitions, each of which is responsible for the execution of a set of
warps. A warp is a collection of 32 consecutive threads that are part of the same thread block. At runtime, warps are assigned to partitions by the hardware, as are the scheduling tasks for the GPU. Each partition contains a
warp scheduler, functional units, and register memory, while shared memory is accessible to the entire multiprocessor.
Each cycle, the warp scheduler picks the next available warp and the
dispatch units issue the next instruction to the appropriate functional unit. In the case of Pascal, dual issue is supported for some instruction pairs and improves instruction-level parallelism. Functional units include cores (e.g., single-precision), double precision, special function units (e.g., math functions), and load-store. Integer computation is supported by all cores (e.g., Pascal) or a subset (e.g., Ampere). Other core types like tensors are not considered in our approach but have been evaluated in work like TCUDB [Hu et al.
2022]. Note that the exact number of each functional unit is architecture specific and impacts the throughput at which instructions can be issued [NVIDIA
2021d]. When the number of units is less than the size of a warp, multiple cycles are required to issue the instruction. Note that exact architectural details may differ from this model, but it provides sufficient accuracy to implement a high-performance and competitive assembler.
2.1 Compilation Pipeline
As part of the CUDA toolkit, NVIDIA provides a complete compilation pipeline, translating from CUDA code to an assembled binary in two stages shown in Figure
4 [NVIDIA
2021c]. CUDA code written by the programmer is first compiled to PTX, a low-level NVIDIA-developed IR for GPUs that targets a virtual architecture [NVIDIA
2021e]. This step can be performed on the developer’s machine as the code is mostly platform independent. When the exact device is known, the stage 2 compiler produces a machine-specific binary representation (SASS) that can be executed on the GPU [NVIDIA
2021a]. We adopt NVIDIA’s terminology of
assembler for this last stage, although it more resembles a backend compiler responsible for low-level optimizations including instruction scheduling and register allocation.
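For concreteness, the stage-2 step can be exercised at runtime through the CUDA driver API, whose JIT invokes the embedded ptxas when handed PTX; the sketch below (hypothetical helper, error handling mostly elided) shows the step that rNdN replaces with its own assembler.

    #include <cuda.h>
    #include <cstdint>
    #include <cstdio>
    #include <string>

    // Sketch: hand stage-1 PTX to the CUDA driver, whose embedded stage-2
    // assembler (ptxas) produces SASS for the installed GPU. The PTX string is
    // assumed to come from a frontend such as r3d3; jitCompile is hypothetical.
    CUmodule jitCompile(const std::string& ptx) {
        cuInit(0);
        CUdevice device;
        cuDeviceGet(&device, 0);
        CUcontext context;
        cuCtxCreate(&context, 0, device);

        // Optimization level forwarded to the driver's ptxas (0-4, 4 is the default).
        CUjit_option options[] = { CU_JIT_OPTIMIZATION_LEVEL };
        void* values[] = { reinterpret_cast<void*>(static_cast<intptr_t>(3)) };

        CUmodule module = nullptr;
        CUresult result = cuModuleLoadDataEx(&module, ptx.c_str(), 1, options, values);
        if (result != CUDA_SUCCESS) {
            std::fprintf(stderr, "stage-2 JIT compilation failed (%d)\n", static_cast<int>(result));
        }
        return module;
    }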
4 Register Allocation
PTX is an intermediate representation targeting a virtual machine with unlimited registers [NVIDIA
2021e]. We must therefore implement a register allocation scheme that considers real GPU properties. In particular, GPUs have a large number of general-purpose 32-bit registers per thread (255), a smaller number of predicate/Boolean registers (7), and uniform registers [Burgess
2019,
2020; NVIDIA
2021a] in more recent architectures. This reduces the need for expensive allocators and spilling heuristics, although we must still be mindful due to the impact on occupancy (number of active warps) [NVIDIA
2021b]. For types larger than 32-bit, two or more consecutive registers may be used, aligned to the data size as shown in Figure
6. At a hardware level, registers are organized in banks, where an instruction may only read one value from each bank per clock cycle [Gray
2016]. In our allocator, we ignore such
bank conflicts, as the compilation overhead exceeds any potential benefit for execution, which is dominated by data loads and stores.
We use a modified linear scan allocator [Poletto and Sarkar
1999], considering register pairs and alignment, as it has proved efficient in JIT contexts and produces reasonable allocations compared to Chaitin’s algorithm [Chaitin
1982; Briggs et al.
1994]. In our preliminary evaluation, we also observed that further reducing registers had little impact on performance and was expensive to compute. This approach is supported by posts on NVIDIA’s forums that indicate that the proprietary scheme will use all “necessary” registers to improve performance.
As GPUs contain a large number of registers (255) that far exceeds the number required for our benchmarks (always below 100), neither spilling nor register pressure is considered by the kernel fusion in the frontend compiler or by our register allocator. Note that Ampere requires an additional 2 registers be allocated per thread for program counters.
Linear scan register allocation uses live intervals, conservative ranges that do not consider dead sections. The input PTX code should therefore share virtual registers only when necessary and keep register use short. Registers are allocated for each virtual location using these intervals (a sketch of this loop follows below):
(1)
Compute live intervals (start,end), sorted by start position;
(2)
For each interval (start,end):
(a)
Expire intervals in the live allocations whose end precedes start, freeing their registers;
(b)
Allocate aligned registers RX–RY;
(c)
Add interval (start,end,RX–RY) to the live allocations;
Predicate registers are allocated independently alongside regular registers. During code generation, temporary registers are allocated above the high-water mark.
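The following sketch illustrates the allocation loop (hypothetical types; it assumes intervals are pre-sorted by start position, that multi-register values must begin on a register aligned to their size, and that spilling never occurs, per the discussion above):

    #include <vector>

    // Hypothetical sketch of linear scan with aligned register groups.
    // 'size' is the number of consecutive 32-bit registers required (1, 2, ...).
    struct Interval { int start, end, size, reg = -1; };

    void allocate(std::vector<Interval>& intervals, int registerCount = 255) {
        std::vector<Interval*> active;                 // live allocations
        std::vector<bool> used(registerCount, false);  // physical register map

        for (auto& interval : intervals) {             // intervals sorted by start
            // (a) Expire intervals that end before this one starts, freeing registers
            for (auto it = active.begin(); it != active.end(); ) {
                if ((*it)->end < interval.start) {
                    for (int i = 0; i < (*it)->size; ++i) used[(*it)->reg + i] = false;
                    it = active.erase(it);
                } else {
                    ++it;
                }
            }

            // (b) Allocate the first free register group, aligned to the data size
            for (int reg = 0; reg + interval.size <= registerCount; reg += interval.size) {
                bool free = true;
                for (int i = 0; i < interval.size; ++i) free = free && !used[reg + i];
                if (!free) continue;
                for (int i = 0; i < interval.size; ++i) used[reg + i] = true;
                interval.reg = reg;
                break;
            }

            // (c) Record the interval as a live allocation (no spilling required)
            active.push_back(&interval);
        }
    }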
5 Code Generation
Given the register allocation, we then translate from PTX to SASS, generating high-performance code without the need for expensive optimization. In particular, we address control-flow structuring and efficient code generation patterns. While our techniques may be generalized to other domains, we avoid unstructured control-flow and optimize for code templates used by database queries.
5.1 SASS
SASS is a machine-level language designed for NVIDIA GPUs that varies between GPU iterations. Unfortunately, while PTX follows an open source specification, NVIDIA has yet to publish a detailed view of SASS. Multiple homegrown assembler projects have therefore attempted to reverse-engineer the machine code specification (asfermi [Yunqing
2015], MaxAs [Gray
2016], TuringAs [Yan et al.
2020], CuAssembler [cloudcores
2021]), and a recent paper by Hayes et al. presents a systematic approach to determine instructions regardless of architecture [
2019]. Microbenchmarks have also been successful at discovering low-level hardware details [Jia et al.
2018,
2019]. These projects are our primary source of low-level details, complemented by NVIDIA binary tools [NVIDIA
2021a] and our own reverse-engineering efforts. Shown in Figure
7 are example SASS programs for the Pascal and Ampere architectures that increment all values in an array.
5.2 Structured Control-flow
Control-flow poses additional challenges for GPU compilers, as the underlying hardware uses SIMD units and threads are not guaranteed to execute the same path. Additionally, the execution is platform dependent and differs between Pascal and Ampere. We therefore limit our implementation to well-defined structured control-flow found in query programs and generate the appropriate templates. Typical if(-else) statements, loops, and unlabelled breaks are recovered from the CFG using dominator/post-dominator analyses, and in future work could be generated directly by the front-end and represented in the IR.
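As a sketch of the underlying analysis (hypothetical CFG representation; a standard iterative formulation rather than the authors' implementation), post-dominator sets can be computed by a fixed-point iteration over the reverse flow, after which the reconvergence point of a branch is its nearest strict post-dominator:

    #include <algorithm>
    #include <iterator>
    #include <set>
    #include <vector>

    // Hypothetical CFG: blocks are indices with successor lists; 'exitBlock' is the
    // unique block with no successors. postdom(b) = {b} union the intersection of
    // postdom(s) over all successors s of b, iterated to a fixed point.
    using Successors = std::vector<std::vector<int>>;

    std::vector<std::set<int>> postDominators(const Successors& successors, int exitBlock) {
        int count = static_cast<int>(successors.size());
        std::set<int> all;
        for (int block = 0; block < count; ++block) all.insert(block);

        std::vector<std::set<int>> postdom(count, all);
        postdom[exitBlock] = { exitBlock };

        bool changed = true;
        while (changed) {
            changed = false;
            for (int block = 0; block < count; ++block) {
                if (block == exitBlock) continue;
                std::set<int> result = all;
                for (int successor : successors[block]) {
                    std::set<int> meet;
                    std::set_intersection(result.begin(), result.end(),
                                          postdom[successor].begin(), postdom[successor].end(),
                                          std::inserter(meet, meet.begin()));
                    result = std::move(meet);
                }
                result.insert(block);
                if (result != postdom[block]) { postdom[block] = std::move(result); changed = true; }
            }
        }
        return postdom;  // the immediate post-dominator is the nearest strict element
    }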
5.2.1 Pascal.
On the Pascal architecture, instructions are issued for an entire warp using a single program counter, even with branching and control-flow. If all threads within a warp take the same path through the program, then standard SIMD units are sufficient. However, if threads within a warp
diverge and take separate paths, then the GPU must serialize execution of the branches and manage active threads for each path. Control-flow can be implemented using two approaches, first with
predication and second through a
divergence stack. The simplest solution, predication (
@P prefix), allows an instruction to be issued to the SIMD units but only executed on threads flagged by a mask. This is effective for simple branching like if statements but challenging for nesting and unstructured graphs. To support these more complex cases, a hardware stack keeps track of the divergence points within a warp (branching), and the active threads for each path [Coon et al.
2010,
2012; Nickolls et al.
2011; Mayank et al.
2015; Kothiya
2014]. Note that regardless of the approach, divergent execution paths within a warp are serialized at high cost [Bialas and Strzelecki
2016], while divergence between warps has no added overhead.
Shown in Figure
8 is the code and corresponding progression for an example if-else statement managed using the divergence stack. Each token on the stack conceptually represents a future execution path and contains metadata for its program counter and associated mask. Before the conditional branch is executed, the program thus pushes a
SYNC token onto the stack using the
SSY instruction, indicating the
reconvergence point of the structure as well as the currently active threads. The reconvergence point serves as the continuation point once both branches have been executed. Next, a predicated branch instruction splits threads into two paths and a
DIV (divergence) token is pushed containing the location and thread mask of the
path not-taken. The active mask is also updated with the subset of active threads for the
path taken. Execution continues along the taken path until complete (
SYNC instruction). The stack is then popped and the program continues along the other path indicated by the popped
DIV token. Once both paths have been executed, the program continues from the reconvergence point specified by the
SYNC token.
5.2.2 Ampere.
More recent architectures like Ampere use
Independent Thread Scheduling, storing a separate program counter for each thread [Luke Durant and Stam
2017]. This allows complex control-flow without a dedicated hardware stack but must be used cautiously to ensure high performance. In particular, note that since SIMD units are still used for execution, divergence within a warp causes the same instructions to be issued multiple times if divergent threads arrive at different time points [NVIDIA
2021f]. Two new reconvergence methods have thus been implemented, a
WARPSYNC instruction that forces reconvergence of all threads within a warp and a more flexible barrier resource alternative [NVIDIA
2021a,
2021f]. The latter is equivalent to a limited height divergence stack and replaces the prior
SSY/
SYNC instructions with their barrier equivalents
BSSY/
BSYNC. These new instructions operate on barrier resources
B0–B15, each of which counts the number of threads signaled using
BSYNC and jumps to the continuation point set by
BSSY when complete. Unconventional control-flow is thus simplified and can easily reconverge at arbitrary locations.
5.2.3 Patterns.
We define code generation patterns shown in Figure
9 for each control-flow kind, allowing for nested structures. Immediate post-dominators are used for the reconvergence point as they naturally capture the exit of loops and if-else structures [Fung et al.
2007]. In the case of loops, a conditional
SYNC instruction updates the thread mask until none are active and the continuation point is then followed. For simplicity, we opt for an equivalent strategy on both architectures, mapping each position in the divergence stack to a distinct barrier resource on Ampere. Note that simple if-else branches may be flattened and implemented directly using predication.
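As an illustration of the if-else template (hypothetical emitter and label syntax; Pascal mnemonics shown, with BSSY/BSYNC and a barrier resource substituted on Ampere as described above):

    #include <string>
    #include <vector>

    // Hypothetical sketch: emit the Pascal if-else template from Section 5.2.
    // 'predicate' guards the branch (e.g., "P0"); RECONVERGE is the immediate
    // post-dominator of the branch, used as the reconvergence point.
    std::vector<std::string> emitIfElse(const std::string& predicate,
                                        const std::vector<std::string>& trueBody,
                                        const std::vector<std::string>& falseBody) {
        std::vector<std::string> code;
        code.push_back("      SSY RECONVERGE;");            // push SYNC token (reconvergence point)
        code.push_back("@!" + predicate + " BRA ELSE;");    // diverge: DIV token holds the path not-taken
        for (const auto& instruction : trueBody)            // taken path
            code.push_back("      " + instruction);
        code.push_back("      SYNC;");                      // pop DIV token, continue on the other path
        code.push_back("ELSE:");
        for (const auto& instruction : falseBody)           // path not-taken
            code.push_back("      " + instruction);
        code.push_back("      SYNC;");                      // pop SYNC token, reconverge
        code.push_back("RECONVERGE:");
        return code;
    }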
5.3 Patterns
We use template-based code generation, mapping each PTX operation to its equivalent SASS instruction sequence. Each pattern incorporates power-of-two strength reduction and NVIDIA’s optimizations (e.g.,
XMAD vs.
IMAD for multiplication as shown in Figure
10 [Emmart et al.
2016]). We assume the frontend code generation is efficient, avoiding redundant expressions and dead code. This simplified translation strategy allows us to minimize time spent in instruction selection.
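A sketch of this table-driven selection is shown below (hypothetical data structures; the XMAD expansion is abbreviated, with the exact operand modifiers of the three-instruction sequence from Emmart et al. elided):

    #include <functional>
    #include <map>
    #include <string>
    #include <vector>

    // Hypothetical sketch of template-based instruction selection: each PTX opcode
    // maps to a function producing its SASS sequence for the given operands
    // (destination first).
    using Operands = std::vector<std::string>;
    using Template = std::function<std::vector<std::string>(const Operands&)>;

    std::map<std::string, Template> patterns = {
        { "add.s32", [](const Operands& o) -> std::vector<std::string> {
            return { "IADD " + o[0] + ", " + o[1] + ", " + o[2] + ";" };
        }},
        // 32-bit multiply on Pascal: three 16-bit XMAD multiply-adds rather than
        // the slower IMAD path (modifiers omitted in this sketch; see Figure 10).
        { "mul.lo.s32", [](const Operands& o) -> std::vector<std::string> {
            return { "XMAD ...;", "XMAD ...;", "XMAD ...;" };
        }},
        // Power-of-two strength reduction is folded into the patterns, e.g.
        // multiplication by 2^k lowered to a left shift instead of a multiply.
        { "shl.b32", [](const Operands& o) -> std::vector<std::string> {
            return { "SHL " + o[0] + ", " + o[1] + ", " + o[2] + ";" };
        }},
    };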
5.4 Optimization
Given the generated SASS code, we use peephole optimization to remove dead operations (e.g., storing values to RZ) and redundant moves (e.g., MOV R0, R0) without resorting to frequent special-case handling—a typical compiler approach to optimization. As the code generation patterns already employ algorithmic optimizations, this proves sufficient in practice to generate high-performance code and is a key idea behind rNdN—defining the minimal compilation steps necessary for high-performance execution. Optimizations requiring fixed-point analyses, such as constant propagation and dead-code elimination, were too expensive in our preliminary evaluations.
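A sketch of this pass over the generated code (hypothetical instruction representation; real instructions with side effects, such as those also writing predicates, would be excluded):

    #include <string>
    #include <vector>

    // Hypothetical instruction form: opcode plus operand registers, destination first.
    struct Instruction { std::string opcode; std::vector<std::string> operands; };

    // Peephole sketch: drop self-moves (MOV R0, R0) and dead operations whose
    // only result is written to the zero register RZ.
    void peephole(std::vector<Instruction>& code) {
        std::vector<Instruction> optimized;
        for (const auto& instruction : code) {
            // Redundant move: source and destination registers are identical
            if (instruction.opcode == "MOV" && instruction.operands.size() == 2 &&
                instruction.operands[0] == instruction.operands[1])
                continue;
            // Dead operation: the destination is RZ, so the result is discarded
            // (memory stores, which take an address first, are not affected)
            if (!instruction.operands.empty() && instruction.operands[0] == "RZ" &&
                instruction.opcode.rfind("ST", 0) != 0)
                continue;
            optimized.push_back(instruction);
        }
        code = std::move(optimized);
    }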
6 Scheduler
NVIDIA GPUs rely on compile-time scheduling to maximize
instruction-level parallelism (ILP) and hide the latency of each operation. We first explore the required GPU scheduling directives, before outlining the properties of an efficient schedule. Last, we describe our instruction scheduling implementation, formalizing and extending the approach taken by MaxAs [Gray
2016].
6.1 Directives
Embedded within the machine code are scheduling directives generated at compile time and used by the warp scheduler during execution. Referred to as SCHI directives, they may be present for groups of three instructions (e.g., Pascal) or embedded directly in the instruction (e.g., Ampere) [Gray
2016; Hayes et al.
2019; Jia et al.
2019]. The directives' binary layout is shared by both architectures and shown in Figure
11, and includes the following fields:
•
Number of cycles to stall after dispatch;
•
Yield flag, allowing switching to other warps; and
•
Barrier indexes to set for variable-latency instructions, and a mask of barriers to wait on before dispatch.
Together these elements ensure that dependencies between instructions are satisfied. In particular, stall counts allow fixed-latency instructions to complete before issuing dependent operations, and barriers handle cases of variable-latency instructions (e.g., memory loads/stores). There are six barrier resources (
SB0–SB5), shared for reads and writes, each containing a scoreboard register [Ohannessian Jr et al.
2015]. Issuing an instruction with a read or write barrier increases the scoreboard value, while signaling decreases it. Waiting on a barrier resource with an SCHI directive requires that the corresponding scoreboard register be zero before the instruction is available to dispatch. The directive also contains a yield hint, allowing the scheduler to switch between warps [Gray
2016; Yan et al.
2020] and operand cache flags, which enable a small register operand cache that avoids bank conflicts. Both hints are optional but may improve parallelism in some contexts. Note that as database queries are likely dominated by data loads and stores, the performance impact of scheduling is limited.
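For concreteness, the sketch below packs one instruction's directive assuming the 21-bit control layout reverse-engineered by MaxAs for Maxwell, which Pascal shares (4-bit stall count, yield bit, 3-bit write and read barrier indexes, 6-bit wait mask, 4-bit reuse flags); field widths and positions on Ampere differ and are not shown.

    #include <cstdint>

    // Sketch of one instruction's scheduling directive, assuming the per-instruction
    // control layout documented by MaxAs for Maxwell/Pascal:
    //   |reuse:4|wait:6|readBarrier:3|writeBarrier:3|yield:1|stall:4|
    // A barrier index of 7 conventionally means "no barrier" in these encodings.
    struct Directive {
        uint8_t stall        = 0;     // cycles to stall after dispatch (0-15)
        bool    yield        = false; // allow switching to another warp
        uint8_t writeBarrier = 7;     // barrier set when a variable-latency result is written
        uint8_t readBarrier  = 7;     // barrier set when source operands may be reused
        uint8_t waitMask     = 0;     // bitmask of barriers SB0-SB5 that must be clear
        uint8_t reuse        = 0;     // operand (register) reuse cache flags

        uint32_t pack() const {
            return  (uint32_t(stall)        & 0x0f)
                 | ((uint32_t(yield)        & 0x01) << 4)
                 | ((uint32_t(writeBarrier) & 0x07) << 5)
                 | ((uint32_t(readBarrier)  & 0x07) << 8)
                 | ((uint32_t(waitMask)     & 0x3f) << 11)
                 | ((uint32_t(reuse)        & 0x0f) << 17);
        }
    };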
6.2 Instruction Properties
Each instruction is associated with an architecture-specific
instruction class of scheduling properties, extending the idea used by MaxAs [Gray
2016] for the older Maxwell architecture. Using microbenchmarks, we collect the following properties for both Pascal and Ampere:
•
Depth latency for fixed-latency instructions;
•
Read/write latencies for variable-latency instructions;
•
Throughput for the functional unit;
•
Dual issue and register cache support.
Variable latencies are estimates as they vary between executions, whereas fixed-latency instructions are exact. We derive the functional unit and throughput for each instruction based on the programming manual [NVIDIA
2021d] and hardware diagrams.
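A sketch of such an instruction class (hypothetical structure; the six- and five-cycle IADD depths are those quoted in Section 6.3, while the throughput and memory-latency values here are placeholders rather than measured results):

    #include <cstdint>
    #include <map>
    #include <string>

    // Hypothetical sketch of per-architecture instruction classes for the scheduler.
    // Fixed-latency classes give an exact pipeline depth; variable-latency classes
    // require barriers and carry only an estimated latency for the heuristic.
    struct InstructionClass {
        bool     fixedLatency;     // true: stall counts alone suffice for correctness
        uint8_t  depth;            // pipeline depth in cycles (fixed-latency only)
        uint8_t  throughput;       // cycles to issue a full warp on the functional unit
        bool     requiresBarrier;  // variable-latency: must set and signal SB0-SB5
        uint16_t estimate;         // estimated latency for variable-latency instructions
    };

    // Pascal entries; Ampere would use a 5-cycle depth for IADD and its own values.
    // The LDG throughput and estimate below are placeholders, not microbenchmark results.
    std::map<std::string, InstructionClass> pascalClasses = {
        { "IADD", { /*fixed*/ true,  /*depth*/ 6, /*throughput*/ 1, /*barrier*/ false, /*est*/ 0   } },
        { "LDG",  { /*fixed*/ false, /*depth*/ 0, /*throughput*/ 4, /*barrier*/ true,  /*est*/ 200 } },
    };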
6.3 Schedule Properties
Instruction scheduling must produce an efficient order of instructions that satisfies dependencies. We therefore consider the following instruction and program properties in our approach:
(1)
Fixed-latency instructions must complete;
(2)
Variable-latency instructions must set and signal a barrier resource;
(3)
Independent instructions should be overlapped; and
(4)
Long critical paths should be prioritized.
Note that properties (1) and (2) guarantee correctness, while (3) and (4) improve ILP and minimize unnecessary stalls. We show examples of some properties, along with any required scheduling directives, in Figure
12.
For fixed-pipeline depth instructions such as integer and single-precision, dependent instructions must wait until the prior instruction is complete. In the case of
IADD, this means waiting a minimum of six cycles on the Pascal architecture or five on Ampere. To maximize ILP, independent instructions should be issued while waiting on the previous result. The scheduler must also consider the throughput of each functional unit; when the warp size exceeds the number of units, multiple cycles are required to issue the instruction. Although the hardware will insert necessary stalls for throughput [Gray
2016], the scheduler should still consider their effect on performance. The exact number of functional units, throughput, and pipeline depths vary between platforms.
In the case of variable-latency instructions shown in Figure
12(b), the scheduler must insert necessary barriers, ensuring source registers are free for reuse (read dependency), or destination registers are written (write dependency). For an efficient schedule, the scheduler must hide the high latency of memory reads and writes with independent instructions as they are typically among the most expensive operations. Additionally, since the number of barrier resources is limited, an assignment policy is required to avoid excessive sharing and the associated performance degradation. Last, as pictured in Figure
12(c), an efficient schedule should prioritize instructions with a long chain of dependencies so they can be overlapped by independent instructions.
6.4 List Scheduler
We implement list scheduling [Gibbons and Muchnick
1986] at the basic block level, generating a topological order for instructions based on a scheduling heuristic. A similar algorithm is used in MaxAs [Gray
2016], although we extend it with an updated heuristic and support automatic barrier insertion. In addition, we automatically decompose the basic block by producing a separate dependency graph for each
schedulable section, sequences of instructions free of control dependencies, and capitalize on long sequences of code generated by branch inlining. Synchronization directives naturally impose constraints on the graph, allowing us to optimize the representation and remove excessive edges. Note that an efficient dependency graph representation is required for list scheduling to avoid the allocation overhead as well as frequent heuristic and state updates. For each schedulable section, we produce an optimized schedule:
(1)
Initialize available instructions and heuristics;
(2)
While next available instruction:
(a)
Insert dependency barriers;
(b)
Update stall count of previous instruction;
(c)
Update available instructions and their heuristic;
(3)
Wait on all active barriers.
As an extension to list scheduling, we maintain two times for each instruction: expected time and available time, splitting the legal time an instruction is available from its expected execution time (e.g., waiting on a barrier). Dependency barriers are automatically inserted for variable-latency dependencies as their completion time is unknown.
Based on the expected value, our scheduling heuristic selects the next available instruction to reduce stall counts and improve ILP, while considering any fixed-latency or variable-latency dependencies. For variable-latency instructions, we approximate the stall based on microbenchmarks. If equal, then we prefer instructions with a longer expected critical path (path to exit) as their effect on execution is higher. Last, in the event of a tie, the line number is used to produce a deterministic order. Scheduling ends by conservatively requiring all instructions be complete, either by stalling for the entire instruction duration (fixed-latency) or through the appropriate barrier (variable-latency).
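The selection heuristic can be sketched as follows (hypothetical candidate structure; expected time, available time, and critical-path length are assumed to be maintained incrementally as instructions are scheduled):

    #include <cstdint>
    #include <vector>

    // Hypothetical candidate node used by the list scheduler sketch below.
    struct Candidate {
        uint32_t available;     // earliest legal issue time (dependencies satisfied)
        uint32_t expected;      // expected execution time (e.g., after barrier waits)
        uint32_t criticalPath;  // length of the longest dependency chain to the exit
        uint32_t line;          // original position, used as a deterministic tie-breaker
    };

    // Pick the next instruction: minimize the expected stall at the current time,
    // then prefer longer critical paths, and finally fall back to program order.
    const Candidate* select(const std::vector<Candidate>& available, uint32_t currentTime) {
        const Candidate* best = nullptr;
        auto stall = [&](const Candidate& c) {
            return c.expected > currentTime ? c.expected - currentTime : 0u;
        };
        for (const auto& candidate : available) {
            if (best == nullptr ||
                stall(candidate) <  stall(*best) ||
               (stall(candidate) == stall(*best) && candidate.criticalPath >  best->criticalPath) ||
               (stall(candidate) == stall(*best) && candidate.criticalPath == best->criticalPath &&
                candidate.line < best->line))
            {
                best = &candidate;
            }
        }
        return best;
    }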
Note that as the Pascal architecture has two dispatch units, it can issue two independent instructions for separate functional units each clock cycle. In our workload, performance improvements from dual issue are negligible, likely because query programs are read/write intensive and dual issue is ineffective at hiding this latency. Register reuse is similar, where the cost of reuse-scheduling outweighs improvements. As further discussed in our evaluation, the cost of optimizations such as list scheduling must be carefully balanced against their impact on execution.
6.5 Barriers
Barrier waits inserted in the SCHI directives must wait on all instructions that set the barrier, regardless of the actual dependency. This is problematic for programs with numerous variable-latency instructions, as the number of barrier resources is limited. To circumvent this issue, NVIDIA associates a scoreboard register with each barrier, representing the number of unsignaled dependencies, and provides a
DEPBAR instruction that can wait until the count drops below a threshold. During scheduling, the compiler can opt to use these partial barrier instructions if waiting on an earlier dependency. An example use case is shown in Figure
13, where the
IADD instruction is only dependent on the first of three loads and would therefore benefit from a partial barrier.
This strategy is effective for hiding high-latency instructions but requires that instructions share a barrier and signal it in order. For this reason, we reserve counting barriers for load and store instructions, while using SCHI directives for shorter operations with variable latencies. To simplify the logic, we use separate barriers for reads and writes and count the number of interim instructions incrementing the scoreboard for the partial barrier. In contrast, NVIDIA's patent on software-supported scoreboarding suggests more aggressive barrier sharing for both reads and writes [Ohannessian Jr et al.
2015]. Our list scheduler automatically inserts the appropriate barrier, either counting or SCHI, depending on the instruction dependencies. Note that due to unknown changes to load instruction re-ordering, we have been unable to adapt this strategy directly to Ampere; further investigation is required.
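The resulting wait logic can be sketched as follows (hypothetical interface and textual syntax): because loads sharing a counting barrier signal it in issue order, a dependency on an earlier load only needs the scoreboard to drop to the number of loads issued after it, while a dependency on the last load falls back to a full SCHI wait.

    #include <string>

    // Hypothetical sketch: emit the wait for a dependency on the load at position
    // 'dependencyIndex' (0-based) among 'outstanding' loads sharing counting
    // barrier SB<barrier>, which signal in issue order.
    std::string emitWait(unsigned barrier, unsigned dependencyIndex, unsigned outstanding) {
        unsigned remaining = outstanding - dependencyIndex - 1;  // loads we need not wait for
        if (remaining > 0) {
            // Partial wait: proceed once the scoreboard drops to 'remaining'.
            return "DEPBAR.LE SB" + std::to_string(barrier) + ", " +
                   std::to_string(remaining) + ";";
        }
        // Full wait: encode the barrier in the wait mask of the next SCHI directive.
        return "/* wait on SB" + std::to_string(barrier) + " via SCHI mask */";
    }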
7 Evaluation
In this section, we evaluate our implementation, including a breakdown of execution and compilation phases, comparison against recent open source CPU and GPU systems, and the impact of assembly phases on compilation and execution. Last, we consider the scalability of our approach.
7.1 Methodology
Experimental Setup:
All experiments were performed on Ubuntu 20.04 LTS, equipped with an Intel i7-8700K @ 3.7GHz, 32-GB DDR4 RAM, and an NVIDIA GeForce RTX 3080 (Ampere). Similar trends and conclusions hold throughout our evaluation on an older NVIDIA GeForce GTX 1080 Ti (Pascal), not pictured. We use CUDA 11.4.2 and graphics driver 470.57.02. Experiments are averaged over 10 runs (dropping the highest and lowest) after five warm-up iterations. Comparison systems and their characteristics are listed in Table
1, each using their respective execution plans. ptxas represents the existing r3d3 system using NVIDIA’s PTX compiler APIs.
Both
-O0 and
-O3 are evaluated, comparing our performance against both the optimizing and non-optimizing assemblers. r3d3 thus shows the impact from query parallelization, while rNdN illustrates the benefit specific to this work. We include single-threaded CPU comparison points to evaluate the effectiveness of offloading database queries to accelerators even with the additional overhead. Evaluating the impact of multi-core CPUs and considering the tradeoffs is left as future work.
TPC-H Benchmark:
We use the TPC-H benchmark at
scale factor (SF) 1 (corresponding to 1 GB input data) to evaluate the main performance of our approach and scale factors 2/4/8 for scalability (limited by the available GPU memory on our system) [TPC Council
2017]. It consists of 22 SQL queries for a business context and is commonly used for performance evaluation of database systems. HorsePower, r3d3, and rNdN support the entire benchmark suite using query plans generated from HyPer [Neumann
2011]. MonetDB is equally complete. OmniSci does not support three queries: q14, q21, and q22; BlazingSQL requires modified input queries due to parsing requirements
and does not support four queries: q11, q15, q21, and q22. Note that as data sizes increase, some additional queries may no longer be supported due to memory constraints. As queries execute on substantial data and the output is validated, reasonable correctness is ensured for each compiler phase. Architecture details like fixed-latency stalls were confirmed by microbenchmarks.
7.2 Motivation
Motivating our focus on efficient compilation, we begin by considering the baseline performance of r3d3 using NVIDIA’s assembler on the TPC-H benchmark. Shown in Figure
14 are execution and compilation breakdowns of both the optimizing assembler (
-O3) and non-optimizing variant (
-O0). Execution is divided into CPU and GPU computation (e.g., kernel and analysis), data caching and intermediate operations (e.g., transfers, resizing), and overhead, while compilation is split into the stage 1 and 2 phases. Stage 1 is provided by the r3d3 system, while stage 2 corresponds to NVIDIA's PTX compiler API. All times are measured in milliseconds; note that the right (compilation) scale is 3
\(\times\) that of the left (execution) scale, with annotations for bars exceeding the plot.
Notably, we observe that compilation time greatly exceeds execution throughout the benchmark, representing the largest cost even with optimizations disabled. In fact, the marginal improvements in compilation time from the non-optimizing backend are offset by significant slowdowns in execution and give no significant end-to-end speedup. Short-running queries would therefore benefit from a more targeted and balanced approach that performs the necessary performance optimizations without the typical overhead—the cornerstone of our approach. Note that the compilation time of q1 is higher with -O0 than with -O3, likely due to the larger code size when unoptimized.
7.3 Execution Breakdown
We next present the breakdown of execution and compilation time using our rNdN system, pictured in Figure
15. As in the motivation, we include data caching and intermediate management, CPU and GPU computation, and overhead for execution, while compilation is divided into two parts, compile and assemble, corresponding to stages 1 and 2 of the pipeline. Stage 1 is provided by the existing r3d3 system, while stage 2 represents our new optimized assembler. All timings are measured in milliseconds with identical scales for execution and compilation.
Overall, execution time exceeds that of compilation, with a few disproportionate exceptions (q2 and q11) due to the relatively high complexity of those programs compared to their execution. This is a notable improvement over previous query compilers, which typically showed execution time dominated by compilation. We do re-affirm, however, that assembly is the most expensive part of the pipeline. In terms of execution, most computation is GPU-resident, with the exception of ordered string comparisons (e.g., @order in q16) and @substring in q22. High-performance GPU kernels are thus a necessity for best results. As with most GPU programs, data transfer is costly, which highlights the importance of caching.
Shown in Figure
16 is a breakdown of compilation, indicating the relative proportion of each phase in the pipeline. Included are the frontend (stage 1) compiler and a more detailed view of the backend, covering control-flow structuring, the register allocator, code generator, instruction scheduler, peephole optimizer, and finally the binary assembler. A similar breakdown for NVIDIA's ptxas, used by r3d3, is not possible as the implementation is closed source.
Most queries follow the same pattern, with the backend dominating the frontend. The exception to this, q6, generates a small PTX program with limited control flow and is thus simple to compile. In terms of backend phases, register allocation is the most expensive, requiring computation of live variables using a fixed-point analysis. Scheduling, code generation, and control-flow structuring take up the remaining time in roughly equal proportions, while optimization and binary assembly are fast. Backend compilation is thus the primary cost, which we attribute largely to the relative complexity of low-level PTX compared to the high-level input query language, HorseIR.
7.4 Performance Comparison
We define 3 comparison points, evaluating common usage scenarios. First, we evaluate compilation time, measuring the cost of compiling the query before execution. Next, we measure cached execution, assuming a pre-compiled query with GPU-resident data. Together these results show the effectiveness of trading compilation for execution. Last, we show total execution, measuring the query end-to-end including data caching and compilation. This represents cold-start, or situations where the data or query is changing (e.g., parameterization). All graphs show speedups normalized to the performance of each comparison system, with higher results better for our system and “x” indicating the comparison system failed.
Compilation: Starting with compilation time in Figure
17(a), we see significant performance improvements across the board. Importantly, this holds true for both ptxas
-O0 and
-O3 with 7.6
\(\times\) and 8.7
\(\times\) geomean speedups, respectively. This shows the performance of our approach even against non-optimizing systems. Note that we observed a significant fixed cost with NVIDIA's assembler on empty programs, which explains the large speedup for q6, a simple query. OmniSci and HorsePower both use compilers designed for offline use and are thus inefficient in comparison. As compilation is the largest cost for many queries, this overhead has a significant impact on end-to-end results.
Execution: Next we consider the cached execution scenario seen in Figure
17(b), measuring only query execution and the performance of the generated code. Compared to ptxas
-O3, we achieve 0.97
\(\times\) speedup, with most queries above 90% performance. Using
-O0, this speedup increases to 1.7
\(\times\). rNdN therefore produces code comparable to ptxas
-O3 at significantly lower cost than
-O0. In relation to the other baseline systems, we see overall speedups with few exceptions. Queries with slowdowns typically have inefficient group computations (q1 and q3) and require further investigation.
Tradeoff: Last, we present the total execution comparison in Figure
17(c). Notably, we see speedups on all queries compared to ptxas with both
-O0 and
-O3, indicating that rNdN successfully minimizes compilation cost without excessively slowing execution. Interestingly, geomean results for
-O0 and
-O3 are near identical, indicating that compilation speedup from skipping optimizations is matched by slowdown in execution. Compared to other compiled systems (OmniSci, HorsePower), we see significant performance improvements in all queries, largely due to optimized compilation. Compared to non-compiled systems we also see overall speedup, including the cost of data transfers.
GPUs are thus effective for end-to-end query optimization, even in cold-start cases.
7.5 Register Allocation
At runtime, GPUs assign thread blocks to multiprocessors, maximizing instruction-level parallelism by saturating each warp scheduler. However, as the number of registers is fixed, register pressure limits the number of thread blocks in flight (i.e., occupancy). We thus evaluate the effectiveness of our allocation by comparing the register usage of rNdN and ptxas for each kernel.
As with all register allocation algorithms, linear scan is a heuristic approach, performing better in some cases than others. Overall, we see a 10% geomean increase in register usage compared to ptxas -O3 and a 22% geomean decrease compared to ptxas -O0. As our cached execution is within 5% of NVIDIA's optimizing assembler, register allocation is likely not the most significant bottleneck, supporting our simple approach. Additionally, the number of registers used is significantly lower than the limit across the entire benchmark and never requires spilling (no kernel exceeds 100 registers).
7.6 Scheduler
We next consider the impact of our scheduler on execution and compilation, measuring the impact on compilation, kernel performance and end-to-end evaluation as shown in Figure
18. As a comparison point, we implemented a naive scheduler that maintains the instruction order and does not consider pipelining. This ensures that each instruction completes before the next is dispatched.
Unsurprisingly, we see a slowdown in compilation time over the naive scheduler (a 0.85\(\times\) speedup, i.e., roughly 1.18\(\times\) slower), due to the cost of dependency analysis and heuristics. This is offset by improvements in kernel execution, which increase performance by up to ~25%. Note that as we believe kernel cost is dominated by high-latency instructions (e.g., double-precision, load/store) on the critical path, most queries show lower speedup. While we hoped that these improvements would be sufficient, overall speedup is only 0.95\(\times\), with most queries showing an overall slowdown compared to the naive scheduler. Advanced scheduling is thus most applicable to repeatedly executed queries, or to those with long execution times that can offset the increased overhead. In other cases, it may be advantageous to use naive scheduling for the compilation speedup. We address longer execution in the following section and consider its performance as data sizes increase. Further research may also explore interactions with register allocation and false dependencies.
7.7 Scalability
Scalability is an important concern for database systems, as they are inherently designed for big data. rNdN excels at short-running queries, an area where compilation overhead is a significant barrier to adoption. As data sizes increase, however, and computation and data transfers become more costly, the fixed compilation overhead we target clearly has less relevance. We therefore extend our analysis to larger scale factors on the TPC-H benchmark and consider the performance of our approach on increasing data sizes within the constraints of a single-GPU system.
7.7.1 Motivation.
We begin our scalability analysis by considering the performance of r3d3 with NVIDIA's ptxas assembler on the TPC-H benchmark at SFs 2, 4, and 8, corresponding to 2-, 4-, and 8-GB input data, respectively. Beyond this scale, the GPU memory is exceeded on some queries—a concern that requires additional runtime data management. Shown in Figure
19 is a percentage breakdown analysis for end-to-end execution at each scale factor. We can therefore determine the relative importance of each component as data grow, giving direction to future research.
As in our initial analysis, we observe a significant compilation overhead for all queries, even as data sizes increase and the computation/data transfer costs are correspondingly larger. In particular, the cost of the proprietary assembler (i.e., backend compiler) remains the single largest component for nearly all queries and at all scales. Exceptions include simple queries with large input data compared to computation (q6), and those with expensive execution (q1, q3, and q18). However, even in these cases, compilation is still a noteworthy overhead. A runtime-suitable backend compiler is thus advantageous even for larger input data, provided that execution is reasonably competitive.
7.7.2 rNdN.
Next, we show the percentage breakdowns for our rNdN system on the TPC-H benchmark in Figure
20. As expected, data transfers and computation become increasingly important as the input size grows, corresponding to the trend with ptxas. However, the compilation overhead is now significantly smaller throughout the scale factors, showing that a fast compilation pipeline is important for both short-running queries and those that operate on larger data. Further optimization at higher scales thus rests largely on improvements in data management and the database operations themselves, while smaller data sizes would benefit from additional compiler tuning.
7.7.3 Performance Comparison.
We next compare our performance at SF 8 against other database systems, focusing on the cached and end-to-end evaluations shown in Figure
21. Note that some additional queries may fail on the comparison systems due to memory limitations.
Execution: In terms of cached results, we see competitive performance compared to all baseline systems and geomean speedup over most. On the CPU side, we continue to outperform all comparison points, with larger speedups than at SF 1. This is likely due to the higher-throughput processor used in our approach. On the GPU side, we significantly outperform BlazingSQL, although our performance compared to OmniSci is mixed—some queries show significant speedup (q2, q4, q11, q16, and q20), while others show significant slowdown (q1 and q3). The geomean speedup over OmniSci thus decreases from 3.8\(\times\) at SF 1 to 2.1\(\times\) at SF 8. Compared to ptxas, we see improved performance over -O0 and competitive performance (0.96\(\times\)) against -O3 – a similar result to SF 1.
Tradeoff: On end-to-end evaluations, our approach is clearly beneficial, with geomean speedup over all other systems despite focusing on compilation time rather than algorithmic optimization. This is especially clear for compiled databases and remains true for those that use interpreters. A notable exception is q17 on MonetDB, although our performance was already poor at SF 1. Importantly, we maintain geomean speedups in relation to ptxas despite the small slowdowns in execution—a result owing to the vastly improved compilation time. Although the gains are less impressive than at SF 1, this is to be expected, as our work addresses the fixed cost of compilation. Note that although we show slowdowns on several queries compared to BlazingSQL, their data transfer times have been excluded due to their unrealistic cost.
7.7.4 Scheduler.
At lower scale factors, the cost of list scheduling outweighs the improvements in execution in nearly all queries. We therefore consider its use for larger input data where the execution is significantly longer and compilation overhead less important. Shown in Figure
22 is the performance of list scheduling compared to a naive baseline that does not reorder instructions or consider pipelining. Notice that the overall performance of most queries is now competitive with the naive scheduler despite the increase in compilation cost, a notable improvement over lower scale factors. List scheduling is thus useful in some contexts and should be investigated further. In particular, reductions in cost or a heuristic approach to selecting the best plan may be ideal.
9 Conclusion
Query compilation is limited by the additional overhead, a problem especially present in GPU database systems that rely on heavy proprietary compilers. Our open source and balanced approach to PTX assembly greatly reduces these costs without overly sacrificing execution speed, resulting in significant overall performance gains. This overcomes a major bottleneck in the practical use of query compilation for GPUs, beneficial even as data scales on single-GPU systems.
Balancing compilation cost and execution is critical to optimizing end-to-end performance. Additional improvements may also be possible through the use of caching or highly optimized implementations of higher-level code transformations. Given the minimal compilation overhead, we are also interested in exploring recompilation and its effect on repeated execution. This would expand our work to support parameterized queries and other runtime values. In addition, exploring the range of GPU inputs like graphics and scientific computation would be very interesting, validating the correctness and applicability of our approach on a wider set of industry benchmarks. Recent research on HorseIR has also explored the integration of user-defined functionality into queries, a direction that might necessitate additional optimization passes that are unnecessary in pure-relational queries [Chen et al.
2021].