
rNdN: Fast Query Compilation for NVIDIA GPUs

Published: 19 July 2023

Abstract

GPU database systems are an effective solution to query optimization, particularly with compilation and data caching. They fall short, however, in end-to-end workloads, as existing compiler toolchains are too expensive for use with short-running queries. In this work, we define and evaluate a runtime-suitable query compilation pipeline for NVIDIA GPUs that extracts high performance with only minimal optimization. In particular, our balanced approach successfully trades minor slowdowns in execution for major speedups in compilation, even as data sizes increase. We demonstrate performance benefits compared to both CPU and GPU database systems using interpreters and compilers, extending query compilation for GPUs beyond cached use cases.

1 Introduction

Query compilation is an effective means of improving the execution time of data-intensive SQL queries [Neumann 2011; Mostak 2014; Chen et al. 2018a, 2020]. As query programs are relatively short, however, the cost of compilation itself often exceeds that of execution and may even be detrimental to overall performance. This problem is especially evident in GPU-based approaches that depend on expensive JIT compilation from manufacturer toolchains when lowering to the exact target architecture [Breß et al. 2017; Krolik et al. 2021]. r3d3, for example, a GPU query compiler and database system, uses the HorseIR array-based representation [Chen et al. 2018a] to greedily identify and fuse GPU kernels in queries based on HyPer query plans, translates them to PTX, NVIDIA’s low-level intermediate representation, and generates CUDA binaries using the proprietary closed-source assembler [Krolik et al. 2021]. When data are cached in GPU memory and the query is pre-compiled, this proves effective, but it falls short on end-to-end performance, as shown in Figure 1, primarily due to the high cost of NVIDIA’s assembler. Online query compilation systems targeting GPU hardware must therefore bypass vendor toolchains in favor of domain-specific pipelines that are judicious in their optimizations and focus on reducing overhead. Other recent approaches to GPU database systems focus instead on exploiting tensor units [Hu et al. 2022; He et al. 2022], optimizing parallel algorithms [Shanbhag et al. 2020], and robustness [Lee et al. 2021].
Fig. 1. Average execution breakdown for queries in the TPC-H database benchmark.
In this work, we extend the existing r3d3 system shown in Figure 2 to form an optimized, end-to-end, and open source pipeline for compiling and assembling SQL queries for NVIDIA GPUs. Notably, we address a major bottleneck in overall query performance through the design and implementation of a runtime-suitable assembler that generates code comparable to optimizing compilers without the typical overhead. This best-of-both-worlds approach acts as a drop-in replacement for the costly proprietary toolchain and is based on one key principle: profitably trading slowdown in execution for improvements in compilation. While challenging for general scientific computation, this work exploits the comparably narrow patterns found in query compilation and the array-based HorseIR representation (e.g., binary operations and well-defined algorithms like join), as well as the pre-optimized intermediate representation (IR) generated by r3d3, to eschew traditional compiler passes like loop transformations and redundancy elimination. We employ a balanced and low-overhead approach that extends well-known techniques for instruction scheduling, register allocation, code generation, and assembly to runtime query compilation for GPUs. We thus couple the execution speed and algorithmic optimizations of r3d3 (e.g., kernel fusion) with a correspondingly optimized compilation pipeline. Additionally, support for any program expressible in array-based form naturally extends from our work, although further optimizations may be necessary for best performance, as we assume the input programs are optimized algorithmically and only require simple lowering (e.g., no register spills). More recent work has looked at the possible support of user-defined functionality in HorseIR queries [Chen et al. 2021].
Fig. 2. r3d3 compiler and runtime overview. NVIDIA’s assembler highlighted in green.
We compare our approach against both interpreted and compiled state-of-the-art CPU and GPU databases and evaluate the performance tradeoffs and breakdown of our proposed system [BlazingSQL Inc. 2021; Mostak 2014; Chen et al. 2018a; Idreos et al. 2012]. Results indicate that our pipeline is effective at improving both compilation and end-to-end time against all comparison systems, while maintaining high speedups in cached scenarios. Importantly, in comparison to NVIDIA’s optimizing assembler, we vastly improve compilation time while showing only minor slowdown in kernel performance. Measured against the non-optimizing variant, our proposed pipeline is notably better balanced for query compilation and maintains significant compilation speedup while also improving execution. Additionally, we demonstrate that our approach remains beneficial even as data sizes grow and the relative importance of fixed-cost compilation decreases against that of execution. Contributions of our work include the following:
A runtime-optimized and open source1 assembler for NVIDIA GPUs that minimizes compilation cost while maintaining high performance on SQL queries. This removes a major limitation on the practical use of query compilation for GPUs.
Exploration of low-level compilation steps and challenges for NVIDIA GPUs, defining and demystifying the necessary compilation pipeline for query programs.
Evaluation on the complete TPC-H benchmark, showing performance improvements on end-to-end scenarios over both CPU and GPU database systems at multiple data sizes.

2 Background

A GPU is a multi-core execution unit optimized for throughput computation [Lindholm et al. 2008]. Abstractly, it follows a hierarchical design for both computation and memory, as shown in Figure 3(a). At the lowest level, a GPU consists of thousands of processing elements (PE), each of which computes the result of a single thread. PEs are then grouped into streaming multiprocessors (SM), which in turn form the GPU. Memory is likewise hierarchical, consisting of private registers for PEs, shared memory for each SM, and globally accessible device memory. The GPU is connected to the host system and memory via the PCIe bus. A GPU program defines the execution of a single thread (SIMT), as well as synchronization within a thread block—inter-block communication is limited. During execution, the collection of thread blocks forms the grid. At runtime, each thread block is assigned to a multiprocessor and executed on the processing elements. More recently, thread block clusters give an additional layer of granularity for sharing and synchronization [NVIDIA 2021d].
Fig. 3. GPU architectures.
While the abstract view is sufficient for most high-level applications, to maximize performance and produce machine code, a lower-level view is necessary. Our approach targets two recent NVIDIA architectures, Pascal [NVIDIA 2016] and Ampere [NVIDIA 2020], while remaining extensible to other GPU platforms and families that follow similar principles. We show the layout of the Ampere architecture in Figure 3(b), describe its important features below and note differences with Pascal.
GPU multiprocessors are divided into partitions, each of which is responsible for the execution of a set of warps. A warp is a collection of 32 consecutive threads that are part of the same thread block. At runtime, warps are assigned to partitions by the hardware as are the scheduling tasks for the GPU.2 Each partition contains a warp scheduler, functional units, and register memory, while shared memory is accessible to the entire multiprocessor.
Each cycle, the warp scheduler picks the next available warp and the dispatch units issue the next instruction to the appropriate functional unit. In the case of Pascal, dual issue is supported for some instruction pairs and improves instruction-level parallelism. Functional units include cores (e.g., single-precision), double precision, special function units (e.g., math functions), and load-store. Integer computation is supported by all cores (e.g., Pascal) or a subset (e.g., Ampere). Other core types like tensors are not considered in our approach but have been evaluated in work like TCUDB [Hu et al. 2022]. Note that the exact number of each functional unit is architecture specific and impacts the throughput at which instructions can be issued [NVIDIA 2021d]. When the number of units is less than the size of a warp, multiple cycles are required to issue the instruction. Note that exact architectural details may differ from this model, but it provides sufficient accuracy to implement a high-performance and competitive assembler.

2.1 Compilation Pipeline

As part of the CUDA toolkit, NVIDIA provides a complete compilation pipeline, translating from CUDA code to an assembled binary in two stages shown in Figure 4 [NVIDIA 2021c]. CUDA code written by the programmer is first compiled to PTX, a low-level NVIDIA-developed IR for GPUs that targets a virtual architecture [NVIDIA 2021e]. This step can be performed on the developer’s machine as the code is mostly platform independent. When the exact device is known, the stage 2 compiler produces a machine-specific binary representation (SASS) that can be executed on the GPU [NVIDIA 2021a]. We adopt NVIDIA’s terminology of assembler for this last stage, although it more resembles a backend compiler responsible for low-level optimizations including instruction scheduling and register allocation.
Fig. 4. NVIDIA CUDA compilation pipeline [NVIDIA 2021c].

3 Overview

rNdN is a GPU assembler that specializes in runtime compilation of SQL queries for NVIDIA GPUs, improving performance on end-to-end evaluations over existing approaches while maintaining fast execution. As a drop-in replacement for NVIDIA’s assembler (stage 2), each step is designed for high-performance code and implemented using algorithms that minimize compilation time. We integrate our approach into the r3d3 GPU database using a purpose-built framework for analysis and code generation supported by third-party optimized data structures [Leitner-Ankerl 2021]. An overview of the system is shown in Figure 5.
Fig. 5. rNdN assembler overview.
The assembler implements the second phase of the compilation pipeline, translating from IR to binary form using GPU-adapted techniques. Given an input PTX program, we first allocate registers before recovering the structured control flow used in our code generation strategy. We then generate low-level, architecture-specific SASS code and fix small inefficiencies using a peephole optimizer. Instruction scheduling produces an efficient instruction order and embeds dependencies in the generated code. Last, we produce a CUDA binary directly in memory using the ELFIO framework [Lamikhov-Center 2020]. The result is loaded and executed on the GPU in an identical manner to using NVIDIA’s own assembler.
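To make the ordering concrete, the following C++ skeleton mirrors the pass sequence just described; all type and function names are illustrative placeholders (the real rNdN classes are richer), and the stub bodies exist only so the sketch compiles.

#include <cstdint>
#include <vector>

// Toy placeholder types standing in for the rNdN IR classes.
struct PTXProgram {};
struct SASSProgram {};
struct RegisterMap {};

// Stub passes corresponding to Sections 4 through 6.
static RegisterMap allocateRegisters(const PTXProgram&) { return {}; }               // linear scan
static void structureControlFlow(PTXProgram&) {}                                     // if/else, loops
static SASSProgram generateSASS(const PTXProgram&, const RegisterMap&) { return {}; }
static void peepholeOptimize(SASSProgram&) {}                                        // dead/redundant ops
static void scheduleInstructions(SASSProgram&) {}                                    // list scheduling, barriers
static std::vector<std::uint8_t> emitBinary(const SASSProgram&) { return {}; }       // in-memory ELF

// Pass ordering as described above: registers, control flow, code generation,
// peephole optimization, instruction scheduling, and binary emission.
std::vector<std::uint8_t> assemble(PTXProgram ptx)
{
    RegisterMap registers = allocateRegisters(ptx);
    structureControlFlow(ptx);
    SASSProgram sass = generateSASS(ptx, registers);
    peepholeOptimize(sass);
    scheduleInstructions(sass);
    return emitBinary(sass);
}

int main() { return assemble(PTXProgram{}).empty() ? 0 : 1; }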

4 Register Allocation

PTX is an intermediate representation targeting a virtual machine with unlimited registers [NVIDIA 2021e]. We must therefore implement a register allocation scheme that considers real GPU properties. In particular, GPUs have a large number of general-purpose 32-bit registers per thread (255), a smaller number of predicate/Boolean registers (7), and, in more recent architectures, uniform registers [Burgess 2019, 2020; NVIDIA 2021a]. This reduces the need for expensive allocators and spilling heuristics, although we must still be mindful due to the impact on occupancy (number of active warps) [NVIDIA 2021b]. For types larger than 32-bit, two or more consecutive registers may be used, aligned to the data size as shown in Figure 6. At a hardware level, registers are organized in banks, where an instruction may only read one value in each bank per clock cycle [Gray 2016]. In our allocator, we ignore such bank conflicts, as the compilation overhead exceeds any potential benefit to execution dominated by data loads and stores.
Fig. 6. Register allocation pairings.
We use a modified linear scan allocator [Poletto and Sarkar 1999], considering register pairs and alignment, as it has proved efficient in JIT contexts and produces reasonable allocations compared to Chaitin’s algorithm [Chaitin 1982; Briggs et al. 1994]. In our preliminary evaluation, we also observed that further reducing registers had little impact on performance and was expensive to compute. This approach is supported by posts on NVIDIA’s forums that indicate that the proprietary scheme will use all “necessary” registers to improve performance.3 As GPUs contain a large number of registers (255) that far exceeds the number required for our benchmarks (always below 100), neither spilling nor register pressure is considered by the kernel fusion in the frontend compiler or our register allocator. Note that Ampere requires an additional two registers be allocated per thread for program counters.4
Linear scan register allocation uses live intervals, conservative ranges that do not consider dead sections. The input PTX code should therefore share virtual registers only when necessary and keep their use short. Registers are allocated for each virtual location using these intervals:
(1) Compute live intervals (start, end), sorted by start position;
(2) For each interval (start, end):
  (a) Free dead allocations;
  (b) Allocate aligned registers RX–RY;
  (c) Add interval (start, end, RX–RY) to live allocations;
Predicate registers are allocated independently alongside regular registers. During code generation, temporary registers are allocated above the high-water mark.
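A minimal C++ sketch of this aligned linear scan appears below. It searches for the first suitably aligned free register range rather than maintaining an explicit active list, and it ignores predicate registers and temporaries; the Interval layout and names are illustrative, not the rNdN implementation.

#include <algorithm>
#include <cstdio>
#include <vector>

// Toy linear scan over live intervals with aligned register pairs, following
// the steps listed above.
struct Interval {
    int start, end;   // live interval [start, end]
    int width;        // 1 register (32-bit) or 2 aligned registers (64-bit)
    int reg = -1;     // assigned physical register (first of the pair)
};

void linearScan(std::vector<Interval>& intervals, int registerCount = 255)
{
    std::sort(intervals.begin(), intervals.end(),
              [](const Interval& a, const Interval& b) { return a.start < b.start; });

    std::vector<int> busyUntil(registerCount, -1);   // last position each register is live
    for (auto& interval : intervals)
    {
        // Find the first aligned register range whose previous allocation has
        // expired before this interval starts (freeing dead allocations).
        for (int r = 0; r + interval.width <= registerCount; r += interval.width)
        {
            bool free = true;
            for (int i = 0; i < interval.width; ++i)
                free &= (busyUntil[r + i] < interval.start);
            if (free)
            {
                interval.reg = r;
                for (int i = 0; i < interval.width; ++i)
                    busyUntil[r + i] = interval.end;
                break;
            }
        }
    }
}

int main()
{
    std::vector<Interval> intervals = {{0, 4, 1}, {1, 8, 2}, {5, 9, 1}};
    linearScan(intervals);
    for (const auto& i : intervals)
        std::printf("[%d,%d] width %d -> R%d\n", i.start, i.end, i.width, i.reg);
}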

5 Code Generation

Given the register allocation, we then translate from PTX to SASS, generating high-performance code without the need for expensive optimization. In particular, we address control-flow structuring and efficient code generation patterns. While our techniques may be generalized to other domains, we avoid unstructured control-flow and optimize for code templates used by database queries.

5.1 SASS

SASS is a machine-level language designed for NVIDIA GPUs that varies between GPU iterations. Unfortunately, while PTX follows an open source specification, NVIDIA has yet to publish a detailed view of SASS. Multiple homegrown assembler projects have therefore attempted to reverse-engineer the machine code specification (asfermi [Yunqing 2015], MaxAs [Gray 2016], TuringAs [Yan et al. 2020], CuAssembler [cloudcores 2021]), and a recent paper by Hayes et al. presents a systematic approach to determine instructions regardless of architecture [2019]. Microbenchmarks have also been successful at discovering low-level hardware details [Jia et al. 2018, 2019]. These projects are our primary source of low-level details, complemented by NVIDIA binary tools [NVIDIA 2021a] and our own reverse-engineering efforts. Shown in Figure 7 are example SASS programs for the Pascal and Ampere architectures that increment all values in an array.
Fig. 7. SASS example increment function for NVIDIA Pascal (left) and Ampere (right) GPUs.

5.2 Structured Control-flow

Control-flow poses additional challenges for GPU compilers, as the underlying hardware uses SIMD units and threads are not guaranteed to execute the same path. Additionally, the execution is platform dependent and differs between Pascal and Ampere. We therefore limit our implementation to well-defined structured control-flow found in query programs and generate the appropriate templates. Typical if(-else) statements, loops, and unlabelled break are recovered from the CFG using dominator/post-dominator analyses and in future work could be directly generated by the front-end and represented in the IR.

5.2.1 Pascal.

On the Pascal architecture, instructions are issued for an entire warp using a single program counter, even with branching and control-flow. If all threads within a warp take the same path through the program, then standard SIMD units are sufficient. However, if threads within a warp diverge and take separate paths, then the GPU must serialize execution of the branches and manage active threads for each path. Control-flow can be implemented using two approaches: first with predication and second through a divergence stack. The simplest solution, predication (@P prefix), allows an instruction to be issued to the SIMD units but only executed on threads flagged by a mask. This is effective for simple branching like if statements but challenging for nesting and unstructured graphs. To support these more complex cases, a hardware stack keeps track of the divergence points within a warp (branching) and the active threads for each path [Coon et al. 2010, 2012; Nickolls et al. 2011; Mayank et al. 2015; Kothiya 2014]. Note that regardless of the approach, divergent execution paths within a warp are serialized at high cost [Bialas and Strzelecki 2016], while divergence between warps has no added overhead.
Shown in Figure 8 is the code and corresponding progression for an example if-else statement managed using the divergence stack. Each token on the stack conceptually represents a future execution path and contains metadata for its program counter and associated mask. Before the conditional branch is executed, the program thus pushes a SYNC token onto the stack using the SSY instruction, indicating the reconvergence point of the structure as well as the currently active threads. The reconvergence point serves as the continuation point once both branches have been executed. Next, a predicated branch instruction splits threads into two paths and a DIV (divergence) token is pushed containing the location and thread mask of the path not taken. The active mask is also updated with the subset of active threads for the path taken. Execution continues along the taken path until complete (SYNC instruction). The stack is then popped and the program continues along the other path indicated by the popped DIV token. Once both paths have been executed, the program continues from the reconvergence point specified by the SYNC token.
Fig. 8. Example if-else program using the stack to manage thread divergence on the NVIDIA Pascal architecture (inspired by Coon et al. [2010, 2012] and Kothiya [2014]). Assumes four threads per warp (normally 32).
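To make the stack discipline concrete, the short C++ simulation below walks through the token sequence described above for a four-thread warp. It is a conceptual model of the mechanism (labels stand in for program counters), not a description of the hardware implementation.

#include <cstdio>
#include <stack>

// Conceptual simulation of the Pascal divergence stack for a single if-else.
enum class TokenKind { Sync, Div };
struct Token {
    TokenKind kind;
    const char* target;   // reconvergence point (SYNC) or path not taken (DIV)
    unsigned mask;        // threads to activate at that point
};

int main()
{
    unsigned active = 0b1111;       // 4-thread warp, all threads active
    unsigned predicate = 0b0011;    // threads 0-1 take the "then" path
    std::stack<Token> divergence;

    // SSY: push the reconvergence point with the currently active threads.
    divergence.push({TokenKind::Sync, "after_if", active});

    // Predicated branch: push the path not taken, then follow the taken path.
    divergence.push({TokenKind::Div, "else_block", active & ~predicate});
    active &= predicate;
    std::printf("then_block mask=%x\n", active);

    // SYNC at the end of the taken path: pop and resume the other path.
    Token token = divergence.top(); divergence.pop();
    active = token.mask;
    std::printf("%s mask=%x\n", token.target, active);

    // SYNC at the end of the else path: pop the SYNC token and reconverge.
    token = divergence.top(); divergence.pop();
    active = token.mask;
    std::printf("%s mask=%x\n", token.target, active);
}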

5.2.2 Ampere.

More recent architectures like Ampere use Independent Thread Scheduling, storing a separate program counter for each thread [Luke Durant and Stam 2017]. This allows complex control-flow without a dedicated hardware stack but must be used cautiously to ensure high performance. In particular, note that since SIMD units are still used for execution, divergence within a warp causes the same instructions to be issued multiple times if divergent threads arrive at different time points [NVIDIA 2021f]. Two new reconvergence methods have thus been implemented, a WARPSYNC instruction that forces reconvergence of all threads within a warp and a more flexible barrier resource alternative [NVIDIA 2021a, 2021f]. The latter is equivalent to a limited-height divergence stack and replaces the prior SSY/SYNC instructions with their barrier equivalents BSSY/BSYNC. These new instructions operate on barrier resources B0–B15, each of which counts the number of threads signaled using BSYNC and jumps to the continuation point set by BSSY when complete. Unconventional control-flow is thus simplified and can easily reconverge at arbitrary locations.

5.2.3 Patterns.

We define code generation patterns shown in Figure 9 for each control-flow kind, allowing for nested structures. Immediate post-dominators are used for the reconvergence point as they naturally capture the exit of loops and if-else structures [Fung et al. 2007]. In the case of loops, a conditional SYNC instruction updates the thread mask until none are active and the continuation point is then followed. For simplicity, we opt for an equivalent strategy on both architectures, mapping each position in the divergence stack to a distinct barrier resource on Ampere. Note that simple if-else branches may be flattened and implemented directly using predication.
Fig. 9. Structured control-flow SASS code generation patterns. Ampere patterns use barrier resources B0–B15 for each level of nested control-flow replacing the generic B (e.g., B0 for top-level structures).

5.3 Patterns

We use template-based code generation, mapping each PTX operation to its equivalent SASS instruction sequence. Each pattern incorporates power-of-two strength reduction and NVIDIA’s optimizations (e.g., XMAD vs. IMAD for multiplication as shown in Figure 10 [Emmart et al. 2016]). We assume the frontend code generation is efficient, avoiding redundant expressions and dead code. This simplified translation strategy allows us to minimize time spent in instruction selection.
Fig. 10. Code generation for 32-bit unsigned integer multiplication on Pascal (left) and Ampere (right) given by the PTX instruction mul.lo.u32 %d, %a, %b.
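The arithmetic behind this lowering can be checked in a few lines of C++: the low 32 bits of the product are rebuilt from 16×16-bit partial products, which is the form the XMAD (16-bit multiply with 32-bit add) sequence is built from. This models only the math; the actual instruction sequence in Figure 10 additionally uses XMAD modes to fold the shift and accumulate.

#include <cstdint>
#include <cstdio>

// 32-bit multiply built from 16x16-bit partial products, mirroring the XMAD
// lowering on Pascal. The high product aHi*bHi overflows 2^32 and is dropped.
std::uint32_t mul32_via_16bit(std::uint32_t a, std::uint32_t b)
{
    std::uint32_t aLo = a & 0xFFFF, aHi = a >> 16;
    std::uint32_t bLo = b & 0xFFFF, bHi = b >> 16;

    std::uint32_t low = aLo * bLo;               // one XMAD: lo16(a) * lo16(b)
    std::uint32_t mid = aLo * bHi + aHi * bLo;   // cross products (further XMADs)
    return low + (mid << 16);                    // equal to a * b modulo 2^32
}

int main()
{
    std::uint32_t a = 0xDEADBEEF, b = 0x12345678;
    std::printf("%08x %08x\n",
                (unsigned)mul32_via_16bit(a, b), (unsigned)(a * b));  // identical
}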

5.4 Optimization

Given the generated SASS code, we use peephole optimization to remove dead operations (e.g., store values to RZ) and redundant moves (e.g., MOV R0, R0) without resorting to frequent special case handling—a typical compiler approach to optimization. As the code generation patterns already employ algorithmic optimizations, this proves sufficient in practice to generate high-performance code and is a key idea behind rNdN—defining the minimal compilation steps necessary for high performance execution. Optimizations requiring fixed-point analyses such as constant propagation and dead-code elimination were too expensive in our preliminary evaluations.
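As an illustration of how lightweight this pass is, the C++ sketch below removes the two cases named above (self-moves and results discarded into RZ) from a toy instruction list; the instruction model is hypothetical and far simpler than the rNdN representation.

#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

// Minimal peephole pass: drop self-moves (MOV R0, R0) and instructions that
// only write the zero register RZ and have no side effects.
struct Instruction {
    std::string opcode;
    std::string dest;
    std::vector<std::string> sources;
    bool hasSideEffects = false;   // e.g., stores, barrier-setting instructions
};

bool isRedundant(const Instruction& i)
{
    bool selfMove = (i.opcode == "MOV" && i.sources.size() == 1 && i.dest == i.sources[0]);
    bool deadWrite = (i.dest == "RZ" && !i.hasSideEffects);
    return selfMove || deadWrite;
}

void peephole(std::vector<Instruction>& code)
{
    code.erase(std::remove_if(code.begin(), code.end(), isRedundant), code.end());
}

int main()
{
    std::vector<Instruction> code = {
        {"MOV",  "R0", {"R0"}},         // redundant self-move: removed
        {"IADD", "RZ", {"R1", "R2"}},   // result discarded: removed
        {"IADD", "R3", {"R1", "R2"}},   // kept
    };
    peephole(code);
    for (const auto& i : code)
        std::printf("%s %s\n", i.opcode.c_str(), i.dest.c_str());
}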

6 Scheduler

NVIDIA GPUs rely on compile-time scheduling to maximize instruction-level parallelism (ILP) and hide the latency of each operation. We first explore the required GPU scheduling directives, before outlining the properties of an efficient schedule. Last, we describe our instruction scheduling implementation, formalizing and extending the approach taken by MaxAs [Gray 2016].

6.1 Directives

Embedded within the machine code are scheduling directives generated at compile time and used by the warp scheduler during execution. Referred to as SCHI directives, they may be present for groups of three instructions (e.g., Pascal) or embedded directly in the instruction (e.g., Ampere) [Gray 2016; Hayes et al. 2019; Jia et al. 2019]. The directives’ binary layout is shared by both architectures and shown in Figure 11; each directive encodes the following:
Fig. 11. SCHI directives binary layout.
Number of cycles to stall after dispatch;
Yield flag, allowing switching to other warps;
Read/write dependencies;
Wait dependencies;
Register operand cache.
Together these elements ensure that dependencies between instructions are satisfied. In particular, stall counts allow fixed-latency instructions to complete before issuing dependent operations, and barriers handle cases of variable-latency instructions (e.g., memory loads/stores). There are six barrier resources (SB0–SB5) shared for reads and writes, each containing a scoreboard register [Ohannessian Jr et al. 2015]. Issuing an instruction with a read or write barrier increases the scoreboard value, while signaling decreases it. Waiting on a barrier resource with an SCHI directive requires that the corresponding scoreboard register be zero before the instruction is available to dispatch. The directive also contains a yield hint, allowing the scheduler to switch between warps [Gray 2016; Yan et al. 2020], and operand cache flags, which enable a small register operand cache to avoid bank conflicts. Both hints are optional but may improve parallelism in some contexts. Note that as database queries are likely dominated by data loads and stores, the performance impact of scheduling is limited.
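Viewed another way, each instruction carries a small packed record of these fields. The C++ sketch below uses the field widths documented by MaxAs for Maxwell/Pascal (4-bit stall, 1-bit yield, two 3-bit barrier indices, 6-bit wait mask, 4-bit reuse flags); these widths and bit positions are an assumption taken from that reverse-engineering work, not an official specification.

#include <cstdint>
#include <cstdio>

// Per-instruction scheduling directive, packed as documented by MaxAs for
// Maxwell/Pascal. Barrier index 7 means "no barrier".
struct SCHIDirective {
    std::uint8_t stall : 4;   // cycles to stall after dispatch
    std::uint8_t yield : 1;   // allow switching to another warp
    std::uint8_t wrtdb : 3;   // write dependency barrier
    std::uint8_t readb : 3;   // read dependency barrier
    std::uint8_t waitb : 6;   // bitmask of barriers to wait on
    std::uint8_t reuse : 4;   // register operand cache flags

    std::uint32_t pack() const {
        return stall | (yield << 4) | (wrtdb << 5)
             | (readb << 8) | (waitb << 11) | (std::uint32_t(reuse) << 17);
    }
};

int main()
{
    // A load sets write barrier 0 (variable latency); a later consumer waits
    // on barrier 0 through the wait mask before it may dispatch.
    SCHIDirective load{ /*stall=*/1, /*yield=*/0, /*wrtdb=*/0, /*readb=*/7,
                        /*waitb=*/0, /*reuse=*/0 };
    SCHIDirective consumer{ /*stall=*/1, /*yield=*/0, /*wrtdb=*/7, /*readb=*/7,
                            /*waitb=*/0b000001, /*reuse=*/0 };
    std::printf("load=0x%06x consumer=0x%06x\n",
                (unsigned)load.pack(), (unsigned)consumer.pack());
}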

6.2 Instruction Properties

Each instruction is associated with an architecture-specific instruction class of scheduling properties, extending the idea used by MaxAs [Gray 2016] for the older Maxwell architecture. Using microbenchmarks, we collect the following properties for both Pascal and Ampere:
Functional unit;
Depth latency for fixed-latency instructions;
Read/write latencies for variable-latency instructions;
Throughput for the functional unit;
Dual issue and register cache support.
Variable latencies are estimates as they vary between executions, whereas fixed-latency instructions are exact. We derive the functional unit and throughput for each instruction based on the programming manual [NVIDIA 2021d] and hardware diagrams.
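Conceptually, each instruction class is a small record of these properties. The C++ sketch below shows one possible layout; the values in the example are illustrative, with only the 5-cycle IADD latency on Ampere taken from the text (Figure 12).

#include <cstdio>

// Sketch of an architecture-specific instruction class; fields follow the
// properties listed above, but the layout is illustrative, not rNdN's.
struct InstructionClass {
    enum class Unit { Core, DoublePrecision, SpecialFunction, LoadStore };

    Unit unit;            // functional unit that issues the instruction
    int fixedLatency;     // pipeline depth for fixed-latency instructions (0 if variable)
    int readLatency;      // estimated cycles until source registers may be reused
    int writeLatency;     // estimated cycles until the result is available
    int throughput;       // units per partition; determines issue cycles per warp
    bool dualIssue;       // may pair with another instruction (Pascal only)
    bool registerCache;   // supports operand reuse flags
};

// Example: IADD on Ampere is a fixed-latency core instruction (5 cycles, per
// Figure 12); the remaining values are placeholders.
constexpr InstructionClass IADD_Ampere{
    InstructionClass::Unit::Core, 5, 0, 0, /*throughput=*/16, false, true
};

int main()
{
    std::printf("IADD fixed latency: %d cycles\n", IADD_Ampere.fixedLatency);
}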

6.3 Schedule Properties

Instruction scheduling must produce an efficient order of instructions that satisfies dependencies. We therefore consider the following instruction and program properties in our approach:
(1) Fixed-latency instructions must complete;
(2) Variable-latency instructions must set and signal a barrier resource;
(3) Independent instructions should be overlapped; and
(4) Long critical paths should be prioritized.
Note that properties (1) and (2) guarantee correctness, while (3) and (4) improve ILP and minimize unnecessary stalls. We show examples of some properties, along with any required scheduling directives, in Figure 12.
Fig. 12. Instruction scheduling selects the next available instruction to improve ILP and minimize stalls. Note that the IADD fixed latency is architecture dependent (5 on Ampere; 6 on Pascal).
For fixed-pipeline depth instructions such as integer and single-precision, dependent instructions must wait until the prior instruction is complete. In the case of IADD, this means waiting a minimum of six cycles on the Pascal architecture or five on Ampere. To maximize ILP, independent instructions should be issued while waiting on the previous result. The scheduler must also consider the throughput of each functional unit; when the warp size exceeds the number of units, multiple cycles are required to issue the instruction. Although the hardware will insert necessary stalls for throughput [Gray 2016], the scheduler should still consider their effect on performance. The exact number of functional units, throughput, and pipeline depths vary between platforms.
In the case of variable-latency instructions shown in Figure 12(b), the scheduler must insert necessary barriers, ensuring source registers are free for reuse (read dependency), or destination registers are written (write dependency). For an efficient schedule, the scheduler must hide the high latency of memory reads and writes with independent instructions as they are typically among the most expensive operations. Additionally, since the number of barrier resources is limited, an assignment policy is required to avoid excessive sharing and the associated performance degradation. Last, as pictured in Figure 12(c), an efficient schedule should prioritize instructions with a long chain of dependencies so they can be overlapped by independent instructions.

6.4 List Scheduler

We implement list scheduling [Gibbons and Muchnick 1986] at the basic block level, generating a topological order for instructions based on a scheduling heuristic. A similar algorithm is used in MaxAs [Gray 2016], although we extend it with an updated heuristic and support automatic barrier insertion. In addition, we automatically decompose the basic block by producing a separate dependency graph for each schedulable section, sequences of instructions free of control dependencies, and capitalize on long sequences of code generated by branch inlining. Synchronization directives naturally impose constraints on the graph, allowing us to optimize the representation and remove excessive edges. Note that an efficient dependency graph representation is required for list scheduling to avoid the allocation overhead as well as frequent heuristic and state updates. For each schedulable section, we produce an optimized schedule:
(1) Initialize available instructions and heuristics;
(2) While next available instruction:
  (a) Insert dependency barriers;
  (b) Update stall count of previous instruction;
  (c) Update available instructions and their heuristic;
(3) Wait on all active barriers.
As an extension to list scheduling, we maintain two times for each instruction: expected time and available time, splitting the legal time an instruction is available from its expected execution time (e.g., waiting on a barrier). Dependency barriers are automatically inserted for variable-latency dependencies as their completion time is unknown.
Based on the expected value, our scheduling heuristic selects the next available instruction to reduce stall counts and improve ILP, while considering any fixed-latency or variable-latency dependencies. For variable-latency instructions, we approximate the stall based on microbenchmarks. If equal, then we prefer instructions with a longer expected critical path (path to exit) as their effect on execution is higher. Last, in the event of a tie, the line number is used to produce a deterministic order. Scheduling ends by conservatively requiring all instructions be complete, either by stalling for the entire instruction duration (fixed-latency) or through the appropriate barrier (variable-latency).
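A minimal C++ sketch of this selection order is shown below; the Candidate fields mirror the quantities described above (expected time, available time, critical path, and original line number), but the structure and the example values are illustrative rather than rNdN data structures.

#include <algorithm>
#include <cstdio>
#include <vector>

// Candidate instruction ready to be scheduled; fields follow the heuristic
// described above.
struct Candidate {
    int expectedTime;    // expected cycle the instruction can usefully execute
    int availableTime;   // earliest cycle it may legally issue (e.g., barrier waits)
    int criticalPath;    // expected length of its dependency chain to the exit
    int lineNumber;      // original program order, used for determinism
};

// Prefer the smallest expected stall, then the longest critical path,
// then the lowest line number as a deterministic tie-break.
bool betterCandidate(const Candidate& a, const Candidate& b)
{
    if (a.expectedTime != b.expectedTime) return a.expectedTime < b.expectedTime;
    if (a.criticalPath != b.criticalPath) return a.criticalPath > b.criticalPath;
    return a.lineNumber < b.lineNumber;
}

int main()
{
    std::vector<Candidate> ready = {
        {12, 10, 3, 7},   // later expected time: not selected
        { 8,  8, 2, 9},   // tied expected time and critical path
        { 8,  8, 2, 4},   // selected: tie broken by the lower line number
    };
    auto next = std::min_element(ready.begin(), ready.end(), betterCandidate);
    std::printf("selected instruction at line %d\n", next->lineNumber);
}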
Note that as the Pascal architecture has two dispatch units, it can issue two independent instructions for separate functional units each clock cycle. In our workload, performance improvements from dual issue are negligible, likely because query programs are read/write intensive and dual issue is ineffective at hiding this latency. Register reuse is similar, where the cost of reuse-scheduling outweighs improvements. As further discussed in our evaluation, the cost of optimizations such as list scheduling must be carefully balanced against their impact on execution.

6.5 Barriers

Barrier waits inserted in the SCHI directives must wait on all instructions that set the barrier, regardless of the actual dependency. This is problematic for programs with numerous variable-latency instructions, as the number of barrier resources is limited. To circumvent this issue, NVIDIA associates a scoreboard register with each barrier, representing the number of unsignaled dependencies, and provides a DEPBAR instruction that can wait until the count drops below a threshold. During scheduling, the compiler can opt to use these partial barrier instructions if waiting on an earlier dependency. An example use case is shown in Figure 13, where the IADD instruction is only dependent on the first of three loads and would therefore benefit from a partial barrier.
Fig. 13. Counting barriers example, minimizing wait time.
This strategy is effective for hiding high-latency instructions but requires that instructions share a barrier signal in order. For this reason, we reserve counting barriers for load and store instructions, while using SCHI directives for shorter operations with variable latencies. To simplify the logic, we use separate barriers for reads and writes and count the number of interim instructions incrementing the scoreboard for the partial barrier. In contrast, NVIDIA’s patent on software-supported scoreboarding suggests more aggressive barrier sharing for both reads and writes [Ohannessian Jr et al. 2015]. Our list scheduler automatically inserts the appropriate barrier, either counting or SCHI, depending on the instruction dependencies. Note that due to unknown changes to load instruction re-ordering, we have been unable to adapt this strategy directly to Ampere; further investigation is required.
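Because loads signal a shared barrier in order, the partial-wait threshold reduces to counting how many later loads still hold the scoreboard. The small C++ helper below illustrates that counting rule for the Figure 13 example; it is a sketch of the policy, not an rNdN function.

#include <cstdio>

// Threshold for a partial (counting) barrier wait: a consumer of the k-th
// load on a barrier only needs the scoreboard to drop to the number of loads
// issued after it, since signals arrive in order.
int partialWaitThreshold(int loadsSettingBarrier, int dependencyIndex /* 1-based */)
{
    return loadsSettingBarrier - dependencyIndex;
}

int main()
{
    // Figure 13: IADD depends on the first of three loads sharing a barrier,
    // so it may proceed once at most two signals remain outstanding.
    std::printf("wait until at most %d signals outstanding\n",
                partialWaitThreshold(3, 1));
}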

7 Evaluation

In this section, we evaluate our implementation, including a breakdown of execution and compilation phases, comparison against recent open source CPU and GPU systems, and the impact of assembly phases on compilation and execution. Last, we consider the scalability of our approach.

7.1 Methodology

Experimental Setup:

All experiments were performed on Ubuntu 20.04 LTS, equipped with an Intel i7-8700K @ 3.7GHz, 32-GB DDR4 RAM, and an NVIDIA GeForce RTX 3080 (Ampere). Similar trends and conclusions hold throughout our evaluation on an older NVIDIA GeForce GTX 1080 Ti (Pascal), not pictured. We use CUDA 11.4.2 and graphics driver 470.57.02. Experiments are averaged over 10 runs (dropping the highest and lowest) after five warm-up iterations. Comparison systems and their characteristics are listed in Table 1, each using their respective execution plans. ptxas represents the existing r3d3 system using NVIDIA’s PTX compiler APIs.5 Both -O0 and -O3 are evaluated, comparing our performance against both the optimizing and non-optimizing assemblers. r3d3 thus shows the impact from query parallelization, while rNdN illustrates the benefit specific to this work. We include single-threaded CPU comparison points to evaluate the effectiveness of offloading database queries to accelerators even with the additional overhead. Evaluating the impact of multi-core CPUs and considering the tradeoffs is left as future work.
System | Version | Device | Execution
ptxas (r3d3) [Krolik et al. 2021] | Apr. 2022 | GPU | Compiled
OmniSci (previously MapD [Mostak 2014], currently HeavyDB [HEAVY.AI 2023]) | 5.10.2 (Feb. 2022) | GPU | Compiled
BlazingSQL [BlazingSQL Inc. 2021] | 21.08.02 (July 2021) | GPU | RAPIDS.AI
MonetDB [Idreos et al. 2012] | 11.43.9 (Jan. 2022) | CPU | Interpreted
HorsePower [Chen et al. 2018a] | August 2020 | CPU | Compiled
Table 1. Comparison Systems

TPC-H Benchmark:

We use the TPC-H benchmark at scale factor (SF) 1 (corresponding to 1 GB input data) to evaluate the main performance of our approach and scale factors 2/4/8 for scalability (limited by the available GPU memory on our system) [TPC Council 2017]. It consists of 22 SQL queries for a business context and is commonly used for performance evaluation of database systems. HorsePower, r3d3, and rNdN support the entire benchmark suite using query plans generated from HyPer [Neumann 2011]. MonetDB is equally complete. OmniSci does not support three queries: q14, q21, and q22; BlazingSQL requires modified input queries due to parsing requirements6 and does not support four queries: q11, q15, q21, and q22. Note that as data sizes increase, some additional queries may no longer be supported due to memory constraints. As queries execute on substantial data and the output is validated, reasonable correctness is ensured for each compiler phase. Architecture details like fixed-latency stalls were confirmed by microbenchmarks.

7.2 Motivation

Motivating our focus on efficient compilation, we begin by considering the baseline performance of r3d3 using NVIDIA’s assembler on the TPC-H benchmark. Shown in Figure 14 are execution and compilation breakdowns of both the optimizing assembler (-O3) and non-optimizing variant (-O0). Execution is divided into CPU and GPU computation (e.g., kernel and analysis), data caching and intermediate operations (e.g., transfers, resizing), and overhead, while compilation is split into the stage 1 and 2 phases. Stage 1 is provided by the r3d3 system, while stage 2 corresponds to NVIDIA’s PTX compiler API. All times are measured in milliseconds, and note that the right compilation scale is 3× that of the left execution scale with annotations for bars exceeding the plot.
Fig. 14. r3d3 (ptxas) execution and compilation breakdowns at SF 1.
Notably, we observe that compilation time greatly exceeds execution throughout the benchmark, representing the largest cost even with optimizations disabled. In fact, the marginal improvements in compilation time from the non-optimizing backend are offset by significant slowdowns in execution and give no significant end-to-end speedup. Short-running queries would therefore benefit from a more targeted and balanced approach that performs the necessary performance optimizations without the typical overhead—the cornerstone of our approach. Note that the compilation time of q1 is higher with -O0 than with -O3, likely due to the larger code size when unoptimized.

7.3 Execution Breakdown

We next present the breakdown of execution and compilation time using our rNdN system, pictured in Figure 15. As in the motivation, we include data caching and intermediate management, CPU and GPU computation, and overhead for execution, while compilation is divided into two parts, compile and assemble, corresponding to stages 1 and 2 of the pipeline. Stage 1 is provided by the existing r3d3 system, while stage 2 represents our new optimized assembler. All timings are measured in milliseconds with identical scales for execution and compilation.
Fig. 15. rNdN execution and compilation breakdown at SF 1.
Overall, execution time exceeds that of compilation, with a few disproportionate exceptions (q2 and q11) due to the relatively high complexity of the programs compared to their execution. This is a notable improvement over previous query compilers, which typically showed execution time dominated by compilation. We do re-affirm, however, that assembly is the most expensive part of the pipeline. In terms of execution, most computation is GPU-resident, with the exception of ordered string comparisons (e.g., @order in q16) and @substring in q22. High-performance GPU kernels are thus a necessity for best results. As with most GPU programs, data transfer is costly, which highlights the importance of caching.
Shown in Figure 16 is a breakdown of compilation, indicating the relative proportion of each phase in the pipeline. Included are the frontend (stage 1) compiler, and a more detailed view of the backend including the control-flow structuring, register allocator, code generator, instruction scheduler, peephole optimizer and finally the binary assembler. A similar breakdown for NVIDIA’s ptxas, used by r3d3, is not possible as the implementation is closed source.
Fig. 16. rNdN detailed compilation breakdown.
Most queries follow the same pattern, with the backend dominating the frontend. The exception to this, q6, generates a small PTX program with limited control flow and is thus simple to compile. In terms of backend phases, register allocation is the most expensive, requiring computation of live variables using a fixed point analysis. Scheduling, code generation, and control-flow structuring take up the remaining time in roughly equal proportions, while optimization and binary assembly are fast. Backend compilation is thus the primary cost, which we attribute primarily to the relative complexity of low-level PTX over the high-level input query language, HorseIR.

7.4 Performance Comparison

We define three comparison points, evaluating common usage scenarios. First, we evaluate compilation time, measuring the cost of compiling the query before execution. Next, we measure cached execution, assuming a pre-compiled query with GPU-resident data. Together these results show the effectiveness of trading compilation for execution. Last, we show total execution, measuring the query end-to-end including data caching and compilation. This represents cold-start, or situations where the data or query is changing (e.g., parameterization). All graphs show speedups normalized to the performance of each comparison system, with higher results better for our system and “x” indicating the comparison system failed.
Compilation: Starting with compilation time in Figure 17(a), we see significant performance improvements across the board. Importantly, this holds true for both ptxas -O0 and -O3, with 7.6× and 8.7× geomean speedups, respectively. This shows the performance of our approach even against non-optimizing systems. Note that we observed a significant fixed cost with NVIDIA’s assembler on empty programs, which explains the large speedup for q6, a simple query. OmniSci and HorsePower both use compilers designed for offline use and are thus inefficient at runtime compilation. As compilation is the largest cost for many queries, this overhead has a significant impact on end-to-end results.
Fig. 17. Speedup of rNdN compared to baseline systems at SF 1 (higher bar indicates better results for rNdN). Queries that do not execute on the baseline are indicated by “x.”
Execution: Next we consider the cached execution scenario seen in Figure 17(b), measuring only query execution and the performance of the generated code. Compared to ptxas -O3, we achieve 0.97× speedup, with most queries above 90% performance. Using -O0, this speedup increases to 1.7×. rNdN therefore produces code comparable to ptxas -O3 at significantly lower cost than -O0. In relation to the other baseline systems, we see overall speedups with few exceptions. Queries with slowdowns typically have inefficient group computations (q1 and q3) and require further investigation.
Tradeoff: Last, we present the total execution comparison in Figure 17(c). Notably, we see speedups on all queries compared to ptxas with both -O0 and -O3, indicating that rNdN successfully minimizes compilation cost without excessively slowing execution. Interestingly, geomean results for -O0 and -O3 are near identical, indicating that compilation speedup from skipping optimizations is matched by slowdown in execution. Compared to other compiled systems (OmniSci, HorsePower), we see significant performance improvements in all queries, largely due to optimized compilation. Compared to non-compiled systems we also see overall speedup, including the cost of data transfers.7 GPUs are thus effective for end-to-end query optimization, even in cold-start cases.

7.5 Register Allocation

At runtime, GPUs assign thread blocks to multiprocessors, maximizing instruction-level parallelism by saturating each warp scheduler. However, as the number of registers is fixed, register pressure limits the number of thread blocks in flight (i.e., occupancy). We thus evaluate the effectiveness of our allocation by comparing the register usage of rNdN and ptxas for each kernel.
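As a rough illustration of this effect, the C++ snippet below estimates resident warps from per-thread register usage. The 64K-register file and 48-warp cap are assumptions for a recent NVIDIA SM and ignore other occupancy limits (shared memory, block size, allocation granularity).

#include <algorithm>
#include <cstdio>

// Back-of-the-envelope occupancy estimate: with a fixed register file, more
// registers per thread means fewer resident warps per multiprocessor.
int residentWarps(int registersPerThread,
                  int registerFileSize = 65536, int maxWarps = 48)
{
    int warpsByRegisters = registerFileSize / (registersPerThread * 32);
    return std::min(warpsByRegisters, maxWarps);
}

int main()
{
    for (int regs : {32, 64, 100, 255})
        std::printf("%3d registers/thread -> %2d resident warps\n",
                    regs, residentWarps(regs));
}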
As with all register allocation algorithms, linear scan is a heuristic approach, performing better in some cases than others. Overall, we see a 10% geomean increase in register usage compared to ptxas -O3 and a 22% decrease compared to ptxas -O0. As our cached execution is within 5% of NVIDIA’s optimizing assembler, register allocation is likely not the most significant bottleneck, supporting our simplistic approach. Additionally, the number of registers is significantly lower than the limit across the entire benchmark and never requires spilling (no kernel exceeds 100 registers).

7.6 Scheduler

We next consider the impact of our scheduler on execution and compilation, measuring compilation time, kernel performance, and end-to-end evaluation as shown in Figure 18. As a comparison point, we implemented a naive scheduler that maintains the instruction order and does not consider pipelining. This ensures that each instruction completes before the next is dispatched.
Fig. 18. Speedup of list scheduling compared to a naive approach at SF 1 (no pipelining).
Unsurprisingly, we see a significant slowdown (1/0.85×) in compilation time over the naive scheduler, due to the cost of dependency analysis and heuristics. This is offset by improvements in kernel execution, which increases performance by up to ~25%. Note that as we believe kernel cost is dominated by high-latency instructions (e.g., double-precision, load/store) on the critical path, most queries show lower speedup. While we hoped that these improvements would be sufficient, overall speedup is only 0.95×, with most queries showing overall slowdown compared to a naive scheduler. Advanced scheduling is thus most applicable to repeatedly executed queries, or those that have long execution times and can offset the increased overhead. In other cases, it may be advantageous to use naive scheduling for the compilation speedup. We address longer execution in the following section and consider its performance as data sizes increase. Further research may also explore interactions with register allocation and false dependencies.

7.7 Scalability

Scalability is an important concern for database systems, as they are inherently designed for big data. rNdN excels at short-running queries, an area where compilation overhead is a significant barrier to adoption. As data sizes increase, however, and the computation and data transfers become more costly, the fixed compilation overhead we target clearly has less relevance. We therefore extend our analysis to larger scale factors on the TPC-H benchmark and consider the performance of our approach on increasing data sizes within the constraints of a single-GPU system.

7.7.1 Motivation.

We begin our scalability analysis by considering the performance of r3d3 with NVIDIA’s ptxas assembler on the TPC-H benchmark at SFs 2, 4, and 8, corresponding to 2-, 4-, and 8-GB input data, respectively. Beyond this scale, the GPU memory is exceeded on some queries—a concern that requires additional runtime data management. Shown in Figure 19 is a percentage breakdown analysis for end-to-end execution at each scale factor. We can therefore determine the relative importance of each component as data grow, giving direction to future research.
Fig. 19. r3d3 (ptxas -O3) execution and compilation scalability breakdowns across SF 1, 2, 4, and 8.
As in our initial analysis, we observe a significant compilation overhead for all queries, even as data sizes increase and the computation/data transfer costs are correspondingly larger. In particular, the cost of the proprietary assembler (i.e., backend compiler) remains the single largest component for nearly all queries and at all scales. Exceptions include simple queries with large input data compared to computation (q6), and those with expensive execution (q1, q3, and q18). However, even in these cases, compilation is still a noteworthy overhead. A runtime-suitable backend compiler is thus advantageous even for larger input data, provided that execution is reasonably competitive.

7.7.2 rNdN.

Next, we show the percentage breakdowns for our rNdN system on the TPC-H benchmark in Figure 20. As expected, data transfers and computation become increasingly important as the input size grows, corresponding to the trend with ptxas. However, the compilation overhead is now significantly smaller throughout the scale factors, showing that a fast compilation pipeline is important for both short-running queries and those that operate on larger data. Further optimization at higher scales thus rests largely on improvements in data management and the database operations themselves, while smaller data sizes would benefit from additional compiler tuning.
Fig. 20. rNdN execution and compilation scalability breakdowns across SF 1, 2, 4, and 8.

7.7.3 Performance Comparison.

We next compare our performance at SF 8 against other database systems, focusing on the cached and end-to-end evaluations shown in Figure 21. Note that some additional queries may fail on the comparison systems due to memory limitations.
Fig. 21. Speedup of rNdN compared to baseline systems at SF 8 (higher bar indicates better results for rNdN). Queries that do not execute on the baseline are indicated by “x.”
Execution: In terms of cached results, we see competitive performance compared to all baseline systems and geomean speedup on most. On the CPU side, we continue outperforming all comparison points, with larger speedups than at SF 1. This is likely due to the higher-throughput processor used in our approach. On the GPU side, we significantly outperform BlazingSQL, although our performance compared to OmniSci is mixed—some queries show significant speedup (q2, q4, q11, q16, and q20), while others show significant slowdown (q1 and q3). The geomean speedup over OmniSci thus decreases from 3.8× on SF 1 to 2.1× on SF 8. Compared to ptxas, we see improved performance compared to -O0 and competitive performance (0.96×) on -O3 – a similar result to SF 1.
Tradeoff: On end-to-end evaluations, our approach is clearly beneficial, with geomean speedup compared to all other systems despite focusing on compile time rather than algorithmic optimization. This is especially clear for compiled databases and remains true for those that use interpreters. A notable exception is q17 on MonetDB, although our performance was already poor at SF 1. Importantly, we maintain geomean speedups in relation to ptxas despite the small slowdowns in execution—a result owing to the vastly improved compilation time. Although the performance is less impressive than at SF 1, this is to be expected as our work addresses the fixed cost of compilation time. Note that although we show slowdowns on several queries compared to BlazingSQL, their data transfer times have been excluded due to the unrealistic cost.

7.7.4 Scheduler.

At lower scale factors, the cost of list scheduling outweighs the improvements in execution in nearly all queries. We therefore consider its use for larger input data where the execution is significantly longer and compilation overhead less important. Shown in Figure 22 is the performance of list scheduling compared to a naive baseline that does not reorder instructions or consider pipelining. Notice that the overall performance of most queries is now competitive with the naive scheduler despite the increase in cost, a notable improvement over lower scale factors. List scheduling is thus useful in some contexts and should be investigated further. In particular, reductions in cost or a heuristic approach to selecting the best plan may be ideal.
Fig. 22. Speedup of list scheduling compared to a naive approach at SF 8 (no pipelining).

8 Related Work

Query compilation is an effective strategy to improve the performance of SQL queries, removing the interpretive overhead of traditional database systems. Extending this technique to GPU hardware, we adapt well-known compilation techniques into a novel, runtime-optimized system.

8.1 Query Compilation

Despite performance improvements from query compilation, the additional overhead often proves prohibitive in end-to-end evaluation. The OmniSci GPU database [Mostak 2014] and CPU system HyPer [Neumann 2011] both address this issue in part by using lower-level toolchains such as LLVM and code caching strategies. More recently, Umbra [Kersten et al. 2021] and Flounder IR [Funke et al. 2020; Funke and Teubner 2021] have opted for machine-level intermediate representations for even lower overhead in CPU databases. Our work extends this idea to GPUs, similarly addressing the cost behind ahead-of-time compilers by generating machine code, minimizing register allocation cost, and relying on well-defined code generation patterns in the frontend for high performance. Other compiled GPU database systems typically consider the execution and compilation time independently [Breß et al. 2017] or do not evaluate the overhead [Paul et al. 2020].

8.2 Compilers and Assemblers

Backend GPU compilers or assemblers have typically been closed source implementations developed by the hardware manufacturers themselves. This trend has recently been reversed, with AMD (2021, 2022), Intel [Chandrasekhar et al. 2019], and Apple [Maggioni and Chandrasekaran 2017] all releasing information on their LLVM extensions. Our solution targets NVIDIA hardware, a comparatively opaque area besides instruction mnemonics and disassembly tools [NVIDIA 2021a].

8.2.1 NVIDIA Architecture.

Multiple open source assemblers targeting NVIDIA hardware have been developed in recent years, each supporting a collection of compatible GPU families through reverse-engineering (Decuda for G80 [van der Laan 2010], asfermi for Fermi [Yunqing 2015], KeplerAs for Kepler [Zhang et al. 2017], MaxAs for Pascal and Maxwell [Gray 2016], and TuringAs for Turing, Volta, and Ampere [Yan et al. 2020]). Domain-level experts may thus modify existing CUDA binaries or write low-level programs directly in SASS. They are, however, limited by pipelines that perform little optimization and may be designed for specific kernels (e.g., convolutions [Yan et al. 2020]). Our work extends these reverse-engineering efforts and formalizes a complete, optimizing, and runtime-suitable PTX-to-SASS compilation framework for database queries. A similar pipeline is implemented in the Nouveau graphics driver, although we target compute applications in PTX rather than graphics intermediate forms [freedesktop.org 2022].
Concurrently developed, Yan et al. have also proposed a preliminary, open source, LLVM-based implementation for compiling LLVM IR to SASS (supporting Volta, Turing, and Ampere) without the use of NVIDIA’s proprietary pipeline [2022]. Similar to our approach, they implement instruction scheduling and control-flow flattening to improve performance, although we focus on runtime efficiency for just-in-time use and sidestep ahead-of-time frameworks due to their cost.

8.2.2 Register Allocation.

GPU-specific register allocation algorithms are typically designed with both vector and scalar registers in mind, as serialized execution from divergent branches introduces additional scalar dependencies not caught by liveness analysis [Kalra 2015; Chen et al. 2018b]. Consequently, Kalra proposed a graph colouring approach for AMD GPUs that separately allocates scalar values using additional control edges [2015]. Similarly, the Intel production compiler uses an augmentation analysis to add additional constraints to the interference graph [Chen et al. 2018b]. Their approach also minimizes bank conflicts and false dependencies, with variables contained within a basic block allocated using linear scan and the remaining global variables relegated to graph colouring. As our implementation allocates vector registers, we can safely ignore these dependencies. For embedded GPUs with limited register resources, efficient allocation is essential. You and Chen introduce an extension to linear scan, element-based register allocation, that independently considers the live range of each component (i.e., x and y) for vector allocations [You and Chen 2015]. This approach allows more efficient register usage as some vector components may be reused before others.
MaxAs, an assembler for NVIDIA Pascal and Maxwell GPUs, implements a basic register allocator for code without control-flow [Gray 2016]. It focuses on efficient use of register banks, reducing conflicts through its assignment policy and use of register caching. Our register allocator is based on linear scan [Poletto and Sarkar 1999], computing live intervals for the complete control-flow and adapting the allocation to support register-pairs and alignment (similar to what is done in Intel’s graph colouring approach [Chen et al. 2018b]). We therefore support general GPU code, but ignore low-level allocation details like bank conflicts due to the compilation overhead. Other production compilers like Apple’s have discussed the importance of balancing register pressure and occupancy in their approach, while minimizing costly spills [Maggioni and Chandrasekaran 2017]. Although no absolute timings are presented, they indicate that instruction selection is heavier than register allocation—the opposite of our pipeline. Note that other approximations for live variables have been proposed for CPU databases, achieving near linear complexity [Kohn et al. 2018; Neumann 2020]. Other approaches like Flounder further reduce the compilation cost through specific liveness directives in the representation [Funke et al. 2020; Funke and Teubner 2021]. The Nouveau allocation scheme is based mainly on interference graphs and is unlikely to be suitable for runtime use [freedesktop.org 2022].

8.2.3 Instruction Scheduling.

Efficient instruction scheduling balances resource usage against increased parallelism. MaxAs implements list scheduling [Gibbons and Muchnick 1986] for Pascal and Maxwell architectures, optimizing user-defined schedulable sections. Their heuristic is based on (in order): fixed stall counts, dual issue capability, mixing functional units, and the number of dependencies [Gray 2016]. Our approach builds on and formalizes this work, systematically supporting multiple architectures and general control-flow. Additionally, we automatically insert variable-cycle dependency barriers (possibly using scoreboard registers) and use a tuned heuristic that prioritizes the expected stall. The latter is important for our context given the prominence of high-latency memory accesses in database queries. Additionally, while MaxAs supports dual issue and register reuse flags, we omit these optimizations to improve compile time.
The Nouveau driver scheduler is likewise derived from MaxAs, although it does not consider instruction reordering or throughputs [freedesktop.org 2022]. Scheduling is performed on a per-block basis, tracking dependencies and computing stall counts by recording the availability time for each register rather than using an explicit dependency graph. Barriers are inserted automatically (without scoreboarding), and interactions between basic blocks are considered. TuringAs has explored manual optimization using the yield flag and high-latency instruction placement [Yan et al. 2020]. Our use of the yield flag corresponds to their “natural yield strategy”, chosen for its simplicity.
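For intuition, deriving stall counts from per-register availability times rather than an explicit dependency graph can be sketched as follows (invented register names and latencies; this is neither the Nouveau nor the rNdN implementation).

```python
# Sketch: compute the stall count for each instruction in program order by
# recording when each destination register becomes available, instead of
# building an explicit dependency DAG. Latencies are invented for illustration.

def stall_counts(instrs, latency):
    """instrs: list of (dst, srcs, opcode) in final program order."""
    available = {}                       # register -> cycle its value is ready
    cycle, stalls = 0, []
    for dst, srcs, op in instrs:
        # Wait until every source register written earlier is ready.
        needed = max((available.get(r, 0) for r in srcs), default=0)
        stall = max(0, needed - cycle)
        stalls.append(stall)
        cycle += stall + 1               # stall, then issue in one cycle
        available[dst] = cycle + latency[op]
    return stalls

latency = {"LDG": 20, "IMAD": 5, "IADD": 4}
program = [
    ("R0", [], "LDG"),                   # long-latency load into R0
    ("R2", [], "LDG"),                   # independent load, no stall
    ("R4", ["R0"], "IMAD"),              # must wait for R0: large stall
    ("R6", ["R4", "R2"], "IADD"),        # waits for the multiply result
]
print(stall_counts(program, latency))    # [0, 0, 19, 5]
```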
Gong et al. propose TwinKernels, a compile-time scheduling approach that produces two distinct instruction schedules for each kernel [2017]. Warps are assigned to either of the two implementations during execution, reducing contention from high-latency operations. Compile-time trace scheduling has also been investigated, using speculation to move (high-latency) instructions to earlier basic blocks and minimize divergent execution [Jablin et al. 2014]. Synchronization instructions preclude certain reorderings, similar to our notion of schedulable sections. The authors also note that predication may produce longer traces, a technique we likewise use to expose additional reordering opportunities, though they rely on NVIDIA’s pipeline for this optimization.
Other instruction schedulers merge the problem with register allocation, balancing register pressure against increased ILP. Goodman and Hsu proposed both an adaptive solution that uses multiple scheduling heuristics and a DAG-based register allocator that reduces false dependencies [1988]. Shobaki et al. employ a branch-and-bound algorithm, prioritizing occupancy and register usage before selecting the instruction order [2020], a technique later augmented with graph transformations to reduce the search space [Shobaki et al. 2022]. Occupancy, amongst other GPU properties, is also used in the LLVM list scheduler for AMD GPUs [AMD 2021; Shobaki et al. 2020]. Similarly, Intel and Apple both use scheduling heuristics that balance register pressure and ILP [Chandrasekhar et al. 2019; Maggioni and Chandrasekaran 2017].

8.2.4 Control-flow.

Unstructured control-flow is challenging for compiler writers, and structuring it is a complex process in itself. Zhang and D’Hollander introduced hammock graphs, “single-entry, single-exit regions” used to structure control-flow through a set of transformations [2004]. This work has been extended to GPU contexts, some of which require structured control-flow (e.g., AMD IL [2011]) [Domínguez and Kaeli 2013]. Similarly, Wu et al. proposed a transformation-based approach to structuring while reinforcing the importance of the reconvergence point [2011]. Reissmann et al. demonstrated that re-execution of code duplicated during structuring can be avoided by producing tail-controlled loops and well-nested control-flow [2016]. Other strategies to reduce code expansion include linearization through predicated basic blocks [Anantpur and R. 2014] and less stringent methods like reconverging CFGs [Wahlster 2019].
Our approach to control-flow is more limited, requiring a well-structured graph in return for smaller code size and simple immediate post-dominator reconvergence. This differs from Apple’s approach, which supports unstructured control-flow through basic structuring and the usual code duplication [Maggioni and Chandrasekaran 2017]. Apple, Nouveau [freedesktop.org 2022], and rNdN all flatten branch structures using predication, exposing additional scheduling opportunities.
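To illustrate branch flattening through predication, the sketch below (a hypothetical IR and size threshold, not the rNdN code base) converts a short if-then region into instructions guarded by the branch predicate.

```python
# Sketch of if-conversion: a short if-then region is replaced by the same
# instructions guarded with a predicate, removing the branch and exposing
# the whole region to the instruction scheduler. The IR here is invented.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Instr:
    opcode: str
    operands: list
    predicate: Optional[str] = None      # e.g. "@P0" when guarded

def flatten_if_then(cond_pred, then_block, max_size=4):
    """Return a predicated replacement for `if (cond) then_block`, or None
    if the region is too large to be profitably flattened."""
    if len(then_block) > max_size:
        return None                      # keep the branch: predication would
                                         # waste too many issue slots
    return [Instr(i.opcode, i.operands, predicate=cond_pred) for i in then_block]

then_block = [
    Instr("IADD", ["R2", "R2", "R1"]),
    Instr("STG",  ["[R4]", "R2"]),
]
for i in flatten_if_then("@P0", then_block):
    print(i.predicate, i.opcode, *i.operands)
# @P0 IADD R2 R2 R1
# @P0 STG [R4] R2
```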

9 Conclusion

Query compilation is limited by its additional overhead, a problem especially acute in GPU database systems that rely on heavyweight proprietary compilers. Our open source and balanced approach to PTX assembly greatly reduces these costs without overly sacrificing execution speed, resulting in significant overall performance gains. This overcomes a major bottleneck in the practical use of query compilation for GPUs, and remains beneficial even as data sizes scale on single-GPU systems.
Balancing compilation cost against execution time is critical to optimizing end-to-end performance. Additional improvements may be possible through caching or highly optimized implementations of higher-level code transformations. Given the minimal compilation overhead, we are also interested in exploring recompilation and its effect on repeated execution, which would extend our work to parameterized queries and other runtime values. In addition, exploring a broader range of GPU inputs, such as graphics and scientific computation, would validate the correctness and applicability of our approach on a wider set of industry benchmarks. Recent research on HorseIR has also explored integrating user-defined functionality into queries, a direction that may require additional optimization passes beyond those needed for purely relational queries [Chen et al. 2021].

Footnotes

7
Note that BlazingSQL transfer overheads were unrealistic in our testing and are therefore excluded for fairness.

References

[2]
AMD. 2021. GCN Native ISA LLVM Code Generator–ROCm Documentation 1.0.0 documentation. Retrieved from https://rocmdocs.amd.com/en/latest/ROCm_Compiler_SDK/ROCm-Native-ISA.html.
[3]
AMD. 2022. Let’s Build Everything–GPUOpen. Retrieved from https://gpuopen.com/.
[4]
Jayvant Anantpur and Govindarajan R. 2014. Taming control divergence in GPUs through control flow linearization. In Compiler Construction, Albert Cohen (Ed.). Springer, Berlin, 133–153.
[5]
Piotr Bialas and Adam Strzelecki. 2016. Benchmarking the cost of thread divergence in CUDA. In Parallel Processing and Applied Mathematics, Roman Wyrzykowski, Ewa Deelman, Jack Dongarra, Konrad Karczewski, Jacek Kitowski, and Kazimierz Wiatr (Eds.). Springer International, Cham, 570–579.
[6]
BlazingSQL Inc. 2021. BlazingSQL—High Performance SQL Engine on RAPIDS AI. Retrieved from https://blazingsql.com/.
[7]
Sebastian Breß, Bastian Köcher, Henning Funke, Tilmann Rabl, and Volker Markl. 2017. Generating custom code for efficient query execution on heterogeneous processors. VLDB J. 27, 6 (Sept. 2017).
[8]
Preston Briggs, Keith D. Cooper, and Linda Torczon. 1994. Improvements to graph coloring register allocation. ACM Trans. Program. Lang. Syst. 16, 3 (May 1994), 428–455.
[9]
John Burgess. 2019. RTX on—The NVIDIA Turing GPU. Hot Chips Symposium in 2019. Retrieved from https://old.hotchips.org/hc31/HC31_2.12_NVIDIA_final.pdf.
[10]
John Burgess. 2020. RTX On—The NVIDIA Turing GPU. IEEE Micro 40, 2 (2020), 36–44.
[11]
G. J. Chaitin. 1982. Register allocation & spilling via graph coloring. In Proceedings of the SIGPLAN Symposium on Compiler Construction (SIGPLAN’82). Association for Computing Machinery, New York, NY, 98–105.
[12]
Anupama Chandrasekhar, Gang Chen, Po-Yu Chen, Wei-Yu Chen, Junjie Gu, Peng Guo, Shruthi Hebbur Prasanna Kumar, Guei-Yuan Lueh, Pankaj Mistry, Wei Pan, Thomas Raoux, and Konrad Trifunovic. 2019. IGC: The open source intel graphics compiler. In Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization (CGO’19). IEEE Press, 254–265.
[13]
Hanfeng Chen, Joseph Vinish D’silva, Hongji Chen, Bettina Kemme, and Laurie Hendren. 2018a. HorseIR: Bringing array programming languages together with database query processing. In Proceedings of the 14th ACM SIGPLAN International Symposium on Dynamic Languages (DLS 2018). Association for Computing Machinery, New York, NY, 37–49.
[14]
Hanfeng Chen, Joseph Vinish D’silva, Laurie J. Hendren, and Bettina Kemme. 2021. HorsePower: Accelerating database queries for advanced data analytics. In Proceedings of the 24th International Conference on Extending Database Technology (EDBT’21), Yannis Velegrakis, Demetris Zeinalipour-Yazti, Panos K. Chrysanthis, and Francesco Guerra (Eds.). OpenProceedings.org, 361–366.
[15]
Hanfeng Chen, Alexander Krolik, Bettina Kemme, Clark Verbrugge, and Laurie Hendren. 2020. Improving database query performance with automatic fusion. In Proceedings of the 29th International Conference on Compiler Construction (CC’20). Association for Computing Machinery, New York, NY, 63–73.
[16]
Wei-Yu Chen, Guei-Yuan Lueh, Pratik Ashar, Kaiyu Chen, and Buqi Cheng. 2018b. Register allocation for intel processor graphics. In Proceedings of the International Symposium on Code Generation and Optimization (CGO’18). Association for Computing Machinery, New York, NY, 352–364.
[17]
cloudcores. 2021. GitHub—Cloudcores/CuAssembler. Retrieved from https://github.com/cloudcores/CuAssembler/.
[18]
Brett W. Coon, John Erik Lindholm, Peter C. Mills, and John R. Nickolls. 2010. Processing an Indirect Branch Instruction in a SIMD Architecture. Retrieved from https://patents.google.com/patent/US7761697B1/en.
[19]
Brett W. Coon, John R. Nickolls, Lars Nyland, Peter C. Mills, and John Erik Lindholm. 2012. Indirect Function Call Instructions in a Synchronous Parallel Thread Processor. Retrieved from https://patents.google.com/patent/US8312254B2/en.
[20]
Rodrigo Domínguez and David R. Kaeli. 2013. Unstructured control flow in GPGPU. In Proceedings of the IEEE International Symposium on Parallel Distributed Processing, Workshops and Phd Forum. 1194–1202.
[21]
Niall Emmart, Justin Luitjens, Charles Weems, and Cliff Woolley. 2016. Optimizing modular multiplication for NVIDIA’s maxwell GPUs. In Proceedings of the IEEE 23nd Symposium on Computer Arithmetic (ARITH’16). 47–54.
[22]
freedesktop.org. 2022. Mesa–GitLab. Retrieved from https://gitlab.freedesktop.org/mesa/mesa.
[23]
Wilson W. L. Fung, Ivan Sham, George Yuan, and Tor M. Aamodt. 2007. Dynamic warp formation and scheduling for efficient GPU control flow. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 40). IEEE Computer Society, 407–420.
[24]
Henning Funke, Jan Mühlig, and Jens Teubner. 2020. Efficient generation of machine code for query compilers. In Proceedings of the 16th International Workshop on Data Management on New Hardware (DaMoN’20). Association for Computing Machinery, New York, NY, Article 6, 7 pages.
[25]
Henning Funke and Jens Teubner. 2021. Low-latency compilation of SQL queries to machine code. Proc. VLDB Endow. 14, 12 (Jul. 2021), 2691–2694.
[26]
Philip B. Gibbons and Steven S. Muchnick. 1986. Efficient instruction scheduling for a pipelined architecture. In Proceedings of the 1986 SIGPLAN Symposium on Compiler Construction (SIGPLAN’86). Association for Computing Machinery, New York, NY, 11–16.
[27]
Xiang Gong, Zhongliang Chen, Amir Kavyan Ziabari, Rafael Ubal, and David Kaeli. 2017. TwinKernels: An execution model to improve GPU hardware scheduling at compile time. In Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization (CGO’17). 39–49.
[28]
J. R. Goodman and W.-C. Hsu. 1988. Code scheduling and register allocation in large basic blocks. In Proceedings of the ACM International Conference on Supercomputing 25th Anniversary Volume. Association for Computing Machinery, New York, NY, 88–98.
[29]
Scott Gray. 2016. GitHub—NervanaSystems/maxas. Retrieved from https://github.com/NervanaSystems/maxas.
[30]
Ari B. Hayes, Fei Hua, Jin Huang, Yanhao Chen, and Eddy Z. Zhang. 2019. Decoding CUDA binary. In Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization (CGO’19). IEEE Press, 229–241.
[31]
Dong He, Supun C. Nakandala, Dalitso Banda, Rathijit Sen, Karla Saur, Kwanghyun Park, Carlo Curino, Jesús Camacho-Rodríguez, Konstantinos Karanasos, and Matteo Interlandi. 2022. Query processing on tensor computation runtimes. Proc. VLDB Endow. 15, 11 (Sep. 2022), 2811–2825.
[32]
HEAVY.AI. 2023. GitHub—heavyai/heavydb: HeavyDB (formerly OmniSciDB). Retrieved from https://github.com/heavyai/heavydb.
[33]
Yu-Ching Hu, Yuliang Li, and Hung-Wei Tseng. 2022. TCUDB: Accelerating database with tensor processors. In Proceedings of the International Conference on Management of Data (SIGMOD’22). Association for Computing Machinery, New York, NY, 1360–1374.
[34]
Stratos Idreos, Fabian Groffen, Niels Nes, Stefan Manegold, Sjoerd Mullender, and Martin Kersten. 2012. MonetDB: Two decades of research in column-oriented database architectures. IEEE Data Eng. Bull. 35, 1 (2012), 40–45.
[35]
James A. Jablin, Thomas B. Jablin, Onur Mutlu, and Maurice Herlihy. 2014. Warp-aware trace scheduling for GPUs. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT’14). Association for Computing Machinery, New York, NY, 163–174.
[36]
Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele Paolo Scarpazza. 2018. Dissecting the NVIDIA Volta GPU architecture via microbenchmarking. arXiv:1804.06826. Retrieved from http://arxiv.org/abs/1804.06826.
[37]
Zhe Jia, Marco Maggioni, Jeffrey Smith, and Daniele Scarpazza. 2019. Dissecting the NVidia Turing T4 GPU via Microbenchmarking. arxiv:1903.07486. Retrieved from https://arxiv.org/abs/1903.07486.
[38]
Charu Kalra. 2015. Design and Evaluation of Register Allocation on GPUs. Master’s thesis. Northeastern University.
[39]
Timo Kersten, Viktor Leis, and Thomas Neumann. 2021. Tidy tuples and flying start: Fast compilation and fast execution of relational queries in umbra. VLDB J. 30 (2021), 883–905.
[40]
André Kohn, Viktor Leis, and Thomas Neumann. 2018. Adaptive execution of compiled queries. In Proceedings of the IEEE 34th International Conference on Data Engineering (ICDE’18). 197–208.
[41]
Mayank Kothiya. 2014. Understanding the ISA Impact on GPU Architecture. Master’s thesis. North Carolina State University.
[42]
Alexander Krolik, Clark Verbrugge, and Laurie Hendren. 2021. r3d3: Optimized query compilation on GPUs. In Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization (CGO’21). 277–288.
[43]
Serge Lamikhov-Center. 2020. GitHub—serge1/ELFIO. Retrieved from https://github.com/serge1/ELFIO.
[44]
Rubao Lee, Minghong Zhou, Chi Li, Shenggang Hu, Jianping Teng, Dongyang Li, and Xiaodong Zhang. 2021. The art of balance: A rateupDB experience of building a CPU/GPU hybrid database product. Proc. VLDB Endow. 14, 12 (Oct. 2021), 2999–3013.
[45]
Martin Leitner-Ankerl. 2021. martinus/robin-hood-hashing: Fast & Memory Efficient Hashtable Based on Robin Hood Hashing for C++11/14/17/20. Retrieved from https://github.com/martinus/robin-hood-hashing.
[46]
Erik Lindholm, John Nickolls, Stuart Oberman, and John Montrym. 2008. NVIDIA tesla: A unified graphics and computing architecture. IEEE Micro 28, 2 (March 2008), 39–55.
[47]
Luke Durant, Olivier Giroux, Mark Harris, and Nick Stam. 2017. Inside Volta: The World’s Most Advanced Data Center GPU—NVIDIA Developer Blog. Retrieved from https://developer.nvidia.com/blog/inside-volta/.
[48]
Marcello Maggioni and Charu Chandrasekaran. 2017. Apple LLVM GPU Compiler: Embedded Dragons. US LLVM Developers’ Meeting.
[49]
Kothiya Mayank, Hongwen Dai, Jizeng Wei, and Huiyang Zhou. 2015. Analyzing graphics processor unit (GPU) instruction set architectures. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’15). 155–156.
[50]
Todd Mostak. 2014. An Overview of MapD (Massively Parallel Database). Retrieved from http://www.smallake.kr/wp-content/uploads/2014/09/mapd_overview.pdf.
[51]
Thomas Neumann. 2011. Efficiently compiling efficient query plans for modern hardware. Proc. VLDB Endow. 4, 9 (June 2011), 539–550.
[52]
Thomas Neumann. 2020. Database Architects: Linear Time Liveness Analysis. Retrieved from http://databasearchitects.blogspot.com/2020/04/linear-time-liveness-analysis.html.
[53]
John R. Nickolls, Richard Craig Johnson, Robert Steven Glanville, and Guillermo Juan Rozas. 2011. Unanimous Branch Instructions in a Parallel Thread Processor. Retrieved from https://patents.google.com/patent/US20110072248/en.
[56]
NVIDIA. 2021a. CUDA Binary Utilities :: CUDA Toolkit Documentation. Retrieved from https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html.
[57]
NVIDIA. 2021b. CUDA Occupancy Calculator :: CUDA Toolkit Documentation. Retrieved from https://docs.nvidia.com/cuda/cuda-occupancy-calculator/index.html.
[58]
[59]
NVIDIA. 2021d. Programming Guide :: CUDA Toolkit Documentation. Retrieved from https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html.
[60]
NVIDIA. 2021e. PTX ISA :: CUDA Toolkit Documentation. Retrieved from https://docs.nvidia.com/cuda/parallel-thread-execution/index.html.
[61]
NVIDIA. 2021f. Volta Tuning Guide :: CUDA Toolkit Documentation. Retrieved from https://docs.nvidia.com/cuda/volta-tuning-guide/index.html.
[62]
Robert Ohannessian Jr, Michael Alan Fetterman, Olivier Giroux, Jack H. Choquette, Xiaogang Qiu, Shirish Gadre, and Meenaradchagan Vishnu. 2015. System, Method, and Computer Program Product for Implementing Software-based Scoreboarding. Retrieved from https://patents.google.com/patent/US20150220341A1/en.
[63]
Johns Paul, Bingsheng He, Shengliang Lu, and Chiew Tong Lau. 2020. Improving execution efficiency of just-in-time compilation based query processing on GPUs. Proc. VLDB Endow. 14, 2 (Oct. 2020), 202–214.
[64]
Massimiliano Poletto and Vivek Sarkar. 1999. Linear scan register allocation. ACM Trans. Program. Lang. Syst. 21, 5 (Sept. 1999), 895–913.
[65]
Nico Reissmann, Thomas L. Falch, Benjamin A. Bjørnseth, Helge Bahmann, Jan Christian Meyer, and Magnus Jahre. 2016. Efficient control flow restructuring for GPUs. In Proceedings of the International Conference on High Performance Computing Simulation (HPCS’16). 48–57.
[66]
Anil Shanbhag, Samuel Madden, and Xiangyao Yu. 2020. A study of the fundamental performance characteristics of GPUs and CPUs for database analytics. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD’20). Association for Computing Machinery, New York, NY, 1617–1632.
[67]
Ghassan Shobaki, Justin Bassett, Mark Heffernan, and Austin Kerbow. 2022. Graph transformations for register-pressure-aware instruction scheduling. In Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction (CC’22). Association for Computing Machinery, New York, NY, 41–53.
[68]
Ghassan Shobaki, Austin Kerbow, and Stanislav Mekhanoshin. 2020. Optimizing occupancy and ILP on the GPU using a combinatorial approach. In Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization (CGO’20). Association for Computing Machinery, New York, NY, 133–144.
[69]
TPC Council. 2017. TPC Benchmark H.
[70]
Wladimir J. van der Laan. 2010. GitHub—laanwj/decuda. Retrieved from https://github.com/laanwj/decuda.
[71]
Fabian Wahlster. 2019. Implementing SPMD control flow in LLVM using reconverging CFGs. European LLVM Developers’ Meeting.
[72]
Haicheng Wu, Gregory Diamos, Si Li, and Sudhakar Yalamanchili. 2011. Characterization and transformation of unstructured control flow in GPU applications. In Proceedings of the 1st International Workshop on Characterizing Applications for Heterogeneous Exascale Systems.
[73]
Da Yan, Wei Wang, and Xiaowen Chu. 2020. Optimizing batched winograd convolution on GPUs. In Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’20). ACM, New York, NY.
[74]
Da Yan, Wei Wang, and Xiaowen Chu. 2022. An LLVM-based open-source compiler for NVIDIA GPUs. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’22). Association for Computing Machinery, New York, NY, 448–449.
[75]
Yi-Ping You and Szu-Chieh Chen. 2015. Vector-aware register allocation for GPU shader processors. In Proceedings of the International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES’15). 99–108.
[76]
Hou Yunqing. 2015. GitHub—hyqneuron/asfermi. Retrieved from https://github.com/hyqneuron/asfermi/.
[77]
F. Zhang and E. H. D’Hollander. 2004. Using hammock graphs to structure programs. IEEE Trans. Softw. Eng. 30, 4 (2004), 231–245.
[78]
Xiuxia Zhang, Guangming Tan, Shuangbai Xue, Jiajia Li, Keren Zhou, and Mingyu Chen. 2017. Understanding the GPU microarchitecture to achieve bare-metal performance tuning. In Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’17). Association for Computing Machinery, New York, NY, 31–43.
