As modern GPU workloads become larger and more complex, the demand for GPU computational power continues to grow. Traditionally, GPUs have lacked generalized support for data-dependent parallelism and synchronization. In recent years, there have been attempts to introduce more sophisticated synchronization between the kernels of an application to control execution flow and ensure correct outputs. However, coarse-grained synchronization between kernels can significantly reduce GPU utilization. Moreover, in a workload with hundreds or thousands of kernels, the overhead can be significant. Because of the GPU's massively parallel design, data can be split among thread blocks, which allows data dependencies to be managed at a finer granularity, between the thread blocks themselves rather than between the kernels that contain them. In this dissertation, we propose several methods that improve the performance of data-dependent GPU applications in this fashion.
In our first method, Wireframe, we propose a hardware-software solution that enables generalized support for data-dependent parallelism and synchronization. It allows dependencies between the thread blocks of a GPU kernel to be expressed through a global dependency graph, which is sent to the GPU hardware at kernel launch; the hardware then enforces the dependencies in the graph through a dependency-aware thread block scheduler.
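The idea of enforcing a thread block dependency graph can be illustrated with a small software simulation. This is only a sketch under stated assumptions, not Wireframe's hardware scheduler: the names `schedule_thread_blocks` and `wavefront_deps` are hypothetical, blocks are issued one at a time, and the graph is a simple 2D wavefront in which each block waits for its upper and left neighbours.

```python
from collections import deque


def schedule_thread_blocks(num_blocks, deps):
    """Return one valid issue order for the thread blocks (hypothetical helper).

    deps maps a block id to the list of block ids it must wait for.
    """
    pending = [0] * num_blocks                  # unsatisfied parents per block
    children = [[] for _ in range(num_blocks)]  # reverse edges of the graph
    for child, parents in deps.items():
        for parent in parents:
            pending[child] += 1
            children[parent].append(child)

    # Blocks with no parents are ready as soon as the kernel is launched.
    ready = deque(b for b in range(num_blocks) if pending[b] == 0)
    order = []
    while ready:
        block = ready.popleft()                 # "issue" the block
        order.append(block)
        for child in children[block]:           # block done: notify dependents
            pending[child] -= 1
            if pending[child] == 0:
                ready.append(child)

    if len(order) != num_blocks:
        raise ValueError("dependency graph contains a cycle")
    return order


def wavefront_deps(rows, cols):
    """Assumed example graph: block (r, c) waits for (r-1, c) and (r, c-1)."""
    deps = {}
    for r in range(rows):
        for c in range(cols):
            parents = []
            if r > 0:
                parents.append((r - 1) * cols + c)
            if c > 0:
                parents.append(r * cols + c - 1)
            if parents:
                deps[r * cols + c] = parents
    return deps


order = schedule_thread_blocks(4, wavefront_deps(2, 2))
```

In this model a block becomes eligible as soon as its own parents finish, rather than waiting at a kernel-wide barrier, which is the fine-grained behaviour the paragraph above describes.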
Our second method, BlockMaestro, aims to make the process of determining inter-kernel thread block dependencies more transparent to the user by statically analyzing memory access patterns at kernel-launch time. At runtime, BlockMaestro enables kernel launch hiding by launching multiple kernels on the GPU, and it uses a hardware thread block scheduler to schedule thread blocks for execution once their dependencies are satisfied.
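How inter-kernel thread block dependencies can follow from memory access patterns is sketched below. This is a deliberately simplified illustration and not BlockMaestro's actual analysis: it assumes that each thread block touches a single contiguous, half-open address range that is already known, and the function name `inter_kernel_deps` is hypothetical.

```python
def inter_kernel_deps(producer_writes, consumer_reads):
    """Map each consumer block to the producer blocks it must wait for.

    Both arguments are lists indexed by thread block id. Each entry is an
    assumed half-open (start, end) address range touched by that block.
    """
    deps = {}
    for c, (r_start, r_end) in enumerate(consumer_reads):
        deps[c] = [
            p
            for p, (w_start, w_end) in enumerate(producer_writes)
            if w_start < r_end and r_start < w_end  # the two ranges overlap
        ]
    return deps


# Three producer blocks each write 64 addresses; two consumer blocks read them.
writes = [(0, 64), (64, 128), (128, 192)]
reads = [(0, 128), (128, 192)]
deps = inter_kernel_deps(writes, reads)
# deps == {0: [0, 1], 1: [2]}
```

Here consumer block 1 depends only on producer block 2, so it could begin as soon as that block finishes, without waiting for the whole producer kernel to complete.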
In our third method, SEER, we extend our support for data-dependent applications to those with non-static memory accesses, which can only be known at runtime. To address this problem, we use a machine learning model to estimate the memory addresses accessed by the global load and store instructions in a kernel, and we use that information to predict the inter-kernel dependency pattern among the thread blocks performing those accesses in order to improve performance.