Valkyrie: Leveraging Inter-TLB Locality to Enhance GPU Performance

T Baruah, Y Sun, SA Mojumder, JL Abellán… - Proceedings of the …, 2020 - dl.acm.org
Proceedings of the ACM International Conference on Parallel Architectures …, 2020
Programming on a GPU has become considerably easier with the introduction of virtual memory features, which support common pointer-based semantics between the CPU and the GPU. However, supporting virtual memory on a GPU comes with additional costs and overhead, the largest of which stems from address translation. Because a massive number of threads run concurrently on a GPU, the translation lookaside buffers (TLBs) are oversubscribed most of the time. Our investigation into a diverse set of GPU workloads shows that TLB miss rates can be extremely high (up to 99%), which inevitably leads to significant performance degradation due to long-latency page-table walks. Our profiling of TLB-sensitive workloads reveals a high degree of page sharing across the different cores of a GPU. In many applications, a page is accessed in temporal proximity by multiple cores that follow similar memory access patterns. To exploit this inherent sharing in GPU workloads, we propose Valkyrie, which integrates a cooperative TLB prefetching mechanism with an inter-L1-TLB probing scheme to efficiently reduce TLB bottlenecks in GPUs. Our evaluation using a diverse set of GPU workloads shows that Valkyrie achieves an average speedup of 1.95x while adding only modest hardware overhead.
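The intuition behind probing peer L1 TLBs can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the design evaluated in the paper: the `L1TLB` class, the `count_page_walks` function, the LRU replacement policy, the TLB size, and the synthetic trace are all hypothetical, and the model omits the shared L2 TLB and the prefetching mechanism. It only shows how a translation already cached by one core can spare other cores a page-table walk when they touch the same page shortly afterwards.

```python
from collections import OrderedDict


class L1TLB:
    """Hypothetical fully associative L1 TLB with LRU replacement."""

    def __init__(self, entries):
        self.entries = entries
        self.pages = OrderedDict()  # virtual page number -> present

    def lookup(self, vpn):
        """Local lookup; a hit refreshes the page's LRU position."""
        if vpn in self.pages:
            self.pages.move_to_end(vpn)
            return True
        return False

    def contains(self, vpn):
        """Peer probe; does not disturb the owner's LRU order."""
        return vpn in self.pages

    def fill(self, vpn):
        """Install a translation, evicting the LRU entry if full."""
        self.pages[vpn] = True
        self.pages.move_to_end(vpn)
        if len(self.pages) > self.entries:
            self.pages.popitem(last=False)


def count_page_walks(trace, num_cores, entries, probe_peers):
    """Count accesses that must fall back to a page-table walk.

    Assumption: with no L2 TLB modeled, every L1 miss that no peer
    can serve costs one walk.
    """
    tlbs = [L1TLB(entries) for _ in range(num_cores)]
    walks = 0
    for core, vpn in trace:
        if tlbs[core].lookup(vpn):
            continue  # local L1 hit
        peer_hit = probe_peers and any(
            tlb.contains(vpn) for i, tlb in enumerate(tlbs) if i != core
        )
        if not peer_hit:
            walks += 1  # nobody holds the translation
        tlbs[core].fill(vpn)
    return walks


# Synthetic trace: 4 cores touch the same 8 pages in close succession.
trace = [(core, vpn) for vpn in range(8) for core in range(4)]
baseline = count_page_walks(trace, num_cores=4, entries=4, probe_peers=False)
probing = count_page_walks(trace, num_cores=4, entries=4, probe_peers=True)
print(baseline, probing)  # prints "32 8"
```

In this synthetic trace every core touches each page exactly once, so private L1 TLBs never hit and all 32 accesses walk the page table. With peer probing, only the first core to touch each page pays for the walk, which leaves 8 walks. Real workloads share pages far less regularly, so the ratio here says nothing about the speedup reported above.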