There is a newer version of the record available.

Published February 16, 2019 | Version 1.0
Software Open

From Loop Fusion to Kernel Fusion: A Domain-specific Approach to Locality Optimization

  • 1. Friedrich-Alexander University Erlangen-Nürnberg (FAU)

Description

This artifact describes the steps to reproduce the results for CUDA code generation with kernel fusion in Hipacc (an image processing DSL and source-to-source compiler embedded in C++), as presented in the CGO19 paper "From Loop Fusion to Kernel Fusion: A Domain-specific Approach to Locality Optimization". We provide the original binaries as well as the source code to regenerate them; the binaries can be executed on an x86_64 Linux system with CUDA-enabled GPUs. Furthermore, we include two Python scripts to run the application and compute the statistics depicted in Figure 6 of the paper.

Hardware Dependencies: CUDA-enabled GPUs are required. We used three Nvidia cards, as discussed in Section 5.1 of the paper: (a) the GeForce GTX 745 has 384 CUDA cores with a base clock of 1,033 MHz and a 900 MHz memory clock; (b) the GeForce GTX 680 has 1,536 CUDA cores with a base clock of 1,058 MHz and a 3,004 MHz memory clock; (c) the Tesla K20c has 2,496 CUDA cores with a base clock of 706 MHz and a 2,600 MHz memory clock. For all three GPUs, the total amount of shared memory per block is 48 KB and the total number of registers available per block is 65,536. GPUs with similar configurations are expected to produce comparable results.
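Because a fused kernel generally needs more registers and shared memory than either of its parts, the two per-block limits quoted above are the ones worth keeping in mind. The following is a minimal sketch of that arithmetic, not part of the artifact: the function name and the example configurations are hypothetical, and the check ignores hardware details such as register allocation granularity and per-thread register caps.

```python
# Per-block limits as stated in the hardware description above.
SHARED_MEM_PER_BLOCK = 48 * 1024  # bytes (48 KB)
REGISTERS_PER_BLOCK = 65_536


def fits_in_block(threads_per_block, regs_per_thread, shared_mem_bytes):
    """Return True if a hypothetical kernel configuration stays within both limits.

    This is a simplified model: it multiplies threads by registers per thread
    and compares the result, plus the shared memory request, to the limits.
    """
    regs_needed = threads_per_block * regs_per_thread
    return (regs_needed <= REGISTERS_PER_BLOCK
            and shared_mem_bytes <= SHARED_MEM_PER_BLOCK)


if __name__ == "__main__":
    # Hypothetical configurations, not measurements from the paper.
    print(fits_in_block(256, 64, 16 * 1024))    # 16,384 registers, 16 KB
    print(fits_in_block(1024, 128, 16 * 1024))  # 131,072 registers
```

The first configuration stays within both limits; the second exceeds the register budget, which is the kind of pressure a fused kernel can run into.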

Software Dependencies: Clang/LLVM (6.0), with compiler_rt and libcxx for Linux (6.0); CMake (3.4 or later); Git (2.7 or later); Nvidia CUDA Driver (9.0 or later); OpenCV, for producing visual output in the samples.
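Before building, it can help to confirm that the listed tools are present in suitable versions. The sketch below is one possible way to do this and is not shipped with the artifact. It assumes the executables are named `clang`, `cmake`, and `git` and are on `PATH`; it treats Clang as requiring exactly 6.0, since the list gives no "or later" for it. The CUDA driver and OpenCV are not checked here.

```python
import re
import shutil
import subprocess

# (required version, exact match?) per tool; executable names are assumptions.
REQUIRED = {
    "clang": ((6, 0), True),
    "cmake": ((3, 4), False),
    "git": ((2, 7), False),
}


def parse_version(text):
    """Extract the first dotted version number in text as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def satisfies(found, required, exact=False):
    """Compare the leading components of found against required."""
    if found is None:
        return False
    head = found[:len(required)]
    return head == required if exact else head >= required


def report():
    """Print one status line per tool in REQUIRED."""
    for tool, (required, exact) in REQUIRED.items():
        path = shutil.which(tool)
        if path is None:
            print(f"{tool}: not found on PATH")
            continue
        result = subprocess.run([path, "--version"],
                                capture_output=True, text=True)
        found = parse_version(result.stdout)
        status = "ok" if satisfies(found, required, exact) else "check manually"
        print(f"{tool}: found {found}, need {required} -> {status}")


if __name__ == "__main__":
    report()
```

Comparing integer tuples rather than strings avoids the usual pitfall where "3.10" sorts before "3.4".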

Files

Name: artifact.zip
Size: 49.0 MB
Checksum: md5:48ddf62480c8b4c8e5f7a11ee3216df7
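After downloading, the archive can be checked against the MD5 checksum listed for it. The following is a minimal sketch using Python's standard `hashlib` module; the function names are illustrative, and the path `artifact.zip` assumes the file sits in the current directory.

```python
import hashlib

# Checksum as listed for artifact.zip in this record.
EXPECTED_MD5 = "48ddf62480c8b4c8e5f7a11ee3216df7"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # Assumes the archive was saved next to this script.
    print("checksum ok" if verify("artifact.zip") else "checksum mismatch")
```

Reading in chunks keeps memory use constant regardless of archive size. MD5 is adequate for detecting a corrupted download, though not for guarding against deliberate tampering.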