From Loop Fusion to Kernel Fusion: A Domain-specific Approach to Locality Optimization
Creators
- 1. Friedrich-Alexander University Erlangen-Nürnberg (FAU)
Description
This artifact describes the steps to reproduce the results for CUDA code generation with kernel fusion in Hipacc, an image processing DSL and source-to-source compiler embedded in C++, as presented in the CGO 2019 paper "From Loop Fusion to Kernel Fusion: A Domain-specific Approach to Locality Optimization".
Hardware Dependencies: CUDA-enabled GPUs are required. We used three Nvidia cards, as discussed in Section 5.1 of the paper:
- (a) GeForce GTX 745: 384 CUDA cores, 1,033 MHz base clock, 900 MHz memory clock.
- (b) GeForce GTX 680: 1,536 CUDA cores, 1,058 MHz base clock, 3,004 MHz memory clock.
- (c) Tesla K20c: 2,496 CUDA cores, 706 MHz base clock, 2,600 MHz memory clock.
For all three GPUs, the total amount of shared memory per block is 48 KB and the total number of registers available per block is 65,536. GPUs with similar configurations are expected to produce comparable results.
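To judge whether another GPU is "similar" in the sense above, one can compare its per-block limits against the figures listed for the three reference cards. The following is a minimal sketch in Python, not part of the artifact: the names `GpuSpec`, `REFERENCE_GPUS`, `matches_per_block_limits`, and `closest_reference` are hypothetical, and the nearest-core-count heuristic is an assumption made for illustration only. The input values are assumed to be read by the user from the CUDA runtime (the `sharedMemPerBlock` and `regsPerBlock` fields of `cudaDeviceProp`) or from a device query tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuSpec:
    """Figures for one reference card, copied from the hardware list above."""
    name: str
    cuda_cores: int
    base_clock_mhz: int
    memory_clock_mhz: int


# The three cards named in the artifact description.
REFERENCE_GPUS = (
    GpuSpec("GeForce GTX 745", 384, 1033, 900),
    GpuSpec("GeForce GTX 680", 1536, 1058, 3004),
    GpuSpec("Tesla K20c", 2496, 706, 2600),
)

# Per-block limits stated to be identical on all three cards.
SHARED_MEM_PER_BLOCK_BYTES = 48 * 1024  # 48 KB
REGISTERS_PER_BLOCK = 65536


def matches_per_block_limits(shared_mem_bytes: int, registers: int) -> bool:
    """Return True if a device has the same per-block limits as the reference cards."""
    return (shared_mem_bytes == SHARED_MEM_PER_BLOCK_BYTES
            and registers == REGISTERS_PER_BLOCK)


def closest_reference(cuda_cores: int) -> GpuSpec:
    """Illustrative heuristic (an assumption): nearest reference card by core count."""
    return min(REFERENCE_GPUS, key=lambda g: abs(g.cuda_cores - cuda_cores))
```

For example, a device reporting 49,152 bytes of shared memory and 65,536 registers per block passes `matches_per_block_limits`, and `closest_reference` then indicates which of the three cards its results are most reasonably compared with. A match on these limits does not by itself guarantee comparable performance.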
Software Dependencies:
- Clang/LLVM (8.0), with compiler_rt and libcxx for Linux (8.0).
- CMake (3.4 or later).
- Git (2.7 or later).
- Nvidia CUDA Driver (10.0 or later).
- OpenCV, for producing visual output in the samples.
Files
Name | Size | Checksum |
---|---|---|
hipacc-siemens-dev.zip | 11.0 MB | md5:19182e1f61c19551828f65e70504630f |
Additional details
Related works
- Is supplemented by
- Conference paper: 10.1109/CGO.2019.8661176 (DOI)