DOI: 10.1145/2578948.2560692

Reduction Operations in Parallel Loops for GPGPUs

Published: 07 February 2014

Abstract

Manycore accelerators offer the potential of significantly improving the performance of scientific applications by offloading compute-intensive portions of programs to the accelerators. Directive-based programming models such as OpenACC and OpenMP are high-level programming models that let users create applications for accelerators by annotating regions of code for offloading with directives. In these programming models, most of the offloaded kernels are data-parallel loops processing one or more multi-dimensional arrays, and scalar variables are often used in the parallel loop body for reduction operations. Because a reduction carries a loop-carried dependence that prevents straightforward parallelization of the loop, it can have a significant impact on performance if not handled properly.
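
As a minimal sketch of the pattern described above (illustrative only, not code from the paper; the array a and its length n are assumed), a scalar accumulated inside a data-parallel loop can be expressed in OpenACC C as follows:

    /* Illustrative sketch. Every iteration updates the shared scalar 'sum',
     * a loop-carried dependence that would otherwise prevent parallelization;
     * the reduction(+:sum) clause directs the compiler to give each thread a
     * private copy of 'sum' and combine the partial results afterwards. */
    float sum = 0.0f;
    #pragma acc parallel loop reduction(+:sum) copyin(a[0:n])
    for (int i = 0; i < n; i++) {
        sum += a[i];
    }
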
In this paper, we present the design and parallelization of reduction operations in parallel loops for GPGPU accelerators. Using OpenACC as the high-level directive-based programming model, we discuss how reduction operations are parallelized when they appear at each level of the loop nest and thread hierarchy. We present how we map the loops and the parallelized reductions onto single- or multiple-level parallelism of GPGPU architectures. These algorithms have been implemented in the open-source OpenACC compiler OpenUH. We compare our implementation with two commercial OpenACC compilers using test cases and applications, and demonstrate better robustness and competitive performance.
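
To illustrate a reduction appearing at an inner level of a loop nest (again a hedged sketch, not the OpenUH-generated code; the matrix m, its dimensions rows and cols, and the output array rowsum are assumed names), the outer loop can be mapped to gangs and the inner reduction to vector lanes:

    /* Illustrative sketch. The outer loop runs across gangs (thread blocks)
     * and the inner loop across vector lanes (threads); the reduction on 's'
     * must therefore be resolved within a block, e.g. via shared memory,
     * before each row's result is stored. */
    #pragma acc parallel loop gang copyin(m[0:rows*cols]) copyout(rowsum[0:rows])
    for (int i = 0; i < rows; i++) {
        float s = 0.0f;
        #pragma acc loop vector reduction(+:s)
        for (int j = 0; j < cols; j++) {
            s += m[i * cols + j];
        }
        rowsum[i] = s;
    }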

References

[1]
CUDA. http://www.nvidia.com/object/cuda_home_new.html, October 2013.
[2]
OpenACC. http://www.openacc-standard.org, June 2013.
[3]
OpenCL Reduction. http://developer.amd.com/resources/documentation-articles/articles-whitepapers/opencl-optimization-case-study-simple-reductions/, November 2013.
[4]
OpenCL Standard. http://www.khronos.org/opencl, October 2013.
[5]
OpenMP. http://www.openmp.org, October 2013.
[6]
The GNU OpenMP Implementation. http://gcc.gnu.org/onlinedocs/libgomp.pdf, November 2013.
[7]
D. Butenhof. Programming with POSIX (R) Threads. Addison-Wesley Professional, 1997.
[8]
S. Cook. CUDA Programming: A Developer's Guide to Parallel Computing with GPUs. Newnes, 2012.
[9]
R. Dolbeau, S. Bihan, and F. Bodin. HMPP: A Hybrid Multi-core Parallel Programming Environment. In Workshop on General Purpose Processing on Graphics Processing Units (GPGPU 2007), 2007.
[10]
M. Harris. Optimizing Parallel Reduction in CUDA. NVIDIA Developer Technology, 6, 2007.
[11]
T. Komada, S. Miwa, H. Nakamura, and N. Maruyama. Integrating Multi-GPU Execution in an OpenACC Compiler. In ICPP '13: Proceedings of the 42nd International Conference on Parallel Processing, pages 260--269, 2013.
[12]
C. Liao, O. Hernandez, B. M. Chapman, W. Chen, and W. Zheng. OpenUH: An Optimizing, Portable OpenMP Compiler. Concurrency and Computation: Practice and Experience, 19(18):2317--2332, 2007.
[13]
R. Nanjegowda, O. Hernandez, B. Chapman, and H. H. Jin. Scalability Evaluation of Barrier Algorithms for OpenMP. In Evolving OpenMP in an Age of Extreme Parallelism, pages 42--52. Springer, 2009.
[14]
G. Pullan. Cambridge cuda course 25-27 may 2009. http://www.many-core.group.cam.ac.uk/archive/CUDAcourse09/.
[15]
X. Tian, R. Xu, Y. Yan, Z. Yun, S. Chandrasekaran, and B. Chapman. Compiling A High-Level Directive-based Programming Model for Accelerators. In LCPC 2013: The 26th International Workshop on Languages and Compilers for Parallel Computing, 2013.
[16]
N. Whitehead and A. Fit-Florea. Precision & Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs. nVidia technical white paper, 2011.
[17]
R. Xu, S. Chandrasekaran, B. Chapman, and C. F. Eick. Directive-based Programming Models for Scientific Applications-A Comparison. In High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion, pages 1--9. IEEE, 2012.

Published In

PMAM'14: Proceedings of Programming Models and Applications on Multicores and Manycores
February 2014
156 pages
ISBN: 9781450326575
DOI: 10.1145/2578948

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Compiler
  2. OpenACC
  3. OpenUH
  4. Reduction

Qualifiers

  • Tutorial
  • Research
  • Refereed limited

Conference

PPoPP '14

Acceptance Rates

Overall acceptance rate: 53 of 97 submissions, 55%
