DOI: 10.1145/2578948.2560692

Reduction Operations in Parallel Loops for GPGPUs

Published: 07 February 2014

Abstract

Manycore accelerators offer the potential of significantly improving the performance of scientific applications by offloading compute-intensive portions of programs to the accelerators. Directive-based programming models such as OpenACC and OpenMP are high-level programming models that let users create applications for accelerators by annotating regions of code for offloading with directives. In these programming models, most of the offloaded kernels are data-parallel loops processing one or more multi-dimensional arrays, and scalar variables are often used in the parallel loop body for reduction operations. Because a reduction carries a loop-carried dependence that prevents straightforward parallelization of the loop, it can have a significant impact on performance if not handled properly.
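
As a minimal sketch of the pattern described above (illustrative only, not code from the paper; the array a and its length n are assumed), a scalar accumulated inside a data-parallel loop can be expressed in OpenACC C as follows:

    /* Illustrative sketch. Every iteration updates the shared scalar 'sum',
     * a loop-carried dependence that would otherwise prevent parallelization;
     * the reduction(+:sum) clause directs the compiler to give each thread a
     * private copy of 'sum' and combine the partial results afterwards. */
    float sum = 0.0f;
    #pragma acc parallel loop reduction(+:sum) copyin(a[0:n])
    for (int i = 0; i < n; i++) {
        sum += a[i];
    }
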
In this paper, we present the design and parallelization of reduction operations in parallel loops for GPGPU accelerators. Using OpenACC as the high-level directive-based programming model, we discuss how reduction operations are parallelized when they appear at each level of the loop nest and thread hierarchy. We present how we map the loops and the parallelized reductions onto single- or multiple-level parallelism of GPGPU architectures. These algorithms have been implemented in the open-source OpenACC compiler OpenUH. We compare our implementation with two commercial OpenACC compilers using test cases and applications, and demonstrate better robustness and competitive performance.
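
To illustrate a reduction appearing at an inner level of a loop nest (again a hedged sketch, not the OpenUH-generated code; the matrix m, its dimensions rows and cols, and the output array rowsum are assumed names), the outer loop can be mapped to gangs and the inner reduction to vector lanes:

    /* Illustrative sketch. The outer loop runs across gangs (thread blocks)
     * and the inner loop across vector lanes (threads); the reduction on 's'
     * must therefore be resolved within a block, e.g. via shared memory,
     * before each row's result is stored. */
    #pragma acc parallel loop gang copyin(m[0:rows*cols]) copyout(rowsum[0:rows])
    for (int i = 0; i < rows; i++) {
        float s = 0.0f;
        #pragma acc loop vector reduction(+:s)
        for (int j = 0; j < cols; j++) {
            s += m[i * cols + j];
        }
        rowsum[i] = s;
    }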

References

[1]
CUDA. http://www.nvidia.com/object/cuda_home_new.html, October 2013.
[2]
OpenACC. http://www.openacc-standard.org, June 2013.
[3]
OpenCL Reduction. http://developer.amd.com/resources/documentation-articles/articles-whitepapers/opencl-optimization-case-study-simple-reductions/, November 2013.
[4]
OpenCL Standard. http://www.khronos.org/opencl, October 2013.
[5]
OpenMP. http://www.openmp.org, October 2013.
[6]
The GNU OpenMP Implementation. http://gcc.gnu.org/onlinedocs/libgomp.pdf, November 2013.
[7]
D. Butenhof. Programming with POSIX (R) Threads. Addison-Wesley Professional, 1997.
[8]
S. Cook. CUDA Programming: A Developer's Guide to Parallel Computing with GPUs. Newnes, 2012.
[9]
R. Dolbeau, S. Bihan, and F. Bodin. HMPP: A Hybrid Multi-core Parallel Programming Environment. In Workshop on General Purpose Processing on Graphics Processing Units (GPGPU 2007), 2007.
[10]
M. Harris. Optimizing Parallel Reduction in CUDA. NVIDIA Developer Technology, 6, 2007.
[11]
T. Komada, S. Miwa, H. Nakamura, and N. Maruyama. Integrating Multi-GPU Execution in an OpenACC Compiler. In ICPP '13: Proceedings of the 42nd International Conference on Parallel Processing, pages 260--269, 2013.
[12]
C. Liao, O. Hernandez, B. M. Chapman, W. Chen, and W. Zheng. OpenUH: An Optimizing, Portable OpenMP Compiler. Concurrency and Computation: Practice and Experience, 19(18):2317--2332, 2007.
[13]
R. Nanjegowda, O. Hernandez, B. Chapman, and H. H. Jin. Scalability Evaluation of Barrier Algorithms for OpenMP. In Evolving OpenMP in an Age of Extreme Parallelism, pages 42--52. Springer, 2009.
[14]
G. Pullan. Cambridge cuda course 25-27 may 2009. http://www.many-core.group.cam.ac.uk/archive/CUDAcourse09/.
[15]
X. Tian, R. Xu, Y. Yan, Z. Yun, S. Chandrasekaran, and B. Chapman. Compiling A High-Level Directive-based Programming Model for Accelerators. In LCPC 2013: The 26th International Workshop on Languages and Compilers for Parallel Computing, 2013.
[16]
N. Whitehead and A. Fit-Florea. Precision & Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs. nVidia technical white paper, 2011.
[17]
R. Xu, S. Chandrasekaran, B. Chapman, and C. F. Eick. Directive-based Programming Models for Scientific Applications-A Comparison. In High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion, pages 1--9. IEEE, 2012.

Published In

PMAM'14: Proceedings of Programming Models and Applications on Multicores and Manycores
February 2014
156 pages
ISBN: 9781450326575
DOI: 10.1145/2578948

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Compiler
  2. OpenACC
  3. OpenUH
  4. Reduction

Qualifiers

  • Tutorial
  • Research
  • Refereed limited

Conference

PPoPP '14

Acceptance Rates

Overall acceptance rate: 53 of 97 submissions, 55%
