research-article

Open access

DawnCC: Automatic Annotation for Data Parallelism and Offloading

Authors:

Gleison Mendonça,

Breno Guimarães,

Péricles Alves,

Márcio Pereira,

Fernando Magno Quintão Pereira

ACM Transactions on Architecture and Code Optimization (TACO), Volume 14, Issue 2

Article No.: 13, Pages 1 - 25

https://doi.org/10.1145/3084540

Published: 26 May 2017

Abstract

Directive-based programming models, such as OpenACC and OpenMP, allow developers to convert a sequential program into a parallel one with minimal human intervention. However, inserting pragmas into production code is a difficult and error-prone task that often requires familiarity with the target program. This difficulty limits developers' ability to annotate code that they have not written themselves. This article presents a suite of compiler techniques that mitigate this problem. These techniques rely on symbolic range analysis, a well-known static analysis, for two purposes: to populate source code with data transfer primitives, and to disambiguate pointers whose potential aliasing could hinder automatic parallelization. We have implemented these ideas in a tool, DawnCC, which can be used stand-alone or through an online interface. To demonstrate its effectiveness, we show how DawnCC can annotate the programs available in PolyBench without any intervention from users. Such annotations lead to speedups of over 100× on an Nvidia architecture and over 50× on an ARM architecture.
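
To make the two uses of symbolic range analysis concrete, the following C fragment is a minimal hand-written sketch of the kind of annotation the abstract describes. It is not output produced by DawnCC. The function `scale_add`, the helper `ranges_overlap`, and the variable `may_alias` are hypothetical names chosen for illustration. The sketch assumes the analysis has determined that the loop touches `src[0:n]` and `dst[0:n]`; those symbolic bounds size the OpenACC data transfer clauses and also feed a runtime overlap test that guards the offloaded region.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 when the byte ranges covered by
 * p[0..n-1] and q[0..n-1] intersect, 0 otherwise. Addresses are
 * compared as integers so that unrelated pointers can be compared. */
static int ranges_overlap(const float *p, const float *q, size_t n) {
    uintptr_t p_lo = (uintptr_t)p, p_hi = p_lo + n * sizeof(float);
    uintptr_t q_lo = (uintptr_t)q, q_hi = q_lo + n * sizeof(float);
    return p_lo < q_hi && q_lo < p_hi;
}

/* Computes dst[i] += a * src[i] for 0 <= i < n.
 * Assumed symbolic ranges: src is read on [0, n), dst is read and
 * written on [0, n). The ranges give the array sections in the data
 * clauses and the operands of the aliasing check. */
void scale_add(float *dst, const float *src, int n, float a) {
    assert(n >= 0);
    int may_alias = ranges_overlap(dst, src, (size_t)n);

    /* Offload only when the two regions are disjoint; otherwise the
     * if clauses keep the loop on the host with sequential semantics. */
    #pragma acc data copyin(src[0:n]) copy(dst[0:n]) if(!may_alias)
    #pragma acc kernels if(!may_alias)
    for (int i = 0; i < n; i++) {
        dst[i] += a * src[i];
    }
}
```

A compiler without OpenACC support ignores the pragmas and runs the loop sequentially, so the function behaves the same either way. A call such as `scale_add(y, x, 4, 2.0f)` on two separate arrays passes the overlap test, while `scale_add(buf + 1, buf, 4, 1.0f)` is detected as overlapping and stays on the host.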

Index Terms

DawnCC: Automatic Annotation for Data Parallelism and Offloading
1. Computing methodologies
  1. Parallel computing methodologies
2. Software and its engineering
  1. Software notations and tools
    1. Compilers

Published In

ACM Transactions on Architecture and Code Optimization Volume 14, Issue 2

June 2017

259 pages

ISSN:1544-3566

EISSN:1544-3973

DOI:10.1145/3086564

Editor:
Koen De Bosschere
Ghent University

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 26 May 2017

Accepted: 01 April 2017

Revised: 01 March 2017

Received: 01 November 2016

Published in TACO Volume 14, Issue 2

Qualifiers

Research-article
Research
Refereed

Funding Sources

LG Electronics
FAPEMIG, FAPESP, CNPq and CAPES
