DawnCC: Automatic Annotation for Data Parallelism and Offloading

Published: 26 May 2017

Abstract

Directive-based programming models, such as OpenACC and OpenMP, allow developers to convert a sequential program into a parallel one with minimal human intervention. However, inserting pragmas into production code is a difficult and error-prone task that often requires familiarity with the target program. This difficulty restricts the ability of developers to annotate code that they have not written themselves. This article provides a suite of compiler-related methods to mitigate this problem. These techniques rely on symbolic range analysis, a well-known static technique, to achieve two purposes: to populate source code with data transfer primitives and to disambiguate pointers that could hinder automatic parallelization due to aliasing. We have materialized our ideas into a tool, DawnCC, which can be used stand-alone or through an online interface. To demonstrate its effectiveness, we show how DawnCC can annotate the programs available in PolyBench without any intervention from users. Such annotations lead to speedups of over 100× on an Nvidia architecture and over 50× on an ARM architecture.
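
To make the two purposes named in the abstract concrete, the following is a minimal hand-written sketch in C of the kind of annotation it describes: symbolic bounds (here simply `0` to `n`) drive the data transfer clauses, and a runtime overlap test guards the parallel region against aliasing. It is not DawnCC output. The names `no_overlap` and `scale_add` and the exact choice of OpenACC clauses are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 when the ranges [a, a+n) and [b, b+n)
 * do not overlap, 0 otherwise. The bounds play the role of the symbolic
 * ranges that a static analysis would infer for each pointer. */
static int no_overlap(const float *a, const float *b, size_t n) {
    uintptr_t lo_a = (uintptr_t)a, hi_a = (uintptr_t)(a + n);
    uintptr_t lo_b = (uintptr_t)b, hi_b = (uintptr_t)(b + n);
    return hi_a <= lo_b || hi_b <= lo_a;
}

/* Hypothetical kernel: dst[i] += k * src[i] for i in [0, n). */
void scale_add(float *dst, const float *src, float k, int n) {
    assert(dst != NULL && src != NULL);
    if (n <= 0) {
        return;
    }

    /* Pointer disambiguation: offload only if the accessed regions are disjoint. */
    int independent = no_overlap(dst, src, (size_t)n);

    /* Data transfer: src[0:n] is only read, dst[0:n] is read and written.
     * A compiler without OpenACC support ignores these pragmas and runs
     * the loop sequentially on the host. */
    #pragma acc data copyin(src[0:n]) copy(dst[0:n]) if(independent)
    #pragma acc kernels if(independent)
    #pragma acc loop independent
    for (int i = 0; i < n; i++) {
        dst[i] += k * src[i];
    }
}
```

When the two regions overlap, the `if` clauses evaluate to false and the loop falls back to host execution, so the annotated function keeps the behavior of the sequential original in this sketch.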

Published In

ACM Transactions on Architecture and Code Optimization, Volume 14, Issue 2
June 2017
259 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/3086564
Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 26 May 2017
Accepted: 01 April 2017
Revised: 01 March 2017
Received: 01 November 2016
Published in TACO Volume 14, Issue 2

Author Tags

  1. Automatic parallelization
  2. Static analysis

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • LG Electronics
  • FAPEMIG, FAPESP, CNPq and CAPES
