Research Article
Open Access

Static placement of computation on heterogeneous devices

Published: 12 October 2017

Abstract

Heterogeneous architectures characterize today's hardware, ranging from supercomputers to smartphones. Yet, in spite of their importance, programming such systems remains challenging. In particular, it is difficult to map computations onto the different processors of a heterogeneous device. In this paper, we provide a static analysis that mitigates this problem. Our contributions are two-fold: first, we provide a semi-context-sensitive algorithm that analyzes the program's call graph to determine the best processor for each calling context. This algorithm is parameterized by a cost model, which takes into consideration processor characteristics and data-transfer time. Second, we show how to use simulated annealing to calibrate this cost model for a given heterogeneous architecture. We have used these ideas to build Etino, a tool that annotates C programs with OpenACC or OpenMP 4.0 directives. Etino generates code for a CPU-GPU architecture without user intervention. Experiments on classic benchmarks reveal speedups of up to 75x. Moreover, our calibration process avoids slowdowns of up to 720x that trivial parallelization approaches would yield.
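
The listing below is a minimal, hypothetical sketch of the kind of OpenACC annotation a placement tool such as the one described here could emit once its cost model predicts that GPU execution outweighs the data-transfer cost. The saxpy kernel, the array size, and the data clauses are illustrative assumptions, not output taken from Etino.

/* Hypothetical sketch: an OpenACC-annotated loop of the kind a
 * cost-model-driven placement tool could produce.  Not Etino's output. */
#include <stdio.h>
#include <stdlib.h>

void saxpy(int n, float a, const float *x, float *y) {
    /* The copyin/copy clauses make host-to-device transfers explicit,
     * so their cost can be weighed against the expected kernel speedup. */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    int n = 1 << 20;
    float *x = malloc(n * sizeof(float));
    float *y = malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy(n, 2.0f, x, y);
    printf("y[0] = %f\n", y[0]);   /* expect 4.000000 */
    free(x);
    free(y);
    return 0;
}

An OpenACC-capable compiler honors the pragma and offloads the loop to the GPU; other compilers simply ignore it and run the sequential CPU version, which is also the fallback a cost-model-driven tool would keep when transfer time dominates the kernel's runtime.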

Supplementary Material

Auxiliary Archive (oopsla17-oopsla59-aux.zip)

Published In

Proceedings of the ACM on Programming Languages, Volume 1, Issue OOPSLA
October 2017
1786 pages
EISSN: 2475-1421
DOI: 10.1145/3152284
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 October 2017
Published in PACMPL Volume 1, Issue OOPSLA

Author Tags

  1. Compiler
  2. GPUs
  3. Heterogeneous Architecture
  4. Static Analysis

Qualifiers

  • Research-article
