Research Article
Open Access

Static placement of computation on heterogeneous devices

Published: 12 October 2017

Abstract

Heterogeneous architectures characterize today's hardware, ranging from supercomputers to smartphones. Yet, in spite of their importance, programming such systems remains challenging. In particular, it is difficult to map computations onto the different processors of a heterogeneous device. In this paper, we provide a static analysis that mitigates this problem. Our contributions are two-fold: first, we provide a semi-context-sensitive algorithm that analyzes the program's call graph to determine the best processor for each calling context. This algorithm is parameterized by a cost model, which takes into consideration processor characteristics and data-transfer time. Second, we show how to use simulated annealing to calibrate this cost model for a given heterogeneous architecture. We have used these ideas to build Etino, a tool that annotates C programs with OpenACC or OpenMP 4.0 directives. Etino generates code for a CPU-GPU architecture without user intervention. Experiments on classic benchmarks reveal speedups of up to 75x. Moreover, our calibration process avoids slowdowns of up to 720x that trivial parallelization approaches would yield.
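
The listing below is a minimal, hypothetical sketch of the kind of OpenACC annotation a placement tool such as the one described here could emit once its cost model predicts that GPU execution outweighs the data-transfer cost. The saxpy kernel, the array size, and the data clauses are illustrative assumptions, not output taken from Etino.

/* Hypothetical sketch: an OpenACC-annotated loop of the kind a
 * cost-model-driven placement tool could produce.  Not Etino's output. */
#include <stdio.h>
#include <stdlib.h>

void saxpy(int n, float a, const float *x, float *y) {
    /* The copyin/copy clauses make host-to-device transfers explicit,
     * so their cost can be weighed against the expected kernel speedup. */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    int n = 1 << 20;
    float *x = malloc(n * sizeof(float));
    float *y = malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy(n, 2.0f, x, y);
    printf("y[0] = %f\n", y[0]);   /* expect 4.000000 */
    free(x);
    free(y);
    return 0;
}

An OpenACC-capable compiler honors the pragma and offloads the loop to the GPU; other compilers simply ignore it and run the sequential CPU version, which is also the fallback a cost-model-driven tool would keep when transfer time dominates the kernel's runtime.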

Supplementary Material

Auxiliary Archive (oopsla17-oopsla59-aux.zip)

Published In

Proceedings of the ACM on Programming Languages, Volume 1, Issue OOPSLA
October 2017
1786 pages
EISSN: 2475-1421
DOI: 10.1145/3152284
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 October 2017
Published in PACMPL Volume 1, Issue OOPSLA

Author Tags

  1. Compiler
  2. GPUs
  3. Heterogeneous Architecture
  4. Static Analysis

Qualifiers

  • Research-article
