DOI: 10.5555/3019120.3019122
research-article

Towards achieving performance portability using directives for accelerators

Published: 13 November 2016

Abstract

In this paper we explore the performance portability of the directives provided by OpenMP 4 and OpenACC for programming various types of node architectures with attached accelerators, both self-hosted multicore and offload multicore/GPU. Our goal is to examine how successful OpenACC and the newer offload features of OpenMP 4.5 are at moving codes between architectures, how much tuning might be required, and what lessons we can learn from the experience. To do this, we evaluate example algorithms with varying computational intensities, since both compute efficiency and data access efficiency are important for overall application performance. We implement these kernels using the various methods provided by newer OpenACC and OpenMP implementations, and we evaluate their performance on several platforms: X86_64 hosts with attached NVIDIA GPUs, self-hosted Intel Xeon Phi KNL, and an X86_64 host system with Intel Xeon Phi coprocessors. We explain which factors affected performance portability, such as the choice of programming model, its programming style, its availability on different platforms, and how well compilers can optimize for and target multiple platforms.
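To make the comparison concrete, here is a minimal sketch of what expressing one loop in both directive models can look like. It is not taken from the paper: the AXPY kernel and the function names `axpy_omp` and `axpy_acc` are hypothetical, chosen only to show an OpenMP 4.5 `target` construct next to an OpenACC `parallel loop`. Compilers built without OpenMP or OpenACC support typically ignore the pragmas and run the loops serially on the host.

```c
/* Hypothetical AXPY kernel (y = a*x + y); illustration only, not the paper's code. */

/* OpenMP 4.5 offload version: copy x to the device, copy y in and back out. */
void axpy_omp(int n, float a, const float *x, float *y)
{
    #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

/* OpenACC version of the same loop with the equivalent data clauses. */
void axpy_acc(int n, float a, const float *x, float *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Both functions compute the same result. Only the directive and its data-movement clauses differ, which is the kind of difference a portability study of the two models has to account for.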


Published In

WACCPD '16: Proceedings of the Third International Workshop on Accelerator Programming Using Directives
November 2016
94 pages
ISBN: 9781509061525

Publisher

IEEE Press

Conference

SC16

Acceptance Rates

Overall acceptance rate: 7 of 14 submissions (50%)
