New High Performance GPGPU Code Transformation Framework Applied to Large Production Weather Prediction Code

Published: 19 December 2018

Abstract

We introduce “Hybrid Fortran,” a new approach that enables a high-performance GPGPU port of structured-grid Fortran codes. The technique requires only minimal changes to a CPU-targeted codebase, a significant advance in productivity. It has been successfully applied to both the dynamical core and the physical processes of ASUCA, a Japanese mesoscale weather prediction model with more than 150k lines of code. By means of a minimal weather application that resembles ASUCA’s code structure, Hybrid Fortran is compared both to a performance model and to today’s commonly used method, OpenACC. The Hybrid Fortran implementation is shown to deliver the same or better performance than OpenACC, and its performance agrees with the model on both CPU and GPU. In a full-scale production run, using an ASUCA grid with 1581 × 1301 × 58 cells and real-world weather data at 2 km resolution, 24 NVIDIA Tesla P100 GPUs running the Hybrid Fortran–based GPU port are shown to replace more than fifty 18-core Intel Xeon Broadwell E5-2695 v4 CPUs running the reference implementation, an achievement comparable to that of more invasive GPGPU rewrites of other weather models.
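To make the comparison concrete, below is a minimal sketch of the kind of structured-grid kernel such ports deal with, annotated with standard OpenACC directives, the baseline method named above. The subroutine name, array names, and stencil are illustrative assumptions rather than code from ASUCA or the paper; Hybrid Fortran replaces such per-loop directives with its own annotations so that a single codebase can target both CPU and GPU.

```fortran
! Minimal sketch (not taken from ASUCA): a 2-D structured-grid stencil
! written as a plain Fortran loop nest and offloaded with OpenACC,
! the commonly used directive-based baseline named in the abstract.
subroutine smooth(nx, ny, a, b)
  implicit none
  integer, intent(in)    :: nx, ny
  real,    intent(in)    :: a(nx, ny)  ! input field (illustrative)
  real,    intent(inout) :: b(nx, ny)  ! output field; boundaries pass through
  integer :: i, j

  ! Offload the collapsed loop nest to the accelerator. The data
  ! clauses make host/device transfers explicit: a is copied in,
  ! b is copied both ways so its boundary values are preserved.
  !$acc parallel loop collapse(2) copyin(a) copy(b)
  do j = 2, ny - 1
    do i = 2, nx - 1
      b(i, j) = 0.25 * (a(i-1, j) + a(i+1, j) &
                      + a(i, j-1) + a(i, j+1))
    end do
  end do
end subroutine smooth
```

With OpenACC, each such loop nest typically carries its own parallelization and data-movement clauses; the paper’s claim is that Hybrid Fortran reaches the same or better GPU performance with fewer changes to the CPU-targeted code.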



Published In

ACM Transactions on Parallel Computing, Volume 5, Issue 2
June 2018, 113 pages
ISSN: 2329-4949
EISSN: 2329-4957
DOI: 10.1145/3299751
Editor: David Bader

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 December 2018
Accepted: 01 February 2018
Revised: 01 January 2018
Received: 01 July 2017
Published in TOPC Volume 5, Issue 2


Author Tags

  1. CUDA
  2. GPGPU
  3. OpenACC
  4. Fortran
  5. performance models
  6. weather prediction

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Advanced Computation and I/O Methods for Earth-System Simulations (AIMES)
  • Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures (JHPCN)
  • Japan Science and Technology Agency (JST) Core Research of Evolutional Science and Technology (CREST)
  • Scientific Research(S)
  • Highly Productive, High Performance Application Frameworks for Post Peta-scale Computing
  • High Performance Computing Infrastructure (HPCI)
  • KAKENHI
  • Ministry of Education, Culture, Sports, Science and Technology (MEXT) of Japan
  • Software for Exascale Computing (SPPEXA)
