New High Performance GPGPU Code Transformation Framework Applied to Large Production Weather Prediction Code

Published: 19 December 2018

Abstract

We introduce “Hybrid Fortran,” a new approach that enables a high-performance GPGPU port of structured-grid Fortran codes. The technique requires only minimal changes to a CPU-targeted codebase, a significant advance in productivity. It has been successfully applied to both the dynamical core and the physical processes of ASUCA, a Japanese mesoscale weather prediction model with more than 150k lines of code. By means of a minimal weather application that resembles ASUCA’s code structure, Hybrid Fortran is compared both to a performance model and to today’s commonly used method, OpenACC. The Hybrid Fortran implementation is shown to deliver the same or better performance than OpenACC, and its performance agrees with the model on both CPU and GPU. In a full-scale production run, using an ASUCA grid with 1581 × 1301 × 58 cells and real-world weather data at 2 km resolution, 24 NVIDIA Tesla P100 GPUs running the Hybrid Fortran–based GPU port are shown to replace more than fifty 18-core Intel Xeon Broadwell E5-2695 v4 CPUs running the reference implementation, an achievement comparable to that of more invasive GPGPU rewrites of other weather models.
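To make the comparison concrete, below is a minimal sketch of the kind of structured-grid kernel such ports deal with, annotated with standard OpenACC directives, the baseline method named above. The subroutine name, array names, and stencil are illustrative assumptions rather than code from ASUCA or the paper; Hybrid Fortran replaces such per-loop directives with its own annotations so that a single codebase can target both CPU and GPU.

```fortran
! Minimal sketch (not taken from ASUCA): a 2-D structured-grid stencil
! written as a plain Fortran loop nest and offloaded with OpenACC,
! the commonly used directive-based baseline named in the abstract.
subroutine smooth(nx, ny, a, b)
  implicit none
  integer, intent(in)    :: nx, ny
  real,    intent(in)    :: a(nx, ny)  ! input field (illustrative)
  real,    intent(inout) :: b(nx, ny)  ! output field; boundaries pass through
  integer :: i, j

  ! Offload the collapsed loop nest to the accelerator. The data
  ! clauses make host/device transfers explicit: a is copied in,
  ! b is copied both ways so its boundary values are preserved.
  !$acc parallel loop collapse(2) copyin(a) copy(b)
  do j = 2, ny - 1
    do i = 2, nx - 1
      b(i, j) = 0.25 * (a(i-1, j) + a(i+1, j) &
                      + a(i, j-1) + a(i, j+1))
    end do
  end do
end subroutine smooth
```

With OpenACC, each such loop nest typically carries its own parallelization and data-movement clauses; the paper’s claim is that Hybrid Fortran reaches the same or better GPU performance with fewer changes to the CPU-targeted code.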



Published In

ACM Transactions on Parallel Computing, Volume 5, Issue 2
June 2018, 113 pages
ISSN: 2329-4949
EISSN: 2329-4957
DOI: 10.1145/3299751
Editor: David Bader

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 December 2018
Accepted: 01 February 2018
Revised: 01 January 2018
Received: 01 July 2017
Published in TOPC Volume 5, Issue 2


Author Tags

  1. CUDA
  2. GPGPU
  3. OpenACC
  4. Fortran
  5. performance models
  6. weather prediction

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Advanced Computation and I/O Methods for Earth-System Simulations (AIMES)
  • Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures (JHPCN)
  • Japan Science and Technology Agency (JST) Core Research of Evolutional Science and Technology (CREST)
  • Scientific Research(S)
  • Highly Productive, High Performance Application Frameworks for Post Peta-scale Computing
  • High Performance Computing Infrastructure (HPCI)
  • KAKENHI
  • Ministry of Education, Culture, Sports, Science and Technology (MEXT) of Japan
  • Software for Exascale Computing (SPPEXA)
