
Optimization strategies for geophysics models on manycore systems

Published: 01 May 2019

Abstract

Many software tools for geophysics exploration in the oil and gas industry are based on wave propagation simulation. To perform such simulations, state-of-the-art high-performance computing architectures are employed, producing results faster and more accurately with each new generation. The software must evolve to exploit the features of each new design so that performance keeps scaling. Furthermore, it is important to understand the impact of each change applied to the software in order to improve performance as much as possible. In this article, we propose several optimization strategies for a wave propagation model on six architectures: Intel Broadwell, Intel Haswell, Intel Knights Landing, Intel Knights Corner, NVIDIA Pascal, and NVIDIA Kepler. We focus on improving cache memory usage, vectorization, load balancing, portability, and locality in the memory hierarchy. We analyze the hardware impact of the optimizations, providing insights into how each strategy improves performance. The results show that NVIDIA Pascal outperforms the other considered architectures by up to 8.5×.
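Two of the optimizations named in the abstract, cache blocking and vectorization, can be illustrated on the kind of finite-difference wave propagation kernel the article studies. The sketch below is not the authors' code: the grid size, block size, and coefficient are arbitrary assumptions chosen only to show the loop structure.

```c
#include <assert.h>
#include <math.h>
#include <stdlib.h>

#define N  512   /* grid points per dimension (assumed) */
#define BS 64    /* cache-block size, a tuning parameter (assumed) */

/* One time step of a 2D second-order acoustic wave equation,
 *   u_next = 2u - u_prev + coeff * laplacian(u),
 * with loop tiling (cache blocking) over rows so each block of rows
 * stays resident in cache, and a unit-stride inner loop that
 * compilers can auto-vectorize. */
static void wave_step(const float *restrict u, const float *restrict u_prev,
                      float *restrict u_next, float coeff)
{
    for (int bi = 1; bi < N - 1; bi += BS) {
        int bend = bi + BS < N - 1 ? bi + BS : N - 1;
        for (int i = bi; i < bend; i++) {
            for (int j = 1; j < N - 1; j++) {  /* stride-1: vectorizable */
                float lap = u[(i - 1) * N + j] + u[(i + 1) * N + j]
                          + u[i * N + (j - 1)] + u[i * N + (j + 1)]
                          - 4.0f * u[i * N + j];
                u_next[i * N + j] = 2.0f * u[i * N + j]
                                  - u_prev[i * N + j]
                                  + coeff * lap;
            }
        }
    }
}
```

The same tiling idea carries over to GPUs, where a thread block stages its tile of the grid in shared memory before applying the stencil.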


Author biographies
Matheus S Serpa graduated in computer science at the Federal University of Pampa (UNIPAMPA), Brazil, and received his master’s degree at the Federal University of Rio Grande do Sul (UFRGS), Brazil, where he is currently a PhD student. His research focuses on improving the performance of parallel applications.
Eduardo HM Cruz received his master's and PhD degrees at the Federal University of Rio Grande do Sul. He is currently a professor at the Federal Institute of Paraná. His research focuses on improving data locality on parallel architectures.
Matthias Diener received his PhD degree in computer science from the Federal University of Rio Grande do Sul and the TU Berlin. He is currently a postdoctoral researcher at the University of Illinois at Urbana–Champaign and works on optimizing parallel applications for heterogeneous systems.
Arthur M Krause is a computer engineering student at the Federal University of Rio Grande do Sul. He has been a member of the Parallel and Distributed Processing Group since 2016, working as an undergraduate researcher on topics such as compilers, computer architecture, and GPGPU.
Philippe OA Navaux has been a professor at the Federal University of Rio Grande do Sul (UFRGS), Brazil, since 1973. He graduated in electronic engineering from UFRGS in 1970, received his master's degree in applied physics from UFRGS in 1973, and received his PhD in computer science from INPG, France, in 1979. He is the head of the Parallel and Distributed Processing Group at UFRGS and a consultant to various national and international funding agencies, such as DoE (USA), ANR (France), CNPq (Brazil), and CAPES (Brazil), among others.
Jairo Panetta is a professor at ITA/IEC, the Computer Science Division of the Aeronautics Technological Institute, Brazil. He graduated in electronic engineering from ITA in 1974, received his MSc in applied mathematics from ITA in 1978, and received his PhD in computer science from Purdue University in 1985. He works with scientific applications that require deep knowledge of computer architecture and high-performance computing. His main interest is industrial-strength daily production applications. He has been a consultant for decades to the oil and gas industry, the weather forecast center, and the stock exchange in Brazil.
Albert Farrés is an engineer at the Barcelona Supercomputing Center, the Spanish National Supercomputing Institute. He is currently researching and developing seismic imaging tools for the oil industry. He holds an MSc and a bachelor's degree in computer science from the Universitat Politècnica de Catalunya.
Claudia Rosas has been a postdoctoral researcher at the Barcelona Supercomputing Center since 2012. She has experience in performance evaluation of scientific applications running on high-performance computing infrastructures, with participation in international projects such as HPC4E, researching the optimization of seismic imaging tools for new architectures, and as a performance analyst in the POP CoE, the Intel-BSC Exascale Lab, and the PRACE project. She received a PhD (2012) and an MSc (2009) from the Universitat Autònoma de Barcelona, and a BSc (2008) from Universidad Valle del Momboy (Valera, Venezuela).
Mauricio Hanzich is a team leader at the Barcelona Supercomputing Center (BSC), where he has worked since 2007. He graduated in computer science from Universidad Nacional del Comahue in 2002, and received his master's degree (2004) and PhD (2006) in computer science from the Universitat Autònoma de Barcelona (UAB). Currently, he is the head of the high-performance computing software engineering group in the CASE Department at BSC.

Published In

International Journal of High Performance Computing Applications, Volume 33, Issue 3 (May 2019), 140 pages

Publisher

Sage Publications, Inc., United States

Author Tags

  1. Geophysics
  2. manycore systems
  3. vectorization
  4. memory hierarchy
  5. HPC