
Optimization strategies for geophysics models on manycore systems

Published: 01 May 2019

Abstract

Many software tools for geophysics exploration in the oil and gas industry are based on wave propagation simulation. To perform such simulations, state-of-the-art high-performance computing architectures are employed, producing results faster and more accurately with each new generation. The software must evolve to exploit the features of each new design so that performance keeps scaling. Furthermore, it is important to understand the impact of each change applied to the software in order to improve performance as much as possible. In this article, we propose several optimization strategies for a wave propagation model on six architectures: Intel Broadwell, Intel Haswell, Intel Knights Landing, Intel Knights Corner, NVIDIA Pascal, and NVIDIA Kepler. We focus on improving cache memory usage, vectorization, load balancing, portability, and locality in the memory hierarchy. We analyze the hardware impact of the optimizations, providing insights into how each strategy improves performance. The results show that NVIDIA Pascal outperforms the other considered architectures by up to 8.5×.
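Two of the optimizations named in the abstract, cache blocking and vectorization, can be illustrated on the kind of finite-difference wave propagation kernel the article studies. The sketch below is not the authors' code: the grid size, block size, and coefficient are arbitrary assumptions chosen only to show the loop structure.

```c
#include <assert.h>
#include <math.h>
#include <stdlib.h>

#define N  512   /* grid points per dimension (assumed) */
#define BS 64    /* cache-block size, a tuning parameter (assumed) */

/* One time step of a 2D second-order acoustic wave equation,
 *   u_next = 2u - u_prev + coeff * laplacian(u),
 * with loop tiling (cache blocking) over rows so each block of rows
 * stays resident in cache, and a unit-stride inner loop that
 * compilers can auto-vectorize. */
static void wave_step(const float *restrict u, const float *restrict u_prev,
                      float *restrict u_next, float coeff)
{
    for (int bi = 1; bi < N - 1; bi += BS) {
        int bend = bi + BS < N - 1 ? bi + BS : N - 1;
        for (int i = bi; i < bend; i++) {
            for (int j = 1; j < N - 1; j++) {  /* stride-1: vectorizable */
                float lap = u[(i - 1) * N + j] + u[(i + 1) * N + j]
                          + u[i * N + (j - 1)] + u[i * N + (j + 1)]
                          - 4.0f * u[i * N + j];
                u_next[i * N + j] = 2.0f * u[i * N + j]
                                  - u_prev[i * N + j]
                                  + coeff * lap;
            }
        }
    }
}
```

The same tiling idea carries over to GPUs, where a thread block stages its tile of the grid in shared memory before applying the stencil.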


Author biographies
Matheus S Serpa graduated in computer science at the Federal University of Pampa (UNIPAMPA), Brazil, and received his master’s degree at the Federal University of Rio Grande do Sul (UFRGS), Brazil, where he is currently a PhD student. His research focuses on improving the performance of parallel applications.
Eduardo HM Cruz received his master's and PhD degrees at the Federal University of Rio Grande do Sul. He is currently a professor at the Federal Institute of Paraná. His research focuses on improving data locality on parallel architectures.
Matthias Diener received his PhD degree in computer science from the Federal University of Rio Grande do Sul and the TU Berlin. He is currently a postdoctoral researcher at the University of Illinois at Urbana–Champaign and works on optimizing parallel applications for heterogeneous systems.
Arthur M Krause is a computer engineering student at the Federal University of Rio Grande do Sul. He has been a member of the Parallel and Distributed Processing Group since 2016, working as an undergraduate researcher on topics such as compilers, computer architecture, and GPGPU.
Philippe OA Navaux has been a professor at the Federal University of Rio Grande do Sul (UFRGS), Brazil, since 1973. He graduated in electronic engineering from UFRGS in 1970, received his master's degree in applied physics from UFRGS in 1973, and received his PhD in computer science from INPG, France, in 1979. He is the head of the Parallel and Distributed Processing Group at UFRGS and a consultant to various national and international funding agencies, such as DoE (USA), ANR (France), CNPq (Brazil), and CAPES (Brazil), among others.
Jairo Panetta is a professor at ITA/IEC, the Computer Science Division of the Aeronautics Technological Institute, Brazil. He graduated in electronic engineering from ITA in 1974, received his MSc in applied mathematics from ITA in 1978, and received his PhD in computer science from Purdue University in 1985. He works with scientific applications that require deep knowledge of computer architecture and high-performance computing. His main interest is industrial-strength daily production applications. He has been a consultant for decades to the oil and gas industry, the weather forecast center, and the stock exchange in Brazil.
Albert Farrés is an engineer at the Barcelona Supercomputing Center, the Spanish National Supercomputing Institute. He is currently researching and developing seismic imaging tools for the oil industry. He holds an MSc and a bachelor's degree in computer science from the Universitat Politècnica de Catalunya.
Claudia Rosas has been a postdoctoral researcher at the Barcelona Supercomputing Center since 2012. She has experience in performance evaluation of scientific applications running on high-performance computing infrastructures, with participation in international projects such as HPC4E, researching the optimization of seismic imaging tools for new architectures, and as a performance analyst in the POP CoE, the Intel-BSC Exascale Lab, and the PRACE project. She received a PhD (2012) and an MSc (2009) from the Universitat Autònoma de Barcelona, and a BSc (2008) from Universidad Valle del Momboy (Valera, Venezuela).
Mauricio Hanzich is a team leader at the Barcelona Supercomputing Center (BSC), where he has worked since 2007. He graduated in computer science from Universidad Nacional del Comahue in 2002, and received his master's degree (2004) and PhD (2006) in computer science from the Universitat Autònoma de Barcelona (UAB). Currently, he is the head of the high-performance computing software engineering group in the CASE Department at BSC.

Published In

International Journal of High Performance Computing Applications, Volume 33, Issue 3 (May 2019), 140 pages

Publisher

Sage Publications, Inc., United States

Author Tags

  1. Geophysics
  2. manycore systems
  3. vectorization
  4. memory hierarchy
  5. HPC