DOI: 10.1145/3581784.3607074

Optimizing MPI Collectives on Shared Memory Multi-Cores

Published: 11 November 2023

Abstract

Message Passing Interface (MPI) programs often experience performance slowdowns due to collective communication operations such as broadcast and reduction. As modern CPUs integrate more processor cores, it is increasingly common to run multiple MPI processes on a shared-memory machine to exploit hardware parallelism, making it crucial to optimize MPI collective communication for shared-memory execution. However, existing MPI collective implementations on shared-memory systems have two primary drawbacks: extensive redundant data movement when performing reduction collectives, and ineffective use of non-temporal instructions for streamed data processing. To address these limitations, this paper proposes two optimization techniques that minimize data movement and improve the use of non-temporal instructions. We integrated our techniques into the Open MPI library and evaluated them with micro-benchmarks and real-world applications on two multi-core clusters. Experimental results show that our approach significantly outperforms existing techniques, yielding a 1.2--6.4x performance improvement.
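
As background for the second technique mentioned above: non-temporal (streaming) stores write data directly to memory, bypassing the cache hierarchy, which helps when a large buffer is written once and not reused soon. The fragment below is a minimal, generic illustration of that mechanism, not the paper's implementation; the helper name copy_nontemporal and its alignment and size assumptions are hypothetical.

    /* Minimal sketch (not the paper's code): copy a large, write-once buffer
     * with x86 non-temporal stores so the destination bypasses the caches and
     * avoids polluting them. Assumptions (ours): dst is 32-byte aligned and
     * len is a multiple of 32 bytes. Build with, e.g., gcc -O2 -mavx. */
    #include <immintrin.h>
    #include <stddef.h>

    static void copy_nontemporal(void *dst, const void *src, size_t len)
    {
        const __m256i *s = (const __m256i *)src;
        __m256i *d = (__m256i *)dst;

        for (size_t i = 0; i < len / sizeof(__m256i); i++) {
            __m256i v = _mm256_loadu_si256(&s[i]); /* normal (cached) load      */
            _mm256_stream_si256(&d[i], v);         /* streaming store, bypasses */
        }                                          /* the cache hierarchy       */
        _mm_sfence(); /* streaming stores are weakly ordered; fence before the
                         data is consumed by another process or thread */
    }

Whether such streaming stores pay off depends on message size and data reuse; per the abstract, the paper's techniques target making these instructions effective inside shared-memory collectives.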

Supplemental Material

MP4 File: SC23 paper presentation recording for "Optimizing MPI Collectives on Shared Memory Multi-Cores", by Jintao Peng, Jianbin Fang, Jie Liu, Min Xie, Yi Dai, Bo Yang, Shengguo Li and Zheng Wang.

Published In

SC '23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, November 2023, 1428 pages
ISBN: 9798400701092
DOI: 10.1145/3581784

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. MPI
2. collective communication
3. memory access
4. optimization

Qualifiers

• Research-article

Acceptance Rates

Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%
