DOI: 10.1145/3581784.3607074

Optimizing MPI Collectives on Shared Memory Multi-Cores

Published: 11 November 2023

Abstract

Message Passing Interface (MPI) programs often experience performance slowdowns due to collective communication operations such as broadcast and reduction. As modern CPUs integrate more processor cores, it is increasingly common to run multiple MPI processes on a shared-memory machine to exploit hardware parallelism, making it crucial to optimize MPI collective communication for shared-memory execution. However, existing MPI collective implementations on shared-memory systems have two primary drawbacks: extensive redundant data movement when performing reduction collectives, and ineffective use of non-temporal instructions for streamed data processing. To address these limitations, this paper proposes two optimization techniques that minimize data movement and improve the use of non-temporal instructions. We integrated our techniques into the Open MPI library and evaluated them with micro-benchmarks and real-world applications on two multi-core clusters. Experimental results show that our approach significantly outperforms existing techniques, yielding a 1.2--6.4x performance improvement.
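
As background for the second technique mentioned above: non-temporal (streaming) stores write data directly to memory, bypassing the cache hierarchy, which helps when a large buffer is written once and not reused soon. The fragment below is a minimal, generic illustration of that mechanism, not the paper's implementation; the helper name copy_nontemporal and its alignment and size assumptions are hypothetical.

    /* Minimal sketch (not the paper's code): copy a large, write-once buffer
     * with x86 non-temporal stores so the destination bypasses the caches and
     * avoids polluting them. Assumptions (ours): dst is 32-byte aligned and
     * len is a multiple of 32 bytes. Build with, e.g., gcc -O2 -mavx. */
    #include <immintrin.h>
    #include <stddef.h>

    static void copy_nontemporal(void *dst, const void *src, size_t len)
    {
        const __m256i *s = (const __m256i *)src;
        __m256i *d = (__m256i *)dst;

        for (size_t i = 0; i < len / sizeof(__m256i); i++) {
            __m256i v = _mm256_loadu_si256(&s[i]); /* normal (cached) load      */
            _mm256_stream_si256(&d[i], v);         /* streaming store, bypasses */
        }                                          /* the cache hierarchy       */
        _mm_sfence(); /* streaming stores are weakly ordered; fence before the
                         data is consumed by another process or thread */
    }

Whether such streaming stores pay off depends on message size and data reuse; per the abstract, the paper's techniques target making these instructions effective inside shared-memory collectives.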

Supplemental Material

MP4 File: SC23 paper presentation recording for "Optimizing MPI Collectives on Shared Memory Multi-Cores", by Jintao Peng, Jianbin Fang, Jie Liu, Min Xie, Yi Dai, Bo Yang, Shengguo Li and Zheng Wang.

Published In

SC '23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, November 2023, 1428 pages
ISBN: 9798400701092
DOI: 10.1145/3581784

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. MPI
2. collective communication
3. memory access
4. optimization

Qualifiers

• Research-article

Acceptance Rates

Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%
