
Hierarchical redesign of classic MPI reduction algorithms

Published: 01 February 2017

Abstract

Optimization of MPI collective communication operations has been an active research topic since the advent of MPI in the 1990s. Many general and architecture-specific collective algorithms have been proposed and implemented in state-of-the-art MPI implementations. Hierarchical, topology-oblivious transformation of existing communication algorithms has recently been proposed as a promising new approach to the optimization of MPI collective communication algorithms and MPI-based applications. The approach has been successfully applied to the most popular parallel matrix multiplication algorithm, SUMMA, and to state-of-the-art MPI broadcast algorithms, demonstrating significant, in some cases multifold, performance gains, especially on large-scale HPC systems. In this paper, we apply the approach to the optimization of the MPI Reduce and Allreduce operations. Theoretical analysis and experimental results on a cluster of the Grid'5000 platform are presented.
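The core idea behind the hierarchical transformation can be illustrated without MPI itself: split the p processes into groups, reduce inside each group to a group leader, then reduce across the leaders. The following is a minimal, MPI-free Python sketch of that two-level scheme; the function name, the equal-size grouping, and the use of Python lists to stand in for per-process values are illustrative assumptions, not the paper's implementation.

```python
from functools import reduce

def hierarchical_reduce(values, group_size, op=lambda a, b: a + b):
    """Simulate a two-level hierarchical reduction.

    Phase 1: each group of `group_size` "processes" reduces its local
    values to the group leader.  Phase 2: the leaders' partial results
    are reduced to the global root.  With p processes split into p/g
    groups, each phase runs a reduction over far fewer participants
    than one flat p-process reduction, which is where the hierarchical
    transformation gets its savings for latency-bound algorithms.
    """
    # Phase 1: intra-group reduction, one partial result per leader.
    leaders = [reduce(op, values[i:i + group_size])
               for i in range(0, len(values), group_size)]
    # Phase 2: inter-group reduction over the group leaders only.
    return reduce(op, leaders)

# Example: 16 "processes" in groups of 4; the result matches a flat sum.
print(hierarchical_reduce(list(range(16)), 4))  # prints 120, same as sum(range(16))
```

For Allreduce, the same scheme is followed by the reverse traversal (leaders broadcast the final value back into their groups), so the result reaches every process rather than only the root.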


Published In

The Journal of Supercomputing, Volume 73, Issue 2
February 2017
316 pages

Publisher

Kluwer Academic Publishers

United States


Author Tags

  1. Hierarchical MPI
  2. MPI collectives
  3. Reduction
