
A novel MPI reduction algorithm resilient to imbalances in process arrival times

Published: 01 May 2016

Abstract

Reduction algorithms are typically optimized under the assumption that all processes commence the reduction simultaneously. Research on process arrival times has shown that this is rarely the case. Consequently, benchmarking methodologies that consider only balanced arrival times may not give a true picture of real-world algorithm performance. In this paper, we select a subset of four reduction algorithms frequently used by library implementations and evaluate their performance under both balanced and imbalanced process arrival times. The main contribution of this paper is a novel imbalance-robust algorithm that uses prior knowledge of process arrival times to construct reduction schedules. The performance of the selected algorithms was empirically evaluated on a 128-node subset of the Partnership for Advanced Computing in Europe CURIE supercomputer. The reported results show that the new imbalance-robust algorithm consistently outperforms all the selected algorithms whenever the reduction schedule is precomputed. When the cost of schedule construction is included in the total runtime, the new algorithm outperforms the selected algorithms for problem sizes greater than 1 MiB.
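
The abstract states that the proposed algorithm builds reduction schedules from prior knowledge of process arrival times. As an illustration only, the sketch below shows one plausible way such knowledge could drive schedule construction: greedily pairing the two earliest-ready partial results, Huffman-style, so that late arrivers do not stall early ones. The function name build_schedule, the fixed per-step cost, and the greedy pairing rule are assumptions made for this sketch; they are not taken from the paper.

```python
# Illustrative sketch (not the authors' published algorithm): build a binary
# reduction schedule from known per-rank arrival times by repeatedly pairing
# the two partial results that become ready earliest.
import heapq

def build_schedule(arrival_times, step_cost=1.0):
    """Return (schedule, estimated_finish_time, root_rank).

    arrival_times : arrival time of each MPI rank, assumed known in advance.
    step_cost     : assumed fixed cost of one send plus local reduction step.
    schedule      : list of (receiver_rank, sender_rank) pairs in execution order.
    """
    # Heap entries: (time at which this partial result is ready, owning rank).
    heap = [(t, rank) for rank, t in enumerate(arrival_times)]
    heapq.heapify(heap)

    schedule = []
    while len(heap) > 1:
        t_a, a = heapq.heappop(heap)   # earliest-ready partial result
        t_b, b = heapq.heappop(heap)   # second earliest
        # The later of the two gates the step; rank b sends to rank a.
        ready = max(t_a, t_b) + step_cost
        schedule.append((a, b))
        heapq.heappush(heap, (ready, a))

    finish, root = heap[0]
    return schedule, finish, root

if __name__ == "__main__":
    # Hypothetical imbalanced arrival times for 8 ranks.
    arrivals = [0.0, 0.1, 3.0, 0.2, 0.1, 2.5, 0.0, 0.3]
    sched, finish, root = build_schedule(arrivals)
    print("reduction order:", sched)
    print("root rank:", root, "estimated finish time:", finish)
```

In an actual MPI implementation, each (receiver, sender) pair would be realized as point-to-point communication (e.g., MPI_Send/MPI_Recv followed by a local application of the reduction operation); the sketch above only models how arrival-time pre-knowledge might shape the order of those steps.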

    Published In

The Journal of Supercomputing, Volume 72, Issue 5
    May 2016
    380 pages

    Publisher

    Kluwer Academic Publishers

    United States

    Publication History

    Published: 01 May 2016

    Author Tags

    1. Collective operations
    2. Load imbalance
    3. MPI
    4. Process arrival time
    5. Reduction
    6. System noise

    Qualifiers

    • Article
