research-article

Software combining to mitigate multithreaded MPI contention

Authors:

Abdelhalim Amer,

Charles Archer,

Michael Blocksome,

Michael Chuvelev,

Maria Garzaran,

Jeff R. Hammond,

Shintaro Iwasaki,

Kenneth J. Raffenetti,

Mikhail Shiryaev,

Sagar Thapaliya,

Pavan Balaji

ICS '19: Proceedings of the ACM International Conference on Supercomputing

Pages 367 - 379

https://doi.org/10.1145/3330345.3330378

Published: 26 June 2019

Abstract

Efforts to mitigate lock contention from concurrent threaded accesses to MPI have reduced contention through fine-grained locking, avoided locking altogether by offloading communication to dedicated threads, or alleviated negative side effects from contention by using better lock management protocols. The blocking nature of lock-based methods, however, wastes the asynchrony benefits of nonblocking MPI operations, and the offloading model sacrifices CPU resources and incurs unnecessary software offloading overheads under low contention.

We propose new thread safety models, CSync and LockQ, based on software combining, a form of software offloading without the requirement for dedicated threads; a thread holding the lock combines work of threads that failed their lock acquisitions. We demonstrate that CSync, a direct application of software combining, improves scalability but suffers from lack of asynchrony and incurs unnecessary offloading. LockQ alleviates these shortcomings by leveraging MPI semantics to relax synchronization and reduce offloading requirements. We present the implementation, analysis, and evaluation of these models on a modern network fabric and show that LockQ outperforms most existing thread safety models in low- and high-contention regimes.
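
As a rough illustration of the combining idea described above, the following Python sketch lets a thread that wins a try-lock run its own operation and then execute operations published by threads whose try-lock failed. The losing threads wait until their operation completes, which mirrors the blocking behavior the abstract attributes to a direct application of software combining. This is not the paper's CSync or LockQ implementation: `CombiningLock`, `execute`, and `run_demo` are hypothetical names, a shared counter stands in for MPI calls, and error handling is omitted.

```python
import threading
from collections import deque


class _Request:
    """One published operation waiting to be executed by a lock holder."""

    def __init__(self, fn, args):
        self.fn, self.args = fn, args
        self.result = None
        self.done = threading.Event()


class CombiningLock:
    """Toy combining lock (illustrative only, not the paper's design)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = deque()  # append/popleft are thread-safe on a deque

    def _drain(self):
        # Must be called with self._lock held: run every published request.
        while True:
            try:
                req = self._pending.popleft()
            except IndexError:
                return
            req.result = req.fn(*req.args)
            req.done.set()

    def execute(self, fn, *args):
        # Low contention: the try-lock succeeds, so run directly (no offload),
        # then combine any work published by threads that failed meanwhile.
        if self._lock.acquire(blocking=False):
            try:
                result = fn(*args)
                self._drain()
                return result
            finally:
                self._lock.release()
        # Contended: publish the request and wait for a combiner to serve it.
        req = _Request(fn, args)
        self._pending.append(req)
        while not req.done.is_set():
            # The holder may have released before seeing this request,
            # so retry the lock and become the combiner if it is free.
            if self._lock.acquire(blocking=False):
                try:
                    self._drain()
                finally:
                    self._lock.release()
            else:
                req.done.wait(0.0005)
        return req.result


def run_demo(num_threads=8, ops_per_thread=1000):
    lock = CombiningLock()
    state = {"counter": 0}

    def increment():
        # Unsynchronized read-modify-write; safe only under the lock.
        state["counter"] += 1
        return state["counter"]

    def worker():
        for _ in range(ops_per_thread):
            lock.execute(increment)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["counter"]
```

Calling `run_demo(4, 500)` returns 2000, because every increment runs while the lock is held, whichever thread ends up executing it. The wait in the contended path is the synchronization that, per the abstract, LockQ relaxes by leveraging MPI semantics; the sketch does not attempt to model that.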

Index Terms

Software combining to mitigate multithreaded MPI contention
1. Social and professional topics
  1. Professional topics
    1. History of computing
      1. History of programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages

Published In

ICS '19: Proceedings of the ACM International Conference on Supercomputing

June 2019

533 pages

ISBN:9781450360791

DOI:10.1145/3330345

General Chair: Rudolf Eigenmann (University of Delaware)

Program Chairs: Chen Ding (University of Rochester), Sally A. McKee (Clemson University)

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

ICS '19

Sponsor:

SIGARCH

ICS '19: 2019 International Conference on Supercomputing

June 26 - 28, 2019

Phoenix, Arizona

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Bibliometrics

Total Citations: 5
Total Downloads: 238
Downloads (last 12 months): 10
Downloads (last 6 weeks): 0

Reflects downloads up to 24 Dec 2024

Cited By

Zambre R, Chandramowlishwaran A, Wolf F, Shende S, Culhane C, Alam S, Jagode H (2022). Lessons learned on MPI+threads communication. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, 1-16. DOI: 10.5555/3571885.3571987. Online publication date: 13-Nov-2022
https://dl.acm.org/doi/10.5555/3571885.3571987
Fatourou P, Kallimanis N, Kosmas E, Lee J, Agrawal K, Spear M (2022). The performance power of software combining in persistence. Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 337-352. DOI: 10.1145/3503221.3508426. Online publication date: 2-Apr-2022
https://dl.acm.org/doi/10.1145/3503221.3508426
Zambre R, Chandramowlishwaran A (2022). Lessons Learned on MPI+Threads Communication. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, 1-16. DOI: 10.1109/SC41404.2022.00082. Online publication date: Nov-2022
https://doi.org/10.1109/SC41404.2022.00082
Zambre R, Sahasrabudhe D, Zhou H, Berzins M, Chandramowlishwaran A, Balaji P (2021). Logically Parallel Communication for Fast MPI+Threads Applications. IEEE Transactions on Parallel and Distributed Systems 32:12, 3038-3052. DOI: 10.1109/TPDS.2021.3075157. Online publication date: 1-Dec-2021
https://doi.org/10.1109/TPDS.2021.3075157
Zambre R, Chandramowlishwaran A, Balaji P, Ayguadé E, Hwu W, Badia R, Hofstee H (2020). How I learned to stop worrying about user-visible endpoints and love MPI. Proceedings of the 34th ACM International Conference on Supercomputing, 1-13. DOI: 10.1145/3392717.3392773. Online publication date: 29-Jun-2020
https://dl.acm.org/doi/10.1145/3392717.3392773
