DOI: 10.1145/782814.782855
Article

High performance RDMA-based MPI implementation over InfiniBand

Published: 23 June 2003

Abstract

Although the InfiniBand Architecture is relatively new in the high performance computing area, it offers many features that help improve the performance of communication subsystems. One of these features is Remote Direct Memory Access (RDMA) operations. In this paper, we propose a new design of MPI over InfiniBand which brings the benefit of RDMA to not only large messages, but also small and control messages. We also achieve better scalability by exploiting application communication patterns and combining send/receive operations with RDMA operations. Our RDMA-based MPI implementation currently delivers a latency of 6.8 microseconds for small messages and a peak bandwidth of 871 Million Bytes (831 Mega Bytes) per second. Performance evaluation at the MPI level shows that for small messages, our RDMA-based design can reduce the latency by 24%, increase the bandwidth by over 104%, and reduce the host overhead by up to 22%. For large messages, we improve performance by reducing the time for transferring control messages. We have also shown that our new design is beneficial to MPI collective communication and the NAS Parallel Benchmarks.
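The abstract only summarizes the design and the paper's implementation is not reproduced on this page. As a rough illustration of the primitive an RDMA-based eager path of this kind relies on, the sketch below posts a small message into a pre-registered, persistent remote buffer with a single RDMA write and lets the receiver detect arrival by polling a flag byte in host memory. It uses the modern libibverbs API rather than the Mellanox VAPI interface available in 2003, and every name in it (eager_slot, post_eager_rdma_write, the slot layout, the flag-byte convention) is a hypothetical stand-in for illustration, not the paper's actual code.

```c
#include <infiniband/verbs.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout of one small-message ("eager") slot inside a
 * persistent, pre-registered receive buffer.  The flag byte sits at the
 * end of the region actually transferred, so the receiver can poll it to
 * detect that the payload has landed. */
#define EAGER_PAYLOAD_MAX 1024

struct eager_slot {
    uint32_t length;                     /* valid bytes in payload        */
    uint8_t  payload[EAGER_PAYLOAD_MAX]; /* message data                  */
    uint8_t  arrival_flag;               /* polled by the receiver        */
};

/* Sender side: copy the message into a registered local slot and push it
 * to the peer's slot with one RDMA write.  'lkey' and 'rkey' come from
 * ibv_reg_mr() on the local and remote buffers; 'remote_addr' is the
 * remote slot's address, exchanged out of band at connection setup. */
static int post_eager_rdma_write(struct ibv_qp *qp,
                                 struct eager_slot *local_slot,
                                 uint32_t lkey,
                                 uint64_t remote_addr,
                                 uint32_t rkey,
                                 const void *msg, uint32_t len)
{
    struct ibv_sge sge;
    struct ibv_send_wr wr, *bad_wr = NULL;

    if (len > EAGER_PAYLOAD_MAX)
        return -1;                       /* larger messages take another path */

    memcpy(local_slot->payload, msg, len);
    local_slot->length = len;
    local_slot->arrival_flag = 1;

    memset(&sge, 0, sizeof(sge));
    sge.addr   = (uintptr_t)local_slot;
    /* Transfer only up to and including the flag, so the flag is the last
     * byte of the RDMA write. */
    sge.length = offsetof(struct eager_slot, arrival_flag) + 1;
    sge.lkey   = lkey;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = (uintptr_t)local_slot;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE;  /* no receive WQE, no receiver CPU */
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    return ibv_post_send(qp, &wr, &bad_wr);
}

/* Receiver side: an incoming RDMA write generates no completion-queue
 * entry, so arrival is detected by polling the flag in memory. */
static int eager_slot_ready(volatile struct eager_slot *slot)
{
    return slot->arrival_flag != 0;
}
```

This sketch assumes the flag byte, written at the highest address of the RDMA write, becomes visible after the payload, which is what persistent-buffer polling schemes rely on. The paper's actual design is considerably richer (management of the persistent buffers, a hybrid of RDMA and send/receive channels for scalability, and faster control messages for the large-message rendezvous path), none of which is shown here.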

References

[1] F. J. Alfaro, J. L. Sanchez, J. Duato, and C. R. Das. A Strategy to Compute the InfiniBand Arbitration Tables. In Int'l Parallel and Distributed Processing Symposium (IPDPS'02), April 2002.
[2] M. Banikazemi, R. K. Govindaraju, R. Blackmore, and D. K. Panda. MPI-LAPI: An Efficient Implementation of MPI for IBM RS/6000 SP Systems. IEEE Transactions on Parallel and Distributed Systems, pages 1081--1093, October 2001.
[3] E. V. Carrera, S. Rao, L. Iftode, and R. Bianchini. User-level communication in cluster-based servers. In Proceedings of the Eighth Symposium on High-Performance Computer Architecture (HPCA'02), February 2002.
[4] D. E. Culler, R. M. Karp, D. A. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: Towards a realistic model of parallel computation. In Principles and Practice of Parallel Programming, pages 1--12, 1993.
[5] R. Dimitrov and A. Skjellum. An Efficient MPI Implementation for Virtual Interface (VI) Architecture-Enabled Cluster Computing. http://www.mpi-softtech.com/publications/, 1998.
[6] D. Dunning, G. Regnier, G. McAlpine, D. Cameron, B. Shubert, F. Berry, A. Merritt, E. Gronke, and C. Dodd. The Virtual Interface Architecture. IEEE Micro, pages 66--76, March/April 1998.
[7] W. Gropp, E. Lusk, N. Doss, and A. Skjellum. A high-performance, portable implementation of the MPI message passing interface standard. Parallel Computing, 22(6):789--828, 1996.
[8] H. Tezuka, F. O'Carroll, A. Hori, and Y. Ishikawa. Pin-down Cache: A Virtual Memory Management Technique for Zero-copy Communication. In Proceedings of the 12th International Parallel Processing Symposium, 1998.
[9] P. Husbands and J. C. Hoe. MPI-StarT: Delivering Network Performance to Numerical Applications. In Proceedings of Supercomputing, 1998.
[10] InfiniBand Trade Association. InfiniBand Architecture Specification, Release 1.0, October 24, 2000.
[11] Lawrence Livermore National Laboratory. MVICH: MPI for Virtual Interface Architecture, August 2001.
[12] J. Liu, J. Wu, S. P. Kini, D. Buntinas, W. Yu, B. Chandrasekaran, R. Noronha, P. Wyckoff, and D. K. Panda. MPI over InfiniBand: Early Experiences. Technical Report OSU-CISRC-10/02-TR25, Computer and Information Science, The Ohio State University, January 2003.
[13] K. Magoutis, S. Addetia, A. Fedorova, M. Seltzer, J. Chase, A. Gallatin, R. Kisley, R. Wickremesinghe, and E. Gabber. Structure and performance of the Direct Access File System. In Proceedings of the USENIX 2002 Annual Technical Conference, Monterey, CA, pages 1--14, June 2002.
[14] R. Martin, A. Vahdat, D. Culler, and T. Anderson. Effects of Communication Latency, Overhead, and Bandwidth in a Cluster Architecture. In Proceedings of the International Symposium on Computer Architecture, 1997.
[15] Mellanox Technologies. Mellanox InfiniBand InfiniHost Adapters, July 2002.
[16] NASA. NAS Parallel Benchmarks.
[17] Pallas. Pallas MPI Benchmarks. http://www.pallas.com/e/products/pmb/.
[18] R. Gupta, P. Balaji, D. K. Panda, and J. Nieplocha. Efficient Collective Operations Using Remote Memory Operations on VIA-Based Clusters. In Int'l Parallel and Distributed Processing Symposium (IPDPS'03), April 2003.
[19] S. J. Sistare and C. J. Jackson. Ultra-High Performance Communication with MPI and the Sun Fire Link Interconnect. In Proceedings of Supercomputing, 2002.
[20] J. S. Vetter and F. Mueller. Communication Characteristics of Large-Scale Scientific Applications for Contemporary Cluster Architectures. In Int'l Parallel and Distributed Processing Symposium (IPDPS'02), April 2002.
[21] J. Wu, J. Liu, P. Wyckoff, and D. K. Panda. Impact of On-Demand Connection Management in MPI over VIA. In Proceedings of the IEEE International Conference on Cluster Computing, 2002.
[22] Y. Zhou, A. Bilas, S. Jagannathan, C. Dubnicki, J. F. Philbin, and K. Li. Experiences with VI communication for database storage. In Proceedings of the International Symposium on Computer Architecture (ISCA'02), 2002.

    Published In

    ICS '03: Proceedings of the 17th annual international conference on Supercomputing
    June 2003
    380 pages
    ISBN: 1581137338
    DOI: 10.1145/782814

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 23 June 2003


    Author Tags

    1. InfiniBand
    2. MPI
    3. cluster computing
    4. high performance computing

    Qualifiers

    • Article

    Conference

    ICS03: International Conference on Supercomputing 2003
    June 23 - 26, 2003
    San Francisco, CA, USA

    Acceptance Rates

    ICS '03 Paper Acceptance Rate: 36 of 171 submissions, 21%
    Overall Acceptance Rate: 629 of 2,180 submissions, 29%
