DOI: 10.1145/514191.514205
Article

A network-failure-tolerant message-passing system for terascale clusters

Published: 22 June 2002

Abstract

The Los Alamos Message Passing Interface (LA-MPI) is an end-to-end network-failure-tolerant message-passing system designed for terascale clusters. LA-MPI is a standard-compliant implementation of MPI designed to tolerate network-related failures including I/O bus errors, network card errors, and wire-transmission errors. This paper details the distinguishing features of LA-MPI, including support for concurrent use of multiple types of network interface, and reliable message transmission utilizing multiple network paths and routes between a given source and destination. In addition, performance measurements on production-grade platforms are presented.
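
The abstract does not describe the mechanism behind reliable transmission over multiple paths, so the following is only an illustrative sketch of the general idea, not LA-MPI's actual protocol. It assumes a CRC32 checksum carried with each frame and a simple in-order failover across paths; the names `make_frame`, `check_frame`, `reliable_send`, and the three toy paths are all hypothetical.

```python
import zlib
from typing import Callable, List, Optional

# A "path" is modeled as a function that carries a frame across one
# network route and returns what arrives (None if the frame was lost).
# This is an assumption for illustration, not an LA-MPI interface.
Path = Callable[[bytes], Optional[bytes]]


def make_frame(payload: bytes) -> bytes:
    """Prefix the payload with a 4-byte CRC32 so the receiver can verify it."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def check_frame(frame: bytes) -> Optional[bytes]:
    """Return the payload if the checksum matches, otherwise None."""
    if len(frame) < 4:
        return None
    expected = int.from_bytes(frame[:4], "big")
    payload = frame[4:]
    return payload if zlib.crc32(payload) == expected else None


def reliable_send(payload: bytes, paths: List[Path], max_rounds: int = 3) -> bytes:
    """Try each path in turn until the receiver gets an intact payload."""
    frame = make_frame(payload)
    for _ in range(max_rounds):
        for path in paths:
            arrived = path(frame)
            if arrived is None:
                continue  # frame lost on this path; try the next one
            delivered = check_frame(arrived)
            if delivered is not None:
                return delivered  # checksum verified end to end
    raise ConnectionError("no path delivered an intact frame")


# Three hypothetical paths: one drops frames, one flips a bit, one works.
def lossy(frame: bytes) -> Optional[bytes]:
    return None


def corrupting(frame: bytes) -> Optional[bytes]:
    return frame[:-1] + bytes([frame[-1] ^ 0x01])


def healthy(frame: bytes) -> Optional[bytes]:
    return frame


result = reliable_send(b"halo exchange", [lossy, corrupting, healthy])
```

In this toy setup the first path loses the frame, the second delivers a frame whose checksum no longer matches, and the third delivers it intact, so the sender only succeeds because more than one path is available. A real implementation would also need acknowledgements, timeouts, and fragment sequencing, which are omitted here.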

Published In

ICS '02: Proceedings of the 16th international conference on Supercomputing
June 2002
338 pages
ISBN: 1581134835
DOI: 10.1145/514191

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. MPI
  2. fault tolerance
  3. message passing

Qualifiers

  • Article

Conference

ICS02: International Conference on Supercomputing
June 22 - 26, 2002
New York, New York, USA

Acceptance Rates

ICS '02 paper acceptance rate: 31 of 144 submissions (22%)
Overall acceptance rate: 629 of 2,180 submissions (29%)
