
A New Approach to Fault-Tolerant Wormhole Routing for Mesh-Connected Parallel Computers

Published: 01 April 2004

Abstract

A new method for fault-tolerant wormhole routing in meshes of arbitrary dimension is introduced. The method was motivated by certain routing requirements of an initial design of the Blue Gene supercomputer at IBM Research. The machine is organized as a three-dimensional mesh containing many thousands of nodes, and the routing method should tolerate a few percent of the nodes being faulty. There has been much work on routing methods for meshes that route messages around faults or regions of faults. The new method is to declare certain nonfaulty nodes to be "lambs." A lamb is used for routing but not processing, so a lamb is neither the source nor the destination of a message. The lambs are chosen so that every "survivor node," a node that is neither faulty nor a lamb, can reach every survivor node by at most two rounds of dimension-ordered (such as e-cube) routing. An algorithm for finding a set of lambs is presented. The results of simulations on 2D and 3D meshes of various sizes with various numbers of random node faults are given. For example, on a 32 × 32 × 32 3D mesh with 3 percent random faults and using at most two rounds of e-cube routing for each message, the average number of lambs is less than 68, which is less than 7 percent of the number 983 of faults and less than 0.21 percent of the number 32,768 of nodes.
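
The reachability condition in the abstract can be illustrated with a minimal Python sketch. It is not the authors' lamb-finding algorithm. It only checks whether one node can reach another by at most two rounds of dimension-ordered routing in a mesh with faulty nodes. The function names (`ecube_path`, `one_round_ok`, `two_round_ok`), the representation of nodes as coordinate tuples, and the assumption that the intermediate node between the two rounds may be any nonfaulty node are all choices made for illustration.

```python
from itertools import product


def ecube_path(src, dst):
    """Nodes visited by dimension-ordered routing from src to dst.

    The route corrects coordinate 0 first, then coordinate 1, and so on.
    """
    path = [tuple(src)]
    cur = list(src)
    for d in range(len(src)):
        step = 1 if dst[d] > cur[d] else -1
        while cur[d] != dst[d]:
            cur[d] += step
            path.append(tuple(cur))
    return path


def one_round_ok(src, dst, faulty):
    """True if the dimension-ordered route from src to dst avoids all faults."""
    return all(node not in faulty for node in ecube_path(src, dst))


def two_round_ok(src, dst, faulty, shape):
    """True if dst is reachable from src in at most two rounds.

    Assumption: the intermediate node may be any nonfaulty node of the mesh.
    """
    if one_round_ok(src, dst, faulty):
        return True
    nodes = product(*(range(k) for k in shape))
    return any(
        one_round_ok(src, mid, faulty) and one_round_ok(mid, dst, faulty)
        for mid in nodes
    )


# 3 x 3 mesh with one fault directly between source and destination.
faulty = {(1, 0)}
print(one_round_ok((0, 0), (2, 0), faulty))          # direct route hits the fault
print(two_round_ok((0, 0), (2, 0), faulty, (3, 3)))  # a detour through (0, 1) works
```

In this toy setting, a pair of nonfaulty nodes for which `two_round_ok` returns False is the kind of pair that the method resolves by declaring some nonfaulty node a lamb, so that the remaining survivors can all reach one another.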


Published In

IEEE Transactions on Computers, Volume 53, Issue 4
April 2004
114 pages

Publisher

IEEE Computer Society

United States

Author Tags

  1. Fault-tolerant routing
  2. mesh networks
  3. parallel computing
  4. performance evaluation
  5. wormhole routing

Qualifiers

  • Research-article
