DOI: 10.1109/CCGrid.2012.62
Article

Topology Agnostic Dynamic Quick Reconfiguration for Large-Scale Interconnection Networks

Published: 13 May 2012

Abstract

Tolerating faults in interconnection networks is of vital importance in today's huge computer installations. Still, the existing solutions fall short of being satisfactory. They require that the system default to a routing algorithm that is inferior to the original, either in terms of performance, or in terms of the need for virtual channels, or both. Furthermore, since dynamic reconfiguration is not supported in current hardware, existing methods require the system to be halted while reconfiguration takes place in order to avoid deadlocks. In this paper we present a method that efficiently generates a new routing function in the presence of faults. The new routing function only reroutes the traffic that is affected by the fault, so that the performance of the original routing function is preserved to the extent possible. No specific functionality in the switches is required; the method needs exactly the same number of virtual channels in the presence of faults as the original routing algorithm did. Finally, the new routing function is compatible with the old one, so that a deadlock-free dynamic transition between the old and the new routing function is immediately available. This means that our solution can easily be implemented on current InfiniBand platforms, e.g. through the OFED software stack. We demonstrate that the method is workable for meshes, tori and fat-trees, and that it is able to guarantee one-fault tolerance for all of these topologies.
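The idea of rerouting only the traffic affected by a fault can be illustrated with a small toy example. The sketch below is not the paper's algorithm: it assumes a 3x3 mesh with dimension-order (XY) routing, a single failed link, and a naive shortest-path patch for the affected table entries. It ignores virtual channels and makes no deadlock-freedom guarantee, which is the hard part the paper addresses. All function names (`mesh_links`, `build_table`, `reroute`, `delivers`) are hypothetical.

```python
from collections import deque

def mesh_links(w, h):
    """Undirected links of a w x h mesh, each a frozenset of two nodes."""
    links = set()
    for x in range(w):
        for y in range(h):
            if x + 1 < w:
                links.add(frozenset({(x, y), (x + 1, y)}))
            if y + 1 < h:
                links.add(frozenset({(x, y), (x, y + 1)}))
    return links

def xy_next_hop(node, dest):
    """Dimension-order (XY) routing: correct x first, then y."""
    (x, y), (dx, dy) = node, dest
    if x != dx:
        return (x + (1 if dx > x else -1), y)
    return (x, y + (1 if dy > y else -1))

def build_table(w, h):
    """Destination-based table: (switch, destination) -> next hop."""
    nodes = [(x, y) for x in range(w) for y in range(h)]
    return {(n, d): xy_next_hop(n, d) for n in nodes for d in nodes if n != d}

def distances_to(dest, links):
    """BFS hop counts to dest over the given links, plus the adjacency map."""
    adj = {}
    for link in links:
        a, b = tuple(link)
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    dist = {dest: 0}
    queue = deque([dest])
    while queue:
        cur = queue.popleft()
        for nb in adj.get(cur, []):
            if nb not in dist:
                dist[nb] = dist[cur] + 1
                queue.append(nb)
    return dist, adj

def reroute(table, links, failed):
    """Patch only the entries whose next hop crossed the failed link.

    Toy assumption: the replacement hop is the surviving neighbour closest
    to the destination. Nothing here guarantees deadlock freedom.
    """
    alive = links - {failed}
    new_table = dict(table)
    changed = []
    for (node, dest), hop in table.items():
        if frozenset({node, hop}) != failed:
            continue  # unaffected traffic keeps its original route
        dist, adj = distances_to(dest, alive)
        new_table[(node, dest)] = min(adj[node], key=lambda nb: (dist[nb], nb))
        changed.append((node, dest))
    return new_table, changed

def delivers(table, src, dest, limit=64):
    """Follow next hops from src; True if dest is reached within limit hops."""
    node = src
    for _ in range(limit):
        if node == dest:
            return True
        node = table[(node, dest)]
    return node == dest
```

Calling `reroute(build_table(3, 3), mesh_links(3, 3), frozenset({(0, 0), (1, 0)}))` changes 9 of the 72 table entries in this toy setup; the other 63 keep their original XY routes, and `delivers` can be used to check that every source still reaches every destination.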


Published In

CCGRID '12: Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012)
May 2012
936 pages
ISBN: 9780769546919

Publisher

IEEE Computer Society
United States

Author Tags

1. HPC
2. fault tolerance
3. reconfiguration
