DOI: 10.1109/ISCA.2005.37
Near-Optimal Worst-Case Throughput Routing for Two-Dimensional Mesh Networks

Published: 01 May 2005

Abstract

Minimizing latency and maximizing throughput are important goals in the design of routing algorithms for interconnection networks. Ideally, we would like a routing algorithm to (a) route packets using the minimal number of hops to reduce latency and preserve communication locality, (b) deliver good worst-case and average-case throughput, and (c) enable a low-complexity (and hence low-latency) router implementation. In this paper, we focus on routing algorithms for an important class of interconnection networks: two-dimensional (2D) mesh networks. Existing routing algorithms for mesh networks fail to satisfy one or more of the design goals mentioned above. Variously, they suffer from poor worst-case throughput (ROMM [13], DOR [23]), poor latency due to increased packet hops (VALIANT [31]), or increased latency due to hardware complexity (minimal adaptive [7, 30]). The major contribution of this paper is the design of an oblivious routing algorithm, O1TURN, with provable near-optimal worst-case throughput, good average-case throughput, low design complexity, and a minimal number of network hops for 2D mesh networks, thus satisfying all the stated design goals. O1TURN offers optimal worst-case throughput when the network radix (k in a k×k network) is even. When the network radix is odd, O1TURN is within a 1/k^2 factor of optimal worst-case throughput. O1TURN achieves superior or comparable average-case throughput with global traffic as well as local traffic. For example, O1TURN achieves 18.8%, 0.7%, and 13.6% higher average-case throughput than DOR, ROMM, and VALIANT routing, respectively, when averaged over one million random traffic patterns on an 8×8 network. Finally, we demonstrate that O1TURN is well suited for a partitioned router implementation whose delay complexity is similar to that of a simple dimension-ordered router. Our implementation incurs a marginal increase in switch arbitration delay that is completely hidden in pipelined routers, as it is not on the clock-critical path.
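
The abstract does not spell out how O1TURN selects a route. As commonly described, it picks one of the two minimal dimension-ordered routes (X-then-Y or Y-then-X) with equal probability for each packet, so every path is minimal and makes at most one turn. The following Python sketch illustrates that idea under this assumption. The function names `dor_path` and `o1turn_path` are hypothetical, and the sketch ignores virtual channels, deadlock avoidance, and the router microarchitecture.

```python
import random


def dor_path(src, dst, order):
    """Return the minimal dimension-ordered path on a 2D mesh.

    src and dst are (x, y) node coordinates; order is "xy" or "yx"
    and gives the dimension that is fully traversed first.
    """
    x, y = src
    path = [(x, y)]
    for dim in order:
        if dim == "x":
            step = 1 if dst[0] > x else -1
            while x != dst[0]:
                x += step
                path.append((x, y))
        else:
            step = 1 if dst[1] > y else -1
            while y != dst[1]:
                y += step
                path.append((x, y))
    return path


def o1turn_path(src, dst, rng=random):
    """Illustrative O1TURN-style choice (assumption: XY or YX, 50/50).

    The choice is oblivious: it does not look at network load.
    """
    order = "xy" if rng.random() < 0.5 else "yx"
    return dor_path(src, dst, order)
```

For instance, `dor_path((0, 0), (2, 3), "xy")` walks along X to column 2 and then along Y to row 3, while the `"yx"` order visits the opposite corner of the same bounding rectangle. Both paths have the same minimal hop count, which is what lets a random choice between them spread load without adding hops.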

References

[1] A. Agarwal, R. Bianchini, D. Chaiken, K. L. Johnson, D. Kranz, J. Kubiatowicz, B.-H. Lim, K. Mackenzie, and D. Yeung. The MIT Alewife machine: architecture and performance. In 25 years of the international symposia on Computer architecture (selected papers), pages 509-520. ACM Press, 1998.
[2] Y. Choi and T. M. Pinkston. Evaluation of crossbar architectures for deadlock recovery routers. J. Parallel Distrib. Comput., 61(1):49-78, 2001.
[3] D. Culler, J. P. Singh, and A. Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers, 1999.
[4] W. J. Dally. Virtual-Channel Flow Control. IEEE Transactions on Parallel and Distributed Systems, 3(2):194-205, March 1992.
[5] W. J. Dally and C. L. Seitz. The TORUS routing chip. Journal of Distributed Computing, 1(3):187-196, October 1986.
[6] W. J. Dally and B. Towles. Route packets, not wires: on-chip interconnection networks. In Proceedings of the Design Automation Conference, pages 684-689, Las Vegas, NV, June 2001.
[7] J. Duato. A New Theory of Deadlock-Free Adaptive Routing in Wormhole Networks. IEEE Transactions on Parallel and Distributed Systems, 4(12):1320-1331, December 1993.
[8] C. J. Glass and L. M. Ni. The Turn Model for Adaptive Routing. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 278-287, May 1992.
[9] C. R. Jesshope, P. R. Miller, and J. T. Yantchev. High performance communications in processor networks. In ISCA '89: Proceedings of the 16th annual international symposium on Computer architecture, pages 150-157. ACM Press, 1989.
[10] R. K. Koeninger, M. Furtney, and M. Walker. A shared memory MPP from Cray Research. Digital Technical Journal, 6(2):8-21, 1994.
[11] A. K. V. and T. Pinkston. An Efficient, Fully Adaptive Deadlock Recovery Scheme: Disha. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 201-210, June 1995.
[12] R. Mullins, A. West, and S. Moore. Low-Latency Virtual Channel Routers for On-Chip Networks. In Proceedings of the 31st Annual International Symposium on Computer Architecture (to appear), Jun 2004.
[13] T. Nesson and S. L. Johnsson. ROMM routing on mesh and torus networks. In Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures, pages 275-287. ACM Press, 1995.
[14] P. R. Nuth and W. J. Dally. The J-Machine network. In Proceedings of the 1991 IEEE International Conference on Computer Design on VLSI in Computer & Processors, pages 420-423. IEEE Computer Society, 1992.
[15] L. S. Peh and W. J. Dally. A Delay Model and Speculative Architecture For Pipelined Routers. In Proceedings of the Seventh International Symposium on High-Performance Computer Architecture (HPCA-7), pages 255-266, January 2001.
[16] K. Sankaralingam, R. Nagarajan, H. Liu, C. Kim, J. Huh, D. Burger, S. W. Keckler, and C. R. Moore. Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture. In Proceedings of the 30th annual international symposium on Computer architecture, pages 422-433. ACM Press, 2003.
[17] S. L. Scott and G. Thorson. The Cray T3E network: Adaptive routing in a high performance 3D torus. In HOT Interconnects IV, August 1996.
[18] D. Seo, A. Ali, W.-T. Lim, N. Rafique, and M. Thottethodi. Near-optimal worst-case throughput routing for two-dimensional mesh networks. Technical Report TR-ECE 05-03, Purdue University, 2005.
[19] L. Shang, L. S. Peh, and N. K. Jha. Dynamic voltage scaling with links for power optimization of interconnection networks. In Proceedings of the 9th IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 79-90, Feb 2003.
[20] A. Singh, W. J. Dally, A. K. Gupta, and B. Towles. GOAL: a load-balanced adaptive routing algorithm for torus networks. In Proceedings of the 30th annual international symposium on Computer architecture, pages 194-205. ACM Press, 2003.
[21] A. Singh, W. J. Dally, A. K. Gupta, and B. Towles. Adaptive Channel Queue Routing on k-ary n-cubes. In Proceedings of the Sixteenth Symposium on Parallel Algorithms and Architectures, June 2004.
[22] A. Singh, W. J. Dally, B. Towles, and A. K. Gupta. Locality-preserving randomized oblivious routing on torus networks. In Proceedings of the fourteenth annual ACM symposium on Parallel algorithms and architectures, pages 9-13. ACM Press, 2002.
[23] H. Sullivan and T. R. Bashkow. A large scale, homogeneous, fully distributed parallel machine, I. In Proceedings of the 4th annual symposium on Computer architecture, pages 105-117. ACM Press, 1977.
[24] I. Sutherland, B. Sproull, and D. Harris. Logical Effort: Designing Fast CMOS Circuits. Morgan Kaufmann Publishers, 1999.
[25] S. Swanson, K. Michelson, A. Schwerin, and M. Oskin. WaveScalar. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, page 291. IEEE Computer Society, 2003.
[26] M. Taylor, W. Lee, S. Amarasinghe, and A. Agarwal. Scalar Operand Networks: On-chip interconnect for ILP in Partitioned Architectures. In Proceedings of the International Symposium on High Performance Computer Architecture, pages 341-353, 2003.
[27] M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffman, P. Johnson, J.-W. Lee, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal. The Raw microprocessor: A computational fabric for software circuits and general-purpose programs. IEEE Micro, 22(2):25-35, 2002.
[28] B. Towles and W. J. Dally. Worst-case Traffic for Oblivious Routing Functions. Computer Architecture Letters, 1, February 2002.
[29] B. Towles, W. J. Dally, and S. Boyd. Throughput-centric routing algorithm design. In Proceedings of the fifteenth annual ACM symposium on Parallel algorithms and architectures, pages 200-209. ACM Press, 2003.
[30] J. Upadhyay, V. Varavithya, and P. Mohapatra. A Traffic Balanced Adaptive wormhole routing scheme for Two-Dimensional Meshes. IEEE Transactions on Computers, pages 190-197, May 1997.
[31] L. G. Valiant and G. J. Brebner. Universal schemes for parallel communication. In Proceedings of the thirteenth annual ACM symposium on Theory of computing, pages 263-277. ACM Press, 1981.
[32] H. Wang, L.-S. Peh, and S. Malik. Power-driven design of router microarchitectures in on-chip networks. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, page 105. IEEE Computer Society, 2003.

Published In

ISCA '05: Proceedings of the 32nd annual international symposium on Computer Architecture
June 2005
541 pages
ISBN: 076952270X
  • ACM SIGARCH Computer Architecture News, Volume 33, Issue 2
    ISCA 2005
    May 2005
    531 pages
    ISSN: 0163-5964
    DOI: 10.1145/1080695

Publisher

IEEE Computer Society

United States

Conference

ISCA05

Acceptance Rates

ISCA '05 Paper Acceptance Rate 45 of 194 submissions, 23%;
Overall Acceptance Rate 543 of 3,203 submissions, 17%
