DOI: 10.1145/3431379.3460650
Research Article | Open Access

Q-adaptive: A Multi-Agent Reinforcement Learning Based Routing on Dragonfly Network

Published: 21 June 2021

Abstract

High-radix interconnects such as Dragonfly and its variants rely on adaptive routing to balance network traffic for optimal performance. Ideally, adaptive routing forwards packets along the least-congested of the available minimal and non-minimal paths. In practice, current adaptive routing algorithms estimate routing path congestion from local information such as output queue occupancy. Estimating global path congestion from local information is inevitably inaccurate, because a router has no precise knowledge of link states several hops away, and this inaccuracy can lead to interconnect congestion. In this study, we present Q-adaptive routing, a multi-agent reinforcement learning routing scheme for Dragonfly systems. Q-adaptive routing enables routers to learn to route autonomously by leveraging advanced reinforcement learning technology. The proposed Q-adaptive routing is highly scalable thanks to its fully distributed nature, requiring no shared information between routers. Furthermore, a new two-level Q-table is designed for Q-adaptive to make it computationally lightweight and to save 50% of router memory usage compared with the previous Q-routing. We implement the proposed Q-adaptive routing in the SST/Merlin simulator. Our evaluation results show that Q-adaptive routing achieves up to 10.5% system throughput improvement and 5.2x average packet latency reduction compared with adaptive routing algorithms. Remarkably, Q-adaptive can even outperform the optimal VALn non-minimal routing under the ADV+1 adversarial traffic pattern, with up to 3% system throughput improvement and 75% average packet latency reduction.
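The abstract summarizes the scheme but not its learning rule. In Boyan and Littman's original Q-routing, which Q-adaptive builds on, router $x$ maintains $Q_x(d, y)$, the estimated time to deliver a packet bound for destination $d$ via neighbor $y$, and updates it on each hop as

$$Q_x(d, y) \leftarrow Q_x(d, y) + \alpha \left( q + s + \min_{z} Q_y(d, z) - Q_x(d, y) \right),$$

where $q$ is the queueing delay at $x$, $s$ is the transmission delay to $y$, and $\alpha$ is the learning rate.

The sketch below illustrates how a per-router agent with a two-level Q-table might look. It is a hypothetical reading of the abstract, not the paper's implementation: the keying (a coarse group-level table for remote traffic, a router-level table inside the destination group), the reward (per-hop delay), and all names and constants (QAdaptiveRouter, ALPHA, EPSILON) are illustrative assumptions.

```python
import random
from collections import defaultdict

# Illustrative constants; the paper's actual hyperparameters are not
# given in the abstract.
ALPHA = 0.1      # learning rate
EPSILON = 0.05   # exploration probability

class QAdaptiveRouter:
    """Hypothetical per-router agent with a two-level Q-table."""

    def __init__(self, router_id, ports):
        self.router_id = router_id
        self.ports = list(ports)
        # Level 1: Q[dest_group][port] -> estimated delay to reach a
        # remote group. Level 2: Q[dest_router][port] -> used once the
        # packet is inside its destination group.
        self.q_group = defaultdict(lambda: {p: 0.0 for p in self.ports})
        self.q_router = defaultdict(lambda: {p: 0.0 for p in self.ports})

    def _table_and_key(self, my_group, dest_group, dest_router):
        # Route at group granularity until the packet reaches the
        # destination group, then at router granularity.
        if dest_group == my_group:
            return self.q_router, dest_router
        return self.q_group, dest_group

    def choose_port(self, my_group, dest_group, dest_router):
        table, key = self._table_and_key(my_group, dest_group, dest_router)
        if random.random() < EPSILON:      # occasional exploration
            return random.choice(self.ports)
        q = table[key]
        return min(q, key=q.get)           # port with least estimated delay

    def best_q(self, my_group, dest_group, dest_router):
        # Reported upstream as the neighbor's bootstrap target.
        table, key = self._table_and_key(my_group, dest_group, dest_router)
        return min(table[key].values())

    def update(self, my_group, dest_group, dest_router, port,
               hop_delay, neighbor_best_q):
        # Classic Q-routing update: target = delay of this hop plus the
        # neighbor's best estimate of the remaining delay.
        table, key = self._table_and_key(my_group, dest_group, dest_router)
        target = hop_delay + neighbor_best_q
        table[key][port] += ALPHA * (target - table[key][port])

# Toy usage: router 0 in group 0 forwards a packet toward router 7 in
# group 2, then learns from the downstream neighbor's feedback.
r = QAdaptiveRouter(router_id=0, ports=["min", "nonmin0", "nonmin1"])
port = r.choose_port(my_group=0, dest_group=2, dest_router=7)
r.update(my_group=0, dest_group=2, dest_router=7, port=port,
         hop_delay=3.0, neighbor_best_q=12.0)
```

Under this reading, the memory saving is intuitive: traffic bound for a remote group shares a single group-level row instead of one row per remote router, so only intra-group destinations need router-granularity entries.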



Published In

HPDC '21: Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing
June 2021
275 pages
ISBN: 9781450382175
DOI: 10.1145/3431379
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. dragonfly
  2. hpc
  3. interconnect network
  4. multi-agent reinforcement learning
  5. routing

Acceptance Rates

Overall Acceptance Rate: 166 of 966 submissions, 17%
