DOI: 10.1145/3555050.3569137

Drift-bottle: a lightweight and distributed approach to failure localization in general networks

Published: 30 November 2022

Abstract

Network failures severely impair network performance, degrading the latency and throughput of data transmission. Existing failure localization solutions for general networks suffer from problems such as difficulty in acquiring data from end hosts, the need for extra infrastructure, and excessive resource consumption. Meanwhile, solutions designed for data center networks (DCNs) are hard to apply in general networks, as they usually rely on the topological regularity of DCNs. In this paper, we propose Drift-Bottle, a lightweight and distributed approach to failure localization in general networks. In Drift-Bottle, each switch judges the status of flows and makes a local inference about suspicious links. We design a distributed localization scheme in which each normal packet serves as a "drift-bottle" that carries a "letter", i.e., a lightweight inference header, while traversing the network. Each switch along the path updates the inference header by aggregating it with its local inference. Whenever the inference is evident enough to identify the culprit links of a failure, a warning is sent to the operator immediately. Drift-Bottle implements its function mainly on the data plane of programmable switches and thus significantly reduces the overhead imposed on switches. Simulation-based evaluation on different topologies demonstrates that Drift-Bottle provides fast, precise, and lightweight failure localization to operators of general networks.
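The per-hop aggregation described in the abstract can be pictured with a small sketch. The abstract does not specify the header layout, the aggregation rule, or the warning criterion, so everything below is an assumption made for illustration: the header is modeled as a map from link id to a suspicion score, aggregation is a plain sum truncated to the top few entries, and `MAX_ENTRIES`, `WARN_THRESHOLD`, `aggregate`, `culprits`, and `forward` are hypothetical names rather than the paper's design. It is written in Python for readability, whereas the paper targets the data plane of programmable switches.

```python
from dataclasses import dataclass, field

MAX_ENTRIES = 3      # hypothetical: header keeps only the top few suspects
WARN_THRESHOLD = 5   # hypothetical: score at which a warning is raised


@dataclass
class InferenceHeader:
    """The "letter": suspicion scores keyed by link id (illustrative format)."""
    scores: dict = field(default_factory=dict)


def aggregate(header, local_inference):
    """Merge one switch's local inference into the header carried by a packet."""
    merged = dict(header.scores)
    for link, score in local_inference.items():
        merged[link] = merged.get(link, 0) + score
    # Keep the header lightweight: retain only the highest-scoring links
    # (ties broken by link id so the result is deterministic).
    top = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:MAX_ENTRIES]
    return InferenceHeader(dict(top))


def culprits(header):
    """Links whose accumulated suspicion is high enough to warn the operator."""
    return sorted(l for l, s in header.scores.items() if s >= WARN_THRESHOLD)


def forward(path_inferences):
    """Carry one packet along a path; collect warnings raised at each hop."""
    header = InferenceHeader()
    warnings = []
    for switch, local in path_inferences:
        header = aggregate(header, local)
        found = culprits(header)
        if found:
            warnings.append((switch, found))
    return header, warnings


# Assumed local inferences of three switches on one path.
path = [
    ("s1", {"s1-s2": 2, "s2-s3": 1}),
    ("s2", {"s1-s2": 2}),
    ("s3", {"s1-s2": 1, "s3-s4": 1}),
]
header, warnings = forward(path)
```

In this toy run no single switch is confident on its own, but the suspicion accumulated in the header crosses the assumed threshold at the third hop, which is the point at which a warning would be sent to the operator.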

    Published In

    CoNEXT '22: Proceedings of the 18th International Conference on emerging Networking EXperiments and Technologies
    November 2022
    431 pages
    ISBN: 9781450395083
    DOI: 10.1145/3555050

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. failure localization
    2. general network
    3. in-network intelligence
    4. programmable switch

    Qualifiers

    • Research-article

    Conference

    CoNEXT '22

    Acceptance Rates

    CoNEXT '22 Paper Acceptance Rate: 28 of 151 submissions, 19%
    Overall Acceptance Rate: 198 of 789 submissions, 25%
