DOI: 10.1145/2999572.2999609

LossRadar: Fast Detection of Lost Packets in Data Center Networks

Published: 06 December 2016

Abstract

Packet losses are common in data center networks, can stem from a variety of causes (e.g., congestion, blackholes), and significantly affect application performance and network operations. It is therefore important to detect packet losses quickly, independent of their root causes. To help diagnose and mitigate these losses, we also need to capture both the locations and the packet header information of the lost packets. Unfortunately, existing monitoring tools, which are generic and capture all types of network events, often fail to capture losses quickly, in enough detail, and with low overhead. Given the importance of loss in data centers, we propose LossRadar, a monitoring system designed specifically for loss detection that captures individual lost packets and their detailed information across the entire network at a fine time scale. Our extensive evaluation on prototypes and simulations demonstrates that LossRadar is easy to implement in hardware switches and achieves low memory and bandwidth overhead while providing detailed information about individual lost packets. We also build a loss analysis tool that demonstrates the usefulness of LossRadar with a few example applications.

Published In

CoNEXT '16: Proceedings of the 12th International Conference on emerging Networking EXperiments and Technologies
December 2016
524 pages
ISBN: 9781450342926
DOI: 10.1145/2999572
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data center networks
  2. network failures
  3. packet losses
  4. programmable switches

Qualifiers

  • Research-article

Acceptance Rates

CoNEXT '16 Paper Acceptance Rate 30 of 160 submissions, 19%;
Overall Acceptance Rate 198 of 789 submissions, 25%
