
Characterization of failures in an operational IP backbone network

Published: 01 August 2008

Abstract

As the Internet evolves into a ubiquitous communication infrastructure and supports increasingly important services, its dependability in the presence of various failures becomes critical. In this paper, we analyze IS-IS routing updates from the Sprint IP backbone network to characterize failures that affect IP connectivity. Failures are first classified based on patterns observed at the IP-layer; in some cases, it is possible to further infer their probable causes, such as maintenance activities, router-related and optical layer problems. Key temporal and spatial characteristics of each class are analyzed and, when appropriate, parameterized using well-known distributions. Our results indicate that 20% of all failures happen during a period of scheduled maintenance activities. Of the unplanned failures, almost 30% are shared by multiple links and are most likely due to router-related and optical equipment-related problems, respectively, while 70% affect a single link at a time. Our classification of failures reveals the nature and extent of failures in the Sprint IP backbone. Furthermore, our characterization of the different classes provides a probabilistic failure model, which can be used to generate realistic failure scenarios, as input to various network design and traffic engineering problems.
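As a rough illustration of how the class proportions quoted in the abstract combine, the sketch below converts them into shares of all failures and draws class labels at random. It is a minimal sketch and not the paper's model: the abstract gives no distribution parameters, "almost 30%" is rounded to 0.30, and the function and label names are hypothetical.

```python
import random
from collections import Counter

# Approximate proportions taken from the abstract.
P_MAINTENANCE = 0.20             # share of all failures during scheduled maintenance
P_MULTI_GIVEN_UNPLANNED = 0.30   # "almost 30%" of unplanned failures, rounded (assumption)
P_SINGLE_GIVEN_UNPLANNED = 0.70  # unplanned failures affecting a single link


def class_shares():
    """Return each class's share of all failures (hypothetical helper)."""
    unplanned = 1.0 - P_MAINTENANCE
    return {
        "maintenance": P_MAINTENANCE,
        "unplanned_multi_link": unplanned * P_MULTI_GIVEN_UNPLANNED,
        "unplanned_single_link": unplanned * P_SINGLE_GIVEN_UNPLANNED,
    }


def sample_failure_classes(n, seed=0):
    """Draw n class labels with probabilities given by class_shares()."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    shares = class_shares()
    labels = list(shares)
    weights = [shares[label] for label in labels]
    return rng.choices(labels, weights=weights, k=n)


if __name__ == "__main__":
    for name, share in class_shares().items():
        print(f"{name}: {share:.2f}")
    print(Counter(sample_failure_classes(10_000)))
```

Under these rounded figures, about 24% of all failures are unplanned and shared by multiple links, and about 56% are unplanned single-link failures. A fuller scenario generator would also need the per-class temporal and spatial distributions, which the abstract does not state.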


Published In

IEEE/ACM Transactions on Networking  Volume 16, Issue 4
August 2008
249 pages

Publisher

IEEE Press

Publication History

Published: 01 August 2008
Revised: 13 September 2006
Received: 15 July 2004
Published in TON Volume 16, Issue 4

Author Tags

  1. failure analysis
  2. intermediate system to intermediate system (IS-IS) protocol
  3. link failures
  4. modeling
  5. routing

Qualifiers

  • Article
