DOI: 10.1145/1921168.1921175
research-article

G-RCA: a generic root cause analysis platform for service quality management in large IP networks

Published: 30 November 2010

Abstract

As IP networks have become the mainstay of an increasingly diverse set of applications, ranging from Internet games and streaming video to e-commerce and online banking, and even to mission-critical 911, best-effort service is no longer acceptable. This requires a transformation in network management from detecting and replacing individual faulty network elements to managing the service quality as a whole.
In this paper we describe the design and development of a Generic Root Cause Analysis platform (G-RCA) for service quality management (SQM) in large IP networks. G-RCA contains a comprehensive service dependency model that includes network topological and cross-layer relationships, protocol interactions, and control plane dependencies. G-RCA abstracts the RCA process into signature identification for symptom and diagnostic events, temporal and spatial event correlation, and reasoning and inference logic. G-RCA provides a flexible rule specification language that allows operators to quickly customize G-RCA into different RCA tools as new problems need to be investigated. G-RCA is also integrated with data trending, manual data exploration, and statistical correlation mining capabilities. G-RCA has proven to be a highly effective SQM platform in several different applications, and we present results regarding BGP flaps, PIM flaps in Multicast VPN service, and end-to-end throughput drops in CDN service.
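The three stages named in the abstract (event signatures, temporal and spatial correlation, and reasoning) can be illustrated with a minimal Python sketch. It is not G-RCA's rule language or implementation: the `Event` and `Rule` types, the `depends_on` map standing in for the service dependency model, the time windows, and the highest-priority-wins reasoning step are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    name: str       # event signature, e.g. "bgp_flap" (hypothetical names)
    location: str   # network element where the event was observed
    start: float    # start time in seconds
    end: float      # end time in seconds


@dataclass(frozen=True)
class Rule:
    symptom: str     # symptom event signature this rule applies to
    diagnostic: str  # diagnostic event signature that may explain it
    window: float    # temporal slack in seconds (assumed parameter)
    priority: int    # assumed reasoning scheme: higher priority wins


def temporally_joined(symptom, diag, window):
    """True if the symptom overlaps the diagnostic interval widened by `window`."""
    return diag.start - window <= symptom.end and symptom.start <= diag.end + window


def spatially_joined(symptom, diag, depends_on):
    """True if the diagnostic sits on the symptom's element or one it depends on."""
    related = depends_on.get(symptom.location, set())
    return diag.location == symptom.location or diag.location in related


def diagnose(symptom, diagnostics, rules, depends_on):
    """Return the best-matching diagnostic signature, or "unknown"."""
    best = None
    for rule in rules:
        if rule.symptom != symptom.name:
            continue
        for diag in diagnostics:
            if diag.name != rule.diagnostic:
                continue
            if (temporally_joined(symptom, diag, rule.window)
                    and spatially_joined(symptom, diag, depends_on)):
                if best is None or rule.priority > best[0].priority:
                    best = (rule, diag)
    return best[1].name if best else "unknown"


# Hypothetical example: a BGP session that depends on a router and one of its links.
depends_on = {"r1-r2": {"r1", "r1:ge-0/0/1"}}
rules = [
    Rule("bgp_flap", "link_down", window=180.0, priority=10),
    Rule("bgp_flap", "cpu_high", window=60.0, priority=5),
]
symptom = Event("bgp_flap", "r1-r2", 100.0, 101.0)
diagnostics = [
    Event("link_down", "r1:ge-0/0/1", 95.0, 100.0),
    Event("cpu_high", "r1", 90.0, 120.0),
]
print(diagnose(symptom, diagnostics, rules, depends_on))  # link_down
```

Both diagnostic events join the symptom in time and space here, so the assumed priority ordering decides between them. A symptom with no joined diagnostic event is reported as "unknown".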



        Published In

        Co-NEXT '10: Proceedings of the 6th International COnference
        November 2010
        349 pages
        ISBN:9781450304481
        DOI:10.1145/1921168

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Conference

        Co-NEXT '10
        Co-NEXT '10: Conference on emerging Networking EXperiments and Technologies
        November 30 - December 3, 2010
        Philadelphia, Pennsylvania

        Acceptance Rates

        Overall Acceptance Rate 198 of 789 submissions, 25%
