DOI: 10.1145/3368089.3417055
research-article

Towards intelligent incident management: why we need it and how we make it

Published: 08 November 2020

Abstract

The management of cloud service incidents (unplanned interruptions or outages of a service or product) greatly affects customer satisfaction and business revenue. After years of effort, cloud enterprises are able to resolve most incidents automatically and in a timely manner. In practice, however, we still observe critical service incidents that occur unexpectedly and that orchestrated diagnosis workflows fail to mitigate. To accelerate the understanding of unprecedented incidents and provide actionable recommendations, modern incident management systems employ AIOps (Artificial Intelligence for IT Operations). In this paper, to provide a broad view of industrial incident management and an understanding of modern incident management systems, we conduct a comprehensive empirical study spanning over two years of incident management practices at Microsoft. In particular, we identify two critical challenges (namely, incomplete service/resource dependencies and imprecise resource health assessment) and investigate the underlying reasons from the perspective of cloud system design and operations. We also present IcM BRAIN, our AIOps framework towards intelligent incident management, and show the practical benefits it brings to the cloud services of Microsoft.

Supplementary Material

Auxiliary Archive (fse20ind-p61-p-archive.zip)
Auxiliary Teaser Video (fse20ind-p61-p-teaser.mp4)
This is a presentation video of my talk at ESEC/FSE 2020 on our paper accepted in the industry track. In this paper, we first report the current state of incident management. Then, we summarize its key challenges and give our understanding of the fundamental reasons behind them. Finally, we present BRAIN, our AIOps framework towards more reliable cloud systems. Experimental results demonstrate the effectiveness of BRAIN.
Auxiliary Presentation Video (fse20ind-p61-p-video.mp4)

Published In

ESEC/FSE 2020: Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
November 2020
1703 pages
ISBN: 9781450370431
DOI: 10.1145/3368089
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. AIOps
  2. Cloud Computing
  3. Incident Management

Qualifiers

  • Research-article

Conference

ESEC/FSE '20
Acceptance Rates

Overall Acceptance Rate 112 of 543 submissions, 21%
