DOI: 10.1145/3663529.3663834
research-article

Illuminating the Gray Zone: Non-intrusive Gray Failure Localization in Server Operating Systems

Published: 10 July 2024

Abstract

Timely localization of the root causes of gray failures is essential for maintaining the stability of a server OS. Previous intrusive gray failure localization methods usually require modifying application source code, which limits their practical deployment. In this paper, we propose GrayScope, a method for non-intrusively localizing the root causes of gray failures based on metric data in the server OS. Its core idea is to combine expert knowledge with causal learning techniques to capture more reliable inter-metric causal relationships. It then incorporates metric correlations and anomaly degrees to help identify potential root causes of gray failures. Additionally, it infers the gray failure propagation paths between metrics, which provides interpretability and improves operators’ efficiency in mitigating gray failures. We evaluate GrayScope on 1241 injected gray failure cases and 135 cases from industrial experiments at Huawei. GrayScope achieves an AC@5 of 90% and an interpretability accuracy of 81%, significantly outperforming popular root cause localization methods. Additionally, we have made the code publicly available to facilitate further research.
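
The abstract reports AC@5 without defining it. In the root cause localization literature, AC@k usually denotes the fraction of failure cases whose true root cause appears among the top k ranked candidates. The sketch below illustrates that common reading only; the exact definition used in the paper is not stated in the abstract, and the function name `ac_at_k`, the metric names, and the sample rankings are hypothetical.

```python
from typing import Sequence


def ac_at_k(ranked_lists: Sequence[Sequence[str]],
            true_causes: Sequence[str],
            k: int = 5) -> float:
    """Fraction of cases whose true root cause is in the top-k candidates.

    Assumes exactly one labeled root cause per case (an illustrative
    simplification, not necessarily the paper's setup).
    """
    if len(ranked_lists) != len(true_causes):
        raise ValueError("each case needs exactly one labeled root cause")
    if not ranked_lists:
        return 0.0
    hits = sum(
        1
        for ranked, truth in zip(ranked_lists, true_causes)
        if truth in ranked[:k]
    )
    return hits / len(ranked_lists)


# Hypothetical ranked metric lists, one per failure case (best candidate first).
cases = [
    ["cpu_iowait", "disk_util", "mem_used"],
    ["net_retrans", "cpu_user", "disk_util"],
    ["mem_used", "cpu_user", "net_retrans"],
    ["cpu_user", "mem_used", "cpu_iowait"],
]
# Hypothetical ground-truth root cause metric for each case.
truths = ["disk_util", "net_retrans", "disk_util", "cpu_iowait"]

print(ac_at_k(cases, truths, k=2))  # 2 of 4 cases hit within the top 2
print(ac_at_k(cases, truths, k=3))  # 3 of 4 cases hit within the top 3
```

Under this reading, an AC@5 of 90% would mean that in nine out of ten evaluated cases the labeled root cause metric is among the five highest-ranked candidates.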

    Published In

    FSE 2024: Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering
    July 2024
    715 pages
    ISBN:9798400706585
    DOI:10.1145/3663529

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Causal Discovery
    2. Gray Failure
    3. Root Cause Localization
    4. Server Operating System

    Qualifiers

    • Research-article

    Conference

    FSE '24

    Acceptance Rates

    Overall Acceptance Rate 112 of 543 submissions, 21%
