DOI: 10.1109/ICSE-SEIP.2019.00020
research-article

An empirical investigation of incident triage for online service systems

Published: 27 May 2019

Abstract

Online service systems have become increasingly popular. During the operation of an online service system, incidents (unplanned interruptions or outages of the service) are inevitable. As an initial step of incident management, it is important to automatically assign each incident report to a suitable team. We call this step incident triage; it can significantly affect the efficiency and accuracy of overall incident management. To better understand incident-triage practice in industry, we perform an empirical study of incident triage on 20 large-scale online service systems at Microsoft. We find that incorrect assignment of incident reports occurs frequently and incurs unnecessary cost, especially for high-severity incidents. For example, about 4.11% to 91.58% of incident reports are reassigned at least once, and the average increase in incident-triage time caused by the reassignments is up to 10.16X. Considering the similarity between bug triage (automatically assigning bug reports to software developers) and incident triage, we then explore the applicability of typical bug-triage techniques to incident triage for online service systems. The results demonstrate that these bug-triage techniques can correctly assign incident reports to a certain extent, but still need further improvement, especially for incident reports that are incorrectly assigned the first time. We further discuss possible ways to improve the accuracy of incident triage based on the empirical study. To the best of our knowledge, we are the first to investigate incident triage in industrial practice. Our results are useful for both practitioners and researchers in developing methods and tools to improve current incident-triage practice for online service systems.
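To make the two headline measurements concrete, the following is a minimal sketch of how a reassignment rate and a triage-time increase could be computed from incident records. It is not the authors' tooling. The `Incident` record, its field names, the sample data, and the definition of the increase as the ratio of mean triage times (reassigned versus directly assigned incidents) are all assumptions made for illustration; the abstract does not specify how the paper computes these figures.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    incident_id: str
    assignments: list[str]   # teams in the order the report was assigned
    triage_minutes: float    # time until the report reached its final team


def reassignment_rate(incidents: list[Incident]) -> float:
    """Fraction of incidents assigned to more than one team."""
    reassigned = [i for i in incidents if len(i.assignments) > 1]
    return len(reassigned) / len(incidents)


def triage_time_increase(incidents: list[Incident]) -> float:
    """Assumed definition: mean triage time of reassigned incidents
    divided by mean triage time of directly assigned incidents.
    Requires at least one incident in each group."""
    direct = [i.triage_minutes for i in incidents if len(i.assignments) == 1]
    reassigned = [i.triage_minutes for i in incidents if len(i.assignments) > 1]
    return mean(reassigned) / mean(direct)


# Made-up sample data, not taken from the study.
incidents = [
    Incident("A", ["network"], 10.0),
    Incident("B", ["network", "storage"], 40.0),
    Incident("C", ["database"], 20.0),
    Incident("D", ["database", "network", "auth"], 80.0),
]

print(f"reassignment rate: {reassignment_rate(incidents):.0%}")
print(f"triage-time increase: {triage_time_increase(incidents):.2f}X")
```

Computed per service, figures of this kind would yield a range of rates similar in form to the one reported above, though the study's exact methodology may differ.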

Published In

ICSE-SEIP '19: Proceedings of the 41st International Conference on Software Engineering: Software Engineering in Practice
May 2019
339 pages

Publisher

IEEE Press

Qualifiers

  • Research-article

Conference

ICSE '19