research-article
Open access

TraStrainer: Adaptive Sampling for Distributed Traces with System Runtime State

Published: 12 July 2024

Abstract

Distributed tracing has been widely adopted in microservice systems and plays an important role in monitoring and analyzing them. However, trace data often come in large volumes, incurring substantial computational and storage costs. To reduce the quantity of traces, trace sampling has become a prominent research topic, and several methods have been proposed in prior work. Because it yields higher-quality sampling outcomes, biased sampling has gained more attention than random sampling. Previous biased sampling methods primarily judged the importance of traces by their diversity, aiming to sample more edge-case traces and fewer common-case traces. However, we contend that relying solely on trace diversity is insufficient: system runtime state is another crucial factor that must be considered, especially during system failures. In this study, we introduce TraStrainer, an online sampler that takes into account both system runtime state and trace diversity. TraStrainer employs an interpretable and automated encoding method to represent traces as vectors. At the same time, it adaptively determines sampling preferences by analyzing system runtime metrics. When sampling, it combines the results of the system-biased and diversity-biased samplers through a dynamic voting mechanism. Experimental results demonstrate that TraStrainer achieves higher-quality sampling results and significantly improves the performance of downstream root cause analysis (RCA) tasks. On average, it increases Top-1 RCA accuracy by 32.63% compared with four baselines on two datasets.
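
The abstract does not spell out how the two biased decisions are combined. As an illustration only, the sketch below shows one plausible form of a dynamic vote under a sampling budget. It assumes that each sampler has already produced a keep probability for a trace, that both votes are required when the observed sampling rate exceeds the budget, and that either vote suffices otherwise. The names `vote`, `BudgetedSampler`, `system_prob`, `diversity_prob`, and `budget` are hypothetical and are not taken from the paper or its implementation.

```python
import random


def vote(system_prob, diversity_prob, strict, rng):
    """Turn two keep probabilities into one keep/drop decision.

    Assumption (not from the paper): a strict vote needs both samplers
    to agree, a relaxed vote needs only one of them.
    """
    system_vote = rng.random() < system_prob
    diversity_vote = rng.random() < diversity_prob
    if strict:
        return system_vote and diversity_vote
    return system_vote or diversity_vote


class BudgetedSampler:
    """Hypothetical online sampler that tightens the vote when over budget."""

    def __init__(self, budget, seed=0):
        self.budget = budget  # target fraction of traces to keep
        self.seen = 0         # traces offered so far
        self.kept = 0         # traces kept so far
        self.rng = random.Random(seed)

    def offer(self, system_prob, diversity_prob):
        # Be strict only when the observed keep rate exceeds the budget.
        strict = self.seen > 0 and self.kept / self.seen > self.budget
        keep = vote(system_prob, diversity_prob, strict, self.rng)
        self.seen += 1
        self.kept += int(keep)
        return keep


if __name__ == "__main__":
    sampler = BudgetedSampler(budget=0.5)
    # A trace favored by the system-biased sampler only.
    decisions = [sampler.offer(1.0, 0.0) for _ in range(4)]
    print(decisions)
```

With a budget of 0.5 and a trace that only the system-biased sampler favors, the decisions alternate between keep and drop, because the vote switches to the strict rule each time the observed rate rises above the budget.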

Published In

Proceedings of the ACM on Software Engineering, Volume 1, Issue FSE
July 2024
2770 pages
EISSN:2994-970X
DOI:10.1145/3554322
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

Publisher

Association for Computing Machinery

New York, NY, United States

Badges

  • Distinguished Paper

Author Tags

  1. biased sampling
  2. distributed tracing
  3. microservice

Qualifiers

  • Research-article

Funding Sources

  • Guangdong Basic and Applied Basic Research Foundation
