Research article. DOI: 10.1145/3611643.3613881

STEAM: Observability-Preserving Trace Sampling

Published: 30 November 2023

Abstract

In distributed systems and microservice applications, tracing is a crucial observability signal employed for comprehending their internal states. To mitigate the overhead associated with distributed tracing, most tracing frameworks utilize a uniform sampling strategy, which retains only a subset of traces. However, this approach is insufficient for preserving system observability. This is primarily attributed to the long-tail distribution of traces in practice, which results in the omission or rarity of minority yet critical traces after sampling. In this study, we introduce an observability-preserving trace sampling method, denoted as STEAM, which aims to retain as much information as possible in the sampled traces. We employ Graph Neural Networks (GNN) for trace representation, while incorporating domain knowledge of trace comparison through logical clauses. Subsequently, we employ a scalable approach to sample traces, emphasizing mutually dissimilar traces. STEAM has been implemented on top of OpenTelemetry, comprising approximately 1.6K lines of Golang code and 2K lines of Python code. Evaluation on four benchmark microservice applications and a production system demonstrates the superior performance of our approach compared to baseline methods. Furthermore, STEAM is capable of processing 15,000 traces in approximately 4 seconds.
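The contrast between uniform sampling and sampling that emphasizes mutually dissimilar traces can be illustrated with a small sketch. This is not STEAM's algorithm: it swaps in a plain greedy max-min (farthest-point) selection, and it assumes each trace has already been reduced to a fixed-length vector, here hand-made 2-D toy points standing in for learned trace representations. The function names `sample_dissimilar` and `sample_uniform` are hypothetical.

```python
import math
import random


def euclidean(a, b):
    """Distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sample_dissimilar(embeddings, budget):
    """Greedy max-min selection: repeatedly keep the trace that is
    farthest from everything kept so far. Returns kept indices in
    selection order. Illustrative stand-in, not the paper's method."""
    if budget <= 0 or not embeddings:
        return []
    kept = [0]  # arbitrary starting trace
    # min_dist[i] = distance from trace i to its nearest kept trace
    min_dist = [euclidean(e, embeddings[0]) for e in embeddings]
    while len(kept) < min(budget, len(embeddings)):
        nxt = max(range(len(embeddings)), key=lambda i: min_dist[i])
        kept.append(nxt)
        for i, e in enumerate(embeddings):
            min_dist[i] = min(min_dist[i], euclidean(e, embeddings[nxt]))
    return kept


def sample_uniform(embeddings, budget, seed=0):
    """Baseline: every trace has the same chance of being kept."""
    rng = random.Random(seed)
    k = min(budget, len(embeddings))
    return sorted(rng.sample(range(len(embeddings)), k))


# Toy long-tail workload (assumed data): 97 near-identical common
# traces at indices 0..96 and 3 rare traces at indices 97..99.
rng = random.Random(42)
common = [[rng.gauss(0.0, 0.01), rng.gauss(0.0, 0.01)] for _ in range(97)]
rare = [[5.0, 0.0], [0.0, 5.0], [-5.0, -5.0]]
traces = common + rare

kept = sample_dissimilar(traces, budget=4)
print(sorted(kept))  # [0, 97, 98, 99]: one common trace plus all three rare ones
print(sample_uniform(traces, budget=4))  # rare traces are kept only by chance
```

With a budget of four, the max-min selection keeps one representative of the common cluster and all three rare traces, whereas a uniform draw of four from a hundred will usually miss most of the rare ones. The selection recomputes distances against every trace on each step, so it costs on the order of budget times the number of traces; it is meant to show the idea, not to scale.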


Published In

ESEC/FSE 2023: Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering
November 2023
2215 pages
ISBN:9798400703270
DOI:10.1145/3611643

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. distributed tracing
  2. graph neural network
  3. trace sampling

Qualifiers

  • Research-article

Conference

ESEC/FSE '23
Acceptance Rates

Overall Acceptance Rate 112 of 543 submissions, 21%
