DOI: 10.5555/2685048.2685066
Article

The mystery machine: end-to-end performance analysis of large-scale internet services

Published: 06 October 2014

Abstract

Current debugging and optimization methods scale poorly to deal with the complexity of modern Internet services, in which a single request triggers parallel execution of numerous heterogeneous software components over a distributed set of computers. The Achilles' heel of current methods is the need for a complete and accurate model of the system under observation: producing such a model is challenging because it requires either assimilating the collective knowledge of hundreds of programmers responsible for the individual components or restricting the ways in which components interact.
Fortunately, the scale of modern Internet services offers a compensating benefit: the sheer volume of requests serviced means that, even at low sampling rates, one can gather a tremendous amount of empirical performance observations and apply "big data" techniques to analyze those observations. In this paper, we show how one can automatically construct a model of request execution from pre-existing component logs by generating a large number of potential hypotheses about program behavior and rejecting hypotheses contradicted by the empirical observations. We also show how one can validate potential performance improvements without costly implementation effort by leveraging the variation in component behavior that arises naturally over large numbers of requests to measure the impact of optimizing individual components or changing scheduling behavior.
We validate our methodology by analyzing performance traces of over 1.3 million requests to Facebook servers. We present a detailed study of the factors that affect the end-to-end latency of such requests. We also use our methodology to suggest and validate a scheduling optimization for improving Facebook request latency.
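The hypothesize-and-reject idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: the trace format (a dict mapping a segment name to a `(start, end)` timestamp pair), the segment names, and the function `infer_happens_before` are all assumptions made for illustration, and "A always ends before B starts" is only one kind of hypothesis one might test.

```python
from itertools import permutations

def infer_happens_before(traces):
    """Return the (a, b) pairs for which the hypothesis
    'segment a ends before segment b starts' was never contradicted.

    `traces` is a hypothetical format: one dict per request,
    mapping a segment name to a (start, end) timestamp pair.
    """
    # Generate every possible ordering hypothesis over the observed segments.
    names = set()
    for trace in traces:
        names.update(trace)
    hypotheses = set(permutations(sorted(names), 2))

    # Reject any hypothesis contradicted by at least one observation.
    for trace in traces:
        for a, b in list(hypotheses):
            if a in trace and b in trace:
                a_end = trace[a][1]
                b_start = trace[b][0]
                if a_end > b_start:
                    hypotheses.discard((a, b))
    return hypotheses

# Two made-up requests; in the second, "network" overlaps "server".
traces = [
    {"server": (0, 10), "network": (10, 25), "client": (25, 40)},
    {"server": (0, 12), "network": (8, 30), "client": (30, 45)},
]
model = infer_happens_before(traces)
```

In this toy input the second request rules out "server before network", so only the orderings that end in "client" survive. A pair of segments that never appears in the same trace is never contradicted, which is one reason a large volume of observations matters for this style of inference.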


Published In

OSDI'14: Proceedings of the 11th USENIX conference on Operating Systems Design and Implementation
October 2014
676 pages
ISBN:9781931971164

Sponsors

  • USENIX Association

Publisher

USENIX Association

United States

Cited By

  • (2024) Everything Everywhere All At Once: Efficient Cross-Service Program Analysis with OverSeer. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering Workshops, pages 82-87. DOI: 10.1145/3691621.3694937. Online publication date: 27-Oct-2024
  • (2024) Unlocking the Power of Numbers: Log Compression via Numeric Token Parsing. Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, pages 919-930. DOI: 10.1145/3691620.3695474. Online publication date: 27-Oct-2024
  • (2024) Eliminating eBPF Tracing Overhead on Untraced Processes. Proceedings of the ACM SIGCOMM 2024 Workshop on eBPF and Kernel Extensions, pages 16-22. DOI: 10.1145/3672197.3673431. Online publication date: 4-Aug-2024
  • (2024) TraceWeaver: Distributed Request Tracing for Microservices Without Application Modification. Proceedings of the ACM SIGCOMM 2024 Conference, pages 828-842. DOI: 10.1145/3651890.3672254. Online publication date: 4-Aug-2024
  • (2024) LogFlux: A Software Suite for Replicating Results in Automated Log Parsing. Proceedings of the 2nd ACM Conference on Reproducibility and Replicability, pages 64-74. DOI: 10.1145/3641525.3663625. Online publication date: 18-Jun-2024
  • (2024) Decoding Log Parsing Challenges: A Comprehensive Taxonomy for Actionable Solutions. Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings, pages 392-393. DOI: 10.1145/3639478.3643523. Online publication date: 14-Apr-2024
  • (2023) Sleuth: A Trace-Based Root Cause Analysis System for Large-Scale Microservices with Graph Neural Networks. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 4, pages 324-337. DOI: 10.1145/3623278.3624758. Online publication date: 25-Mar-2023
  • (2023) Network-Centric Distributed Tracing with DeepFlow: Troubleshooting Your Microservices in Zero Code. Proceedings of the ACM SIGCOMM 2023 Conference, pages 420-437. DOI: 10.1145/3603269.3604823. Online publication date: 10-Sep-2023
  • (2023) WAFFLE: Exposing Memory Ordering Bugs Efficiently with Active Delay Injection. Proceedings of the Eighteenth European Conference on Computer Systems, pages 111-126. DOI: 10.1145/3552326.3567507. Online publication date: 8-May-2023
  • (2022) Distributed Latency Profiling through Critical Path Tracing. Communications of the ACM, 66:1, pages 44-51. DOI: 10.1145/3570522. Online publication date: 20-Dec-2022