research-article

Leveraging the short-term memory of hardware to diagnose production-run software failures

Authors:

Shan Lu

ACM SIGARCH Computer Architecture News, Volume 42, Issue 1

Pages 207 - 222

https://doi.org/10.1145/2654822.2541973

Published: 24 February 2014

Abstract

Failures caused by software bugs are widespread in production runs, causing severe losses for end users. Unfortunately, diagnosing production-run failures is challenging. Existing work cannot satisfy privacy, run-time overhead, diagnosis capability, and diagnosis latency requirements all at once.

This paper designs a low-overhead, low-latency, privacy-preserving production-run failure diagnosis system based on two observations. First, a short-term memory of program execution is often sufficient for failure diagnosis, as many bugs have short propagation distances. Second, maintaining a short-term memory of execution is much cheaper than maintaining a record of the whole execution. Following these observations, we first identify an existing hardware unit, the Last Branch Record (LBR), which records the last few taken branches and can help diagnose sequential bugs. We then propose a simple hardware extension, the Last Cache-coherence Record (LCR), which records the last few cache accesses with specified coherence states and can thus help diagnose concurrency bugs. Finally, we design LBRA and LCRA to automatically locate failure root causes using LBR and LCR.
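The "short-term memory" idea can be illustrated with a small software model: a fixed-capacity buffer that keeps only the most recent entries and silently discards older ones. The sketch below is only an illustration under stated assumptions, not the paper's hardware design or the LBRA/LCRA algorithms; the names `BranchRecord` and `LastRecordBuffer`, the addresses, and the default capacity of 16 are hypothetical choices made for this example.

```python
from collections import deque
from typing import NamedTuple


class BranchRecord(NamedTuple):
    """One taken branch (hypothetical layout for illustration)."""
    source: int  # address of the branch instruction
    target: int  # address that control transferred to


class LastRecordBuffer:
    """Fixed-capacity buffer that remembers only the most recent entries."""

    def __init__(self, capacity: int = 16):
        # deque(maxlen=...) drops the oldest entry once the buffer is full
        self._entries = deque(maxlen=capacity)

    def record(self, entry):
        self._entries.append(entry)

    def snapshot(self):
        # Returned oldest first, as one might read the buffer at failure time
        return list(self._entries)


# Simulate 100 taken branches; only the last 16 remain available.
lbr = LastRecordBuffer(capacity=16)
for i in range(100):
    lbr.record(BranchRecord(source=0x1000 + 4 * i, target=0x2000 + 4 * i))

history = lbr.snapshot()
```

The cost of recording stays constant no matter how long the program runs, which is the property the second observation relies on; the same structure could hold cache-access events instead of branches.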

Our evaluation uses 31 real-world sequential and concurrency bug failures from 18 representative open-source software projects. The results show that with just 16 record entries, LBR and LCR enable our system to automatically locate the root causes for 27 out of 31 failures, with less than 3% run-time overhead. As our system does not rely on sampling,



Index Terms

Leveraging the short-term memory of hardware to diagnose production-run software failures
1. Hardware
  1. Hardware test
  2. Robustness
2. Software and its engineering
  1. Software creation and management
    1. Software verification and validation
      1. Software defect analysis
        Software testing and debugging



Published In

ACM SIGARCH Computer Architecture News Volume 42, Issue 1

ASPLOS '14

March 2014

729 pages

ISSN:0163-5964

DOI:10.1145/2654822

Editor: Doug DeGroot

ASPLOS '14: Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
February 2014
780 pages
ISBN:9781450323055
DOI:10.1145/2541940
General Chairs: Rajeev Balasubramonian, University of Utah; Al Davis, University of Utah
Program Chair: Sarita Adve, University of Illinois at Urbana-Champ

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States



