Leveraging the short-term memory of hardware to diagnose production-run software failures

Published: 24 February 2014

Abstract

Failures caused by software bugs are widespread in production runs and cause severe losses for end users. Unfortunately, diagnosing production-run failures is challenging: existing work cannot satisfy privacy, run-time overhead, diagnosis capability, and diagnosis latency requirements all at once.
This paper designs a low-overhead, low-latency, privacy-preserving production-run failure diagnosis system based on two observations. First, a short-term memory of program execution is often sufficient for failure diagnosis, as many bugs have short propagation distances. Second, maintaining a short-term memory of execution is much cheaper than maintaining a record of the whole execution. Following these observations, we first identify an existing hardware unit, the Last Branch Record (LBR), which records the last few taken branches, to help diagnose sequential bugs. We then propose a simple hardware extension, the Last Cache-coherence Record (LCR), which records the last few cache accesses with specified coherence states, to help diagnose concurrency bugs. Finally, we design LBRA and LCRA to automatically locate failure root causes using LBR and LCR.
Our evaluation uses 31 real-world sequential and concurrency bug failures from 18 representative open-source software projects. The results show that with just 16 record entries, LBR and LCR enable our system to automatically locate the root causes of 27 out of 31 failures, with less than 3% run-time overhead. As our system does not rely on sampling,
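
The idea of a fixed-size "short-term memory" of taken branches can be illustrated with a small software simulation. This is only a sketch under stated assumptions: it does not use the hardware LBR interface and is not the paper's LBRA algorithm. The names `BranchRecord`, `LastBranchBuffer`, and `simulate`, as well as the synthetic addresses, are hypothetical and chosen for illustration. The default capacity of 16 mirrors the 16 record entries mentioned in the abstract.

```python
from collections import deque
from typing import List, NamedTuple


class BranchRecord(NamedTuple):
    """One taken branch: source and target instruction addresses."""
    source: int
    target: int


class LastBranchBuffer:
    """Hypothetical software stand-in for a fixed-capacity last-branch record."""

    def __init__(self, capacity: int = 16) -> None:
        # A deque with maxlen drops the oldest entry once it is full,
        # so memory use stays constant no matter how long the run is.
        self._entries = deque(maxlen=capacity)

    def record_taken_branch(self, source: int, target: int) -> None:
        self._entries.append(BranchRecord(source, target))

    def snapshot(self) -> List[BranchRecord]:
        """Return the retained branches, oldest first."""
        return list(self._entries)


def simulate(num_branches: int, capacity: int = 16) -> List[BranchRecord]:
    """Record synthetic branches, then snapshot the buffer at the 'failure'."""
    buffer = LastBranchBuffer(capacity)
    for i in range(num_branches):
        # Synthetic addresses (assumption): real entries would come from hardware.
        buffer.record_taken_branch(source=0x1000 + 4 * i, target=0x2000 + 4 * i)
    return buffer.snapshot()
```

Calling `simulate(100)` retains only the 16 most recent branches, which is the property the abstract relies on: the cost of recording is bounded by the buffer size and not by the length of the execution.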

Published In

ACM SIGARCH Computer Architecture News  Volume 42, Issue 1
ASPLOS '14
March 2014
729 pages
ISSN:0163-5964
DOI:10.1145/2654822

ASPLOS '14: Proceedings of the 19th international conference on Architectural support for programming languages and operating systems
February 2014
780 pages
ISBN:9781450323055
DOI:10.1145/2541940

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 24 February 2014
Published in SIGARCH Volume 42, Issue 1

Author Tags

  1. concurrency bugs
  2. failure diagnosis
  3. hardware performance monitoring unit
  4. production runs

Qualifiers

  • Research-article
