DOI: 10.1145/1669112.1669129
research-article

mSWAT: low-cost hardware fault detection and diagnosis for multicore systems

Published: 12 December 2009

Abstract

Continued technology scaling is producing systems with billions of devices. These devices are prone to failures from various sources, so even commodity systems face a growing reliability threat. Traditional solutions that rely on high redundancy, and piecemeal solutions that target specific failure modes, will no longer be viable owing to their high overheads. Recent reliability solutions have explored low-cost monitors that watch for anomalous software behavior as a symptom of hardware faults. We previously proposed the SWAT system, which uses such low-cost detectors to detect hardware faults and a higher-cost mechanism for diagnosis. However, all prior work in this context, including SWAT, assumes single-threaded applications and has not been demonstrated for multithreaded applications running on multicore systems.
This paper presents mSWAT, the first work to apply symptom-based detection and diagnosis to faults in multicore architectures running multithreaded software. For detection, we extend the symptom-based detectors in SWAT and show that they result in a very low Silent Data Corruption (SDC) rate for both permanent and transient hardware faults. For diagnosis, the multicore environment poses significant new challenges. First, the deterministic replay required for SWAT's single-threaded diagnosis incurs higher overheads for multithreaded workloads. Second, the fault may propagate to fault-free cores, resulting in symptoms from fault-free cores and no available known-good core, which breaks fundamental assumptions of SWAT's diagnosis algorithm. We propose a novel permanent-fault diagnosis algorithm for multithreaded applications running on multicore systems that uses a lightweight isolated deterministic replay to diagnose the faulty core with no prior knowledge of a known-good core. Our results show that this technique successfully diagnoses over 95% of the detected permanent faults while incurring low hardware overheads. mSWAT thus offers an affordable solution to protect future multicore systems from hardware faults.
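
As a rough illustration of how a faulty core might be identified without a known-good reference, the sketch below assumes that each recorded thread trace can be replayed in isolation on several cores and that each replay yields a comparable digest. The function `diagnose_faulty_core`, the `replay_results` layout, and the majority-vote rule are all hypothetical choices made for this example; the abstract does not specify the actual mSWAT algorithm at this level of detail.

```python
from collections import Counter


def diagnose_faulty_core(replay_results):
    """Return the core whose replays most often disagree with the majority.

    replay_results maps trace_id -> {core_id: digest}, where digest is any
    hashable summary of replaying that trace on that core. This layout is an
    assumption for illustration only.
    Returns None when no single core stands out.
    """
    suspects = Counter()
    for per_core in replay_results.values():
        majority_digest, votes = Counter(per_core.values()).most_common(1)[0]
        if votes <= len(per_core) // 2:
            continue  # no strict majority, so this trace is inconclusive
        for core_id, digest in per_core.items():
            if digest != majority_digest:
                suspects[core_id] += 1

    ranked = suspects.most_common(2)
    if not ranked:
        return None  # every replay agreed
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # two cores are equally suspicious
    return ranked[0][0]


# Three cores replay two traces; core 2 disagrees both times.
example = {
    "trace0": {0: "a1", 1: "a1", 2: "ff"},
    "trace1": {0: "b7", 1: "b7", 2: "00"},
}
print(diagnose_faulty_core(example))  # 2
```

Voting across several replays is what removes the need for a trusted core in this sketch: no single core's output is assumed correct, and a trace without a strict majority is simply ignored.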

Published In

MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture
December 2009
601 pages
ISBN: 9781605587981
DOI: 10.1145/1669112

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

1. architecture
2. error detection
3. fault injection
4. multicore processors

Qualifiers

• Research-article

Conference

Micro-42

Acceptance Rates

Overall Acceptance Rate: 484 of 2,242 submissions, 22%
