DOI: 10.1145/2694344.2694348

Memory Errors in Modern Systems: The Good, The Bad, and The Ugly

Published: 14 March 2015

Abstract

Several recent publications have shown that hardware faults in the memory subsystem are commonplace. These faults are predicted to become more frequent in future systems that contain orders of magnitude more DRAM and SRAM than found in current memory subsystems. These memory subsystems will need to provide resilience techniques to tolerate these faults when deployed in high-performance computing systems and data centers containing tens of thousands of nodes. Therefore, it is critical to understand the efficacy of current hardware resilience techniques to determine whether they will be suitable for future systems. In this paper, we present a study of DRAM and SRAM faults and errors from the field. We use data from two leadership-class high-performance computer systems to analyze the reliability impact of hardware resilience schemes that are deployed in current systems. Our study has several key findings about the efficacy of many currently deployed reliability techniques such as DRAM ECC, DDR address/command parity, and SRAM ECC and parity. We also perform a methodological study, and find that counting errors instead of faults, a common practice among researchers and data center operators, can lead to incorrect conclusions about system reliability. Finally, we use our data to project the needs of future large-scale systems. We find that SRAM faults are unlikely to pose a significantly larger reliability threat in the future, while DRAM faults will be a major concern and stronger DRAM resilience schemes will be needed to maintain acceptable failure rates similar to those found on today's systems.
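
The methodological point about counting errors instead of faults can be illustrated with a small sketch. A fault is the underlying hardware defect, and an error is one observed manifestation of it, so a single permanent fault can be logged many times. The log below is entirely hypothetical (node names, device names, and counts are invented for illustration), and the rule "all errors on the same device come from one fault" is a simplifying assumption, not the classification method used in the paper.

```python
from collections import Counter

# Hypothetical error log: one (node, dram_device) entry per logged error.
# Simplifying assumption: all errors on the same device stem from one fault.
ERROR_LOG = (
    [("node-a", "dimm0-chip3")] * 500   # one permanent fault, reported repeatedly
    + [("node-b", "dimm1-chip0")]       # three separate faults, one error each
    + [("node-b", "dimm2-chip5")]
    + [("node-b", "dimm3-chip1")]
)


def errors_per_node(log):
    """Count raw logged errors per node."""
    return Counter(node for node, _device in log)


def faults_per_node(log):
    """Count distinct faulty devices per node (one fault per device assumed)."""
    unique_faults = set(log)
    return Counter(node for node, _device in unique_faults)


errors = errors_per_node(ERROR_LOG)
faults = faults_per_node(ERROR_LOG)

print(sorted(errors.items()))  # [('node-a', 500), ('node-b', 3)]
print(sorted(faults.items()))  # [('node-a', 1), ('node-b', 3)]
```

Ranked by error count, node-a looks far less reliable than node-b. Ranked by fault count, node-b has three times as many faulty devices. This is the kind of reversal that can lead to incorrect conclusions when errors are used as a proxy for faults.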


      Published In

      ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems
      March 2015
      720 pages
      ISBN:9781450328357
      DOI:10.1145/2694344
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. field studies
      2. large-scale systems
      3. reliability

      Qualifiers

      • Research-article

      Funding Sources

      • United States Department of Energy

      Conference

      ASPLOS '15

      Acceptance Rates

      ASPLOS '15 Paper Acceptance Rate 48 of 287 submissions, 17%;
      Overall Acceptance Rate 535 of 2,713 submissions, 20%
