DOI: 10.5555/3571885.3571986
research-article

From correctable memory errors to uncorrectable memory errors: what error bits tell

Published: 18 November 2022

Abstract

Uncorrectable memory errors are a major cause of failures in datacenters. In this paper, we present an empirical study correlating correctable errors (CEs) with uncorrectable errors (UEs), using large-scale field data spanning three major dual inline memory module (DIMM) manufacturers from a contemporary ByteDance server farm. Unlike previous studies, ours is the first to take into account the error-bit information of CEs and the DIMM part numbers. Compared with the traditional chipkill error correction code (ECC), the ECC in contemporary Intel server platforms is weakened and cannot tolerate some error-bit patterns from a single chip. Using the coarse-grained ECC knowledge that is obtainable, we derive a new indicator from the error-bit information: the occurrence of risky CEs, defined in terms of the ECC's guaranteed coverage. The data show that the new indicator has consistently high sensitivity and specificity as a test for future UE occurrence across DIMMs from different manufacturers. This leads us to conjecture that the weakened ECC contributes substantially to many UEs today. We then apply the risky CE indicator to predict future UE occurrence from the CE history. We empirically demonstrate how practically useful predictors can be constructed by combining it with other attributes, such as certain micro-level fault indicators and DIMM part numbers, achieving state-of-the-art performance.
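The abstract evaluates the risky CE indicator as a diagnostic test, reporting its sensitivity and specificity for future UE occurrence. As a minimal sketch of how those two quantities are computed for any binary indicator, the Python below uses a hypothetical `DimmRecord` type and made-up toy counts; neither the field names nor the numbers come from the paper, and the paper's actual definition of a risky CE is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimmRecord:
    """Hypothetical per-DIMM summary; field names are illustrative assumptions."""
    had_risky_ce: bool  # indicator: at least one risky CE was observed
    had_later_ue: bool  # outcome: a UE occurred afterwards


def sensitivity_specificity(records):
    """Return (sensitivity, specificity) of the indicator against the UE outcome."""
    tp = sum(r.had_risky_ce and r.had_later_ue for r in records)
    fn = sum((not r.had_risky_ce) and r.had_later_ue for r in records)
    tn = sum((not r.had_risky_ce) and (not r.had_later_ue) for r in records)
    fp = sum(r.had_risky_ce and (not r.had_later_ue) for r in records)
    # Sensitivity: fraction of DIMMs with a later UE that the indicator flagged.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: fraction of DIMMs without a later UE that were not flagged.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Made-up toy data for illustration only; not taken from the paper.
toy = (
    [DimmRecord(True, True)] * 8
    + [DimmRecord(False, True)] * 2
    + [DimmRecord(True, False)] * 5
    + [DimmRecord(False, False)] * 85
)
sens, spec = sensitivity_specificity(toy)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
# prints: sensitivity=0.80 specificity=0.94
```

An indicator is useful as a test only when both values are high: high sensitivity alone can be achieved by flagging every DIMM, and high specificity alone by flagging none.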

Supplementary Material

MP4 File (SC22_Presentation_Li_Cong.mp4)
Presentation at SC '22


Published In

SC '22: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
November 2022
1277 pages
ISBN:9784665454445

In-Cooperation

  • IEEE CS

Publisher

IEEE Press

Author Tags

  1. error correction code
  2. memory reliability
  3. risky errors
  4. uncorrectable error prediction

Qualifiers

  • Research-article

Conference

SC '22
Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
