DOI: 10.1109/SC41406.2024.00061

Versatile Datapath Soft Error Detection on the Cheap for HPC Applications

Published: 17 November 2024

Abstract

With the ongoing reduction in technology sizes and voltage levels, modern microprocessors are increasingly susceptible to soft errors that corrupt datapath units during program execution. While these errors have received considerable attention recently, existing solutions either confine themselves to limited scopes or incur massive performance and power overheads, hindering practical use. In this work, we propose ConDa, a novel error detection technique based on code transformation and static program analysis that achieves versatile datapath protection at low cost. At compile time, ConDa analyzes program characteristics and transforms the original program code without complicating its control flow or memory access patterns. At runtime, ConDa detects datapath errors with low overhead and latency. An evaluation on 38 benchmarks and a parallel HPC simulation shows that ConDa incurs only 57.79% runtime overhead, which is 41.84% faster than the existing state of the art, with the same level of error detection effectiveness and low detection latency.
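
To make the idea of software-level datapath error detection concrete, the following is a minimal sketch of the generic duplicate-and-compare scheme that this line of work builds on. It is not ConDa's transformation, which the abstract does not specify. The names `flip_bit`, `checked_dot`, `SoftErrorDetected`, and the `fault` injection hook are all hypothetical and exist only for illustration.

```python
class SoftErrorDetected(Exception):
    """Raised when the primary and shadow results disagree."""


def flip_bit(value: int, bit: int) -> int:
    """Model a single-bit soft error by flipping one bit of an integer."""
    return value ^ (1 << bit)


def checked_dot(xs, ys, fault=None):
    """Integer dot product protected by a redundant shadow computation.

    fault is a hypothetical injection hook: an (index, bit) pair that
    corrupts the primary product at that loop index, standing in for a
    transient fault in a datapath unit.
    """
    primary = 0
    shadow = 0
    for i, (x, y) in enumerate(zip(xs, ys)):
        p = x * y  # primary computation
        s = x * y  # duplicated computation on the same inputs
        if fault is not None and fault[0] == i:
            p = flip_bit(p, fault[1])  # simulated soft error
        primary += p
        shadow += s
    # One comparison at the end; checking more often would shorten
    # detection latency at the price of more runtime overhead.
    if primary != shadow:
        raise SoftErrorDetected(f"primary={primary}, shadow={shadow}")
    return primary


if __name__ == "__main__":
    print(checked_dot([1, 2, 3], [4, 5, 6]))
    try:
        checked_dot([1, 2, 3], [4, 5, 6], fault=(1, 3))
    except SoftErrorDetected as err:
        print("detected:", err)
```

The sketch duplicates every arithmetic operation, which is where the overhead of such schemes comes from. Where the comparison is placed determines the trade-off between runtime overhead and detection latency that the abstract refers to.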

Supplemental Material

MP4 File
Recorded presentation of "Versatile Datapath Soft Error Detection on the Cheap for HPC Applications" at SC24.

Published In

SC '24: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis
November 2024
1758 pages
ISBN: 9798350352917

Publisher

IEEE Press

Author Tags

  1. Code Transformation
  2. Compiler
  3. Datapath Protection
  4. High-Performance Computing (HPC)
  5. Reliability
  6. Soft Errors

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SC '24

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Article Metrics

  • Total Citations: 0
  • Total Downloads: 137
  • Downloads (Last 12 months): 137
  • Downloads (Last 6 weeks): 69

Reflects downloads up to 15 Jan 2025
