research-article
Open access

ArmorAll: Compiler-based Resilience Targeting GPU Applications

Published: 29 May 2020

Abstract

The vulnerability of GPUs to soft errors has become a first-class design concern as GPUs are increasingly used in accuracy-sensitive and safety-critical domains. Existing solutions for enhancing GPU reliability incur significant overhead in area, power, and/or performance. In this article, we propose ArmorAll, a lightweight, adaptive, selective, and portable software solution to protect GPUs against soft errors. ArmorAll consists of a set of purely compiler-based redundancy schemes designed to optimize instruction duplication on GPUs, thereby enabling much more reliable execution. The choice of scheme determines the subset of instructions that must be duplicated in an application, allowing fault coverage to be adapted to different applications. ArmorAll can intelligently select the redundancy scheme that provides the best coverage for a given application with an accuracy of 91.7%. With the selected redundancy scheme, ArmorAll delivers this high coverage while improving runtime by 64.5% on average compared to the state of the art.
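
To make the idea of selective instruction duplication concrete, the following is a minimal sketch in Python over a toy three-address IR. It is not ArmorAll's implementation and does not reproduce any of its redundancy schemes: the IR, the `duplicate` pass, the `run` interpreter, the single-bit fault model, and the two example schemes (`all_instrs`, `mul_only`) are all hypothetical names introduced here for illustration. The sketch assumes each destination register is written once and never appears among its own sources, so a shadow copy can safely recompute from the original operands.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Toy instruction: (op, dest, src1, src2). Hypothetical IR, not LLVM or PTX.
Instr = Tuple[str, str, str, str]

OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}


class FaultDetected(Exception):
    """Raised when a duplicated instruction disagrees with its shadow copy."""


def duplicate(program: List[Instr], select: Callable[[Instr], bool]) -> List[Instr]:
    """Duplicate only the instructions chosen by `select` (the 'scheme').

    Assumes each dest is written once and is not one of its own sources.
    """
    out: List[Instr] = []
    for ins in program:
        op, dest, a, b = ins
        out.append(ins)
        if select(ins):
            shadow = dest + "_dup"
            out.append((op, shadow, a, b))          # redundant recomputation
            out.append(("check", dest, shadow, ""))  # compare primary and shadow
    return out


def run(program: List[Instr], regs: Dict[str, int],
        fault_at: Optional[int] = None) -> Dict[str, int]:
    """Interpret the program; optionally flip bit 3 of one instruction's result."""
    regs = dict(regs)
    for i, (op, dest, a, b) in enumerate(program):
        if op == "check":
            if regs[dest] != regs[a]:
                raise FaultDetected(f"mismatch on {dest}")
            continue
        value = OPS[op](regs[a], regs[b])
        if i == fault_at:
            value ^= 1 << 3  # assumed fault model: one transient bit flip
        regs[dest] = value
    return regs


# out = (a * b + c) * a
PROGRAM: List[Instr] = [("mul", "t0", "a", "b"),
                        ("add", "t1", "t0", "c"),
                        ("mul", "out", "t1", "a")]
INPUTS = {"a": 3, "b": 4, "c": 5}

all_instrs = lambda ins: True            # full duplication
mul_only = lambda ins: ins[0] == "mul"   # a cheaper, partial scheme
```

In this toy setting, `duplicate(PROGRAM, mul_only)` yields 7 instructions while `duplicate(PROGRAM, all_instrs)` yields 9. A fault injected into the unprotected `add` passes the checks under `mul_only` and silently corrupts `out`, whereas full duplication raises `FaultDetected`. This is the coverage-versus-overhead trade-off that choosing a scheme controls.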

Published In

ACM Transactions on Architecture and Code Optimization, Volume 17, Issue 2
June 2020
169 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/3403597
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 29 May 2020
Online AM: 07 May 2020
Accepted: 01 February 2020
Revised: 01 December 2019
Received: 01 July 2019
Published in TACO Volume 17, Issue 2

Author Tags

  1. GPUs
  2. LLVM
  3. fault tolerance
  4. soft errors

Qualifiers

  • Research-article
  • Research
  • Refereed
