  • Venkatesha S and Parthasarathi R. (2024). Survey on Redundancy Based-Fault tolerance methods for Processors and Hardware accelerators - Trends in Quantum Computing, Heterogeneous Systems and Reliability. ACM Computing Surveys. 56:11. (1-76). Online publication date: 30-Nov-2024.

    https://doi.org/10.1145/3663672

  • Huang Y, Di S, Zhang Z, Lu X and Li G. Versatile Datapath Soft Error Detection on the Cheap for HPC Applications. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis. (1-15).

    https://doi.org/10.1109/SC41406.2024.00061

  • Kaya E and Öz I. (2024). Compiler-Managed Replication of CUDA Kernels for Reliable Execution of GPGPU Applications. Journal of Circuits, Systems and Computers. 10.1142/S0218126624502542. 33:14. Online publication date: 30-Sep-2024.

    https://www.worldscientific.com/doi/10.1142/S0218126624502542

  • He Z, Xu H and Li G. (2024). A Fast Low-Level Error Detection Technique. 2024 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). 10.1109/DSN58291.2024.00023. 979-8-3503-4105-8. (90-98).

    https://ieeexplore.ieee.org/document/10646930/

  • Rahman M, Di S, Guo S, Lu X, Li G and Cappello F. (2024). Druto: Upper-Bounding Silent Data Corruption Vulnerability in GPU Applications. 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 10.1109/IPDPS57955.2024.00058. 979-8-3503-8711-7. (582-594).

    https://ieeexplore.ieee.org/document/10579167/

  • Wei X, Jiang N, Yue H, Wang X, Zhao J, Li G and Qiu M. ApproxDup: Developing an Approximate Instruction Duplication Mechanism for Efficient SDC Detection in GPGPUs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. 10.1109/TCAD.2023.3330821. 43:4. (1051-1064).

    https://ieeexplore.ieee.org/document/10312777/

  • Braga G, Gobatto L, Gonçalves M and Rodrigo Azambuja J. (2024). Improving GPU Reliability with Software-Managed Pipeline Parity for Error Detection and Correction. 2024 IEEE 15th Latin America Symposium on Circuits and Systems (LASCAS). 10.1109/LASCAS60203.2024.10538027. 979-8-3503-8122-1. (1-5).

    https://ieeexplore.ieee.org/document/10538027/

  • Braga G, Gobatto L, Gonçalves M and Azambuja J. (2024). An Investigation into Fault Detection and Correction in GPU Pipelines with a Hybrid XOR Approach. 2024 IEEE 15th Latin America Symposium on Circuits and Systems (LASCAS). 10.1109/LASCAS60203.2024.10506130. 979-8-3503-8122-1. (1-5).

    https://ieeexplore.ieee.org/document/10506130/

  • Braga G, Gobatto L, Gonçalves M and Azambuja J. (2024). An Investigation into Fault Detection and Correction in GPU Pipelines with a Hybrid XOR Approach. 2024 IEEE 15th Latin America Symposium on Circuits and Systems (LASCAS). 10.1109/LASCAS60203.2024.10506122. 979-8-3503-8122-1. (1-5).

    https://ieeexplore.ieee.org/document/10506122/

  • He Z, Huang Y, Xu H, Tao D and Li G. Demystifying and Mitigating Cross-Layer Deficiencies of Soft Error Protection in Instruction Duplication. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. (1-13).

    https://doi.org/10.1145/3581784.3607078

  • Wei X, Wu Y, Jiang N and Yue H. Detecting SDCs in GPGPUs Through Efficient Partial Thread Redundancy. Algorithms and Architectures for Parallel Processing. (224-239).

    https://doi.org/10.1007/978-981-97-0862-8_14

  • Wei X, Zhao J, Jiang N and Yue H. GLAM-SERP: Building a Graph Learning-Assisted Model for Soft Error Resilience Prediction in GPGPUs. Algorithms and Architectures for Parallel Processing. (419-435).

    https://doi.org/10.1007/978-981-97-0859-8_25

  • Huang Y, He Z, Li L and Li G. (2023). Characterizing Runtime Performance Variation in Error Detection by Duplicating Instructions. 2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE). 10.1109/ISSRE59848.2023.00043. 979-8-3503-1594-3. (730-741).

    https://ieeexplore.ieee.org/document/10301234/

  • Braga G, Gonçalves M and Azambuja J. (2023). Software-controlled pipeline parity in GPU architectures for error detection. Microelectronics Reliability. 10.1016/j.microrel.2023.115155. 148. (115155). Online publication date: 1-Sep-2023.

    https://linkinghub.elsevier.com/retrieve/pii/S002627142300255X

  • Braga G, Gonçalves M and Azambuja J. (2023). Evaluating an XOR-based Hybrid Fault Tolerance Technique to Detect Faults in GPU Pipelines. 2023 IEEE Computer Society Annual Symposium on VLSI (ISVLSI). 10.1109/ISVLSI59464.2023.10238657. 979-8-3503-2769-4. (1-6).

    https://ieeexplore.ieee.org/document/10238657/

  • Zhang B, Yang L, Li G and Xu H. (2023). Investigating the Impact of High-Level Software Design on Low-Level Hardware Fault Resilience. 2023 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks - Supplemental Volume (DSN-S). 10.1109/DSN-S58398.2023.00044. 979-8-3503-2545-4. (163-167).

    https://ieeexplore.ieee.org/document/10206895/

  • Huang Y, Guo S, Di S, Li G and Cappello F. Mitigating silent data corruptions in HPC applications across multiple program inputs. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. (1-14).

    /doi/10.5555/3571885.3571907

  • Huang Y, Guo S, Di S, Li G and Cappello F. (2022). Mitigating Silent Data Corruptions in HPC Applications across Multiple Program Inputs. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis. 10.1109/SC41404.2022.00022. 978-1-6654-5444-5. (1-14).

    https://ieeexplore.ieee.org/document/10046091/

  • Ma J, Duan Z and Tang L. (2022). Deep Soft Error Propagation Modeling Using Graph Attention Network. Journal of Electronic Testing: Theory and Applications. 38:3. (303-319). Online publication date: 1-Jun-2022.

    https://doi.org/10.1007/s10836-022-06005-y

  • Goncalves M, Condia J, Reorda M, Sterpone L and Azambuja J. (2022). Evaluating low-level software-based hardening techniques for configurable GPU architectures. The Journal of Supercomputing. 78:6. (8081-8105). Online publication date: 1-Apr-2022.

    https://doi.org/10.1007/s11227-021-04154-z

  • Yue H, Wei X, Li G, Zhao J, Jiang N and Tan J. G-SEPM. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. (1-15).

    https://doi.org/10.1145/3458817.3476170

  • Öz I and Karadaş Ö. (2021). Regional soft error vulnerability and error propagation analysis for GPGPU applications. The Journal of Supercomputing. 10.1007/s11227-021-04026-6.

    https://link.springer.com/10.1007/s11227-021-04026-6

  • Wei X, Jiang N, Wang X and Yue H. Detecting SDCs in GPGPUs Through an Efficient Instruction Duplication Mechanism. Knowledge Science, Engineering and Management. (571-584).

    https://doi.org/10.1007/978-3-030-82153-1_47

  • Ma J, Duan Z and Tang L. (2021). GATPS: An attention-based graph neural network for predicting SDC-causing instructions. 2021 IEEE 39th VLSI Test Symposium (VTS). 10.1109/VTS50974.2021.9441056. 978-1-6654-1949-9. (1-7).

    https://ieeexplore.ieee.org/document/9441056/

  • Anwer A, Li G, Pattabiraman K, Sullivan M, Tsai T and Hari S. GPU-trident. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. (1-15).

    /doi/10.5555/3433701.3433818

  • Anwer A, Li G, Pattabiraman K, Sullivan M, Tsai T and Hari S. (2020). GPU-Trident: Efficient Modeling of Error Propagation in GPU Programs. SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. 10.1109/SC41405.2020.00092. 978-1-7281-9998-6. (1-15).

    https://ieeexplore.ieee.org/document/9355257/

  • Previlon F, Kalra C, Tiwari D and Kaeli D. Characterizing and Exploiting Soft Error Vulnerability Phase Behavior in GPU Applications. IEEE Transactions on Dependable and Secure Computing. 10.1109/TDSC.2020.2991136. (1-1).

    https://ieeexplore.ieee.org/document/9080079/