DOI: 10.1145/3572848.3577496
research-article
Public Access

Improving Energy Saving of One-Sided Matrix Decompositions on CPU-GPU Heterogeneous Systems

Published: 21 February 2023

Editorial Notes

The authors have requested minor, non-substantive changes to the Version of Record and, in accordance with ACM policies, a Corrected Version of Record was published on April 27, 2023. For reference purposes, the VoR may still be accessed via the Supplemental Material section on this page.

Abstract

One-sided dense matrix decompositions (e.g., Cholesky, LU, and QR) are key components of scientific computing in many different fields. Although their design has been highly optimized for modern processors, they still consume a considerable amount of energy. As CPU-GPU heterogeneous systems are commonly used for matrix decompositions, in this work we aim to further improve the energy saving of one-sided matrix decompositions on such systems. We first build an Algorithm-Based Fault Tolerance protected overclocking technique (ABFT-OC) that enables reliable overclocking for key matrix decomposition operations. We then design an energy-saving matrix decomposition framework, Bi-directional Slack Reclamation (BSR), that intelligently combines the capabilities provided by ABFT-OC and dynamic voltage and frequency scaling (DVFS) to maximize energy saving while maintaining performance and reliability. Experiments show that BSR saves up to 11.7% more energy than the current best energy-saving optimization approach with no performance degradation, and achieves up to 14.1% Energy×Delay² reduction. BSR also enables a Pareto-efficient performance-energy trade-off, providing up to 1.43× performance improvement without costing extra energy.
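
The abstract does not describe how ABFT detects the errors that overclocking may introduce. As a rough illustration only, the sketch below uses the classic checksum scheme for a matrix product C = A·B, which relies on the identity eᵀC = (eᵀA)B, where e is the all-ones vector. It is not the ABFT-OC implementation: the function names, the tolerance, and the injected error are all assumptions made for this example.

```python
# Illustrative sketch only: classic checksum-based ABFT for C = A * B.
# Function names and the tolerance are assumptions, not taken from the paper.

def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def column_sums(m):
    """Return the row vector e^T * m, i.e. the sum of each column of m."""
    return [sum(row[j] for row in m) for j in range(len(m[0]))]

def find_corrupted_columns(a, b, c, tol=1e-9):
    """Compare e^T C with (e^T A) B and return the column indices that disagree."""
    expected = matmul([column_sums(a)], b)[0]  # checksum predicted from the inputs
    actual = column_sums(c)                    # checksum of the computed result
    return [j for j, (x, y) in enumerate(zip(expected, actual)) if abs(x - y) > tol]

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]

c = matmul(a, b)
clean = find_corrupted_columns(a, b, c)    # no mismatch expected on a correct result

c[1][0] += 0.5                             # simulate a soft error in one element
faulty = find_corrupted_columns(a, b, c)   # the affected column should be flagged
```

A row checksum built the same way from A(Be) would locate the affected row as well, which is what allows a single corrupted element to be corrected rather than only detected.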

Supplementary Material

3577496-vor (3577496-vor.pdf)
Version of Record for "Improving Energy Saving of One-Sided Matrix Decompositions on CPU-GPU Heterogeneous Systems" by Chen et al., Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming (PPoPP '23).

Published In

PPoPP '23: Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming
February 2023
480 pages
ISBN: 9798400700156
DOI: 10.1145/3572848
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. GPU
2. energy saving
3. fault tolerance
4. matrix decomposition

Qualifiers

• Research-article

Funding Sources

• U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Scientific Discovery through the Advanced Computing (SciDAC) program
• National Science Foundation

Conference

PPoPP '23

Acceptance Rates

Overall Acceptance Rate: 230 of 1,014 submissions (23%)
