A STT-RAM-based low-power hybrid register file for GPGPUs

Authors:

Henry Hoffmann,

Huazhong Yang

DAC '15: Proceedings of the 52nd Annual Design Automation Conference

Article No.: 103, Pages 1 - 6

https://doi.org/10.1145/2744769.2744785

Published: 07 June 2015

Abstract

Recently, general-purpose graphics processing units (GPGPUs) have been widely used to accelerate computing in various applications. To store the contexts of thousands of concurrent threads, a GPU employs a large register file built from static random-access memory (SRAM). Because of the high leakage power of SRAM, the register file accounts for 20% to 40% of total GPU power consumption. Hybrid memory systems, which combine SRAM with emerging non-volatile memory (NVM), have therefore been employed for register file design on GPUs. Although they have shown strong potential to alleviate the GPU power problem, existing hybrid memory solutions might not exploit the intrinsic features of the GPU register file. By leveraging the warp schedule on the GPU, this paper proposes a hybrid register architecture that consists of an NVM-based register file and mixed SRAM-based write buffers with a warp-aware write-back strategy. Simulation results show that our design eliminates 64% of write accesses to NVM and reduces register file power by 66% on average, with only 4.2% performance degradation. After power gating is applied, register file power is further reduced to 25% of that of the SRAM counterpart on average.
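
The abstract does not spell out how the write buffers or the write-back policy work, so the following is only a toy sketch of the general idea: a small SRAM buffer absorbs repeated writes to the same register, and a warp's buffered values are drained to NVM when that warp is descheduled. The class name `HybridRegisterFile`, the `buffer_entries` capacity, the oldest-first eviction rule, and the `on_warp_descheduled` hook are all hypothetical choices made for illustration. They are not taken from the paper.

```python
from collections import OrderedDict


class HybridRegisterFile:
    """Toy model (assumption, not the paper's design): an NVM register
    file fronted by a small SRAM write buffer."""

    def __init__(self, buffer_entries=4):
        self.buffer_entries = buffer_entries  # hypothetical buffer capacity
        self.buffer = OrderedDict()           # (warp, reg) -> value, oldest first
        self.nvm = {}                         # (warp, reg) -> value
        self.nvm_writes = 0                   # writes that actually reach NVM
        self.total_writes = 0                 # writes issued by the pipeline

    def _write_nvm(self, key, value):
        self.nvm[key] = value
        self.nvm_writes += 1

    def write(self, warp, reg, value):
        self.total_writes += 1
        key = (warp, reg)
        if key in self.buffer:
            # Overwrite in SRAM: the older value never reaches NVM.
            self.buffer[key] = value
            self.buffer.move_to_end(key)
            return
        if len(self.buffer) >= self.buffer_entries:
            # Buffer full: evict the oldest entry to NVM (assumed policy).
            old_key, old_value = self.buffer.popitem(last=False)
            self._write_nvm(old_key, old_value)
        self.buffer[key] = value

    def read(self, warp, reg):
        key = (warp, reg)
        if key in self.buffer:
            return self.buffer[key]           # newest value lives in SRAM
        return self.nvm.get(key)

    def on_warp_descheduled(self, warp):
        """Warp-aware write back (assumed): drain only this warp's entries."""
        for key in [k for k in self.buffer if k[0] == warp]:
            self._write_nvm(key, self.buffer.pop(key))

    def eliminated_fraction(self):
        """Fraction of issued writes that never reached NVM."""
        if self.total_writes == 0:
            return 0.0
        return 1.0 - self.nvm_writes / self.total_writes


rf = HybridRegisterFile(buffer_entries=2)
for value in (10, 11, 12):
    rf.write(warp=0, reg=1, value=value)      # three writes to one register
rf.write(warp=0, reg=2, value=5)
rf.on_warp_descheduled(0)                     # two NVM writes for four issued
print(rf.eliminated_fraction())               # 0.5
```

In this sketch the saving comes entirely from overwrites that hit the buffer. The 64% figure reported in the abstract comes from the authors' simulations and cannot be reproduced by a model this simple.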

Index Terms

Computing methodologies → Computer graphics → Graphics systems and interfaces → Graphics processors
Hardware → Integrated circuits → Semiconductor memory

Published In

DAC '15: Proceedings of the 52nd Annual Design Automation Conference

June 2015

1204 pages

ISBN: 9781450335201

DOI: 10.1145/2744769

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGDA: ACM Special Interest Group on Design Automation

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

The Importation and Development of High-Caliber Talents Project of Beijing Municipal Institutions
973 project
National Natural Science Foundation of China
Tsinghua University Initiative Scientific Research Program
Brain Inspired Computing Research, Tsinghua University

Conference

DAC '15: The 52nd Annual Design Automation Conference 2015

Sponsor: SIGDA

June 7 - 11, 2015

San Francisco, California

Acceptance Rates

Overall Acceptance Rate 1,770 of 5,499 submissions, 32%

Article Metrics

Total Citations: 24
Total Downloads: 299
Downloads (Last 12 months): 25
Downloads (Last 6 weeks): 1

Reflects downloads up to 20 Jan 2025

Cited By

Shoushtary M, Arnau J, Murgadas J, Gonzalez A (2024). Memento: An Adaptive, Compiler-Assisted Register File Cache for GPUs. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), (978-990). Online publication date: 29-Jun-2024. https://doi.org/10.1109/ISCA59077.2024.00075

Jeong E, Park E, Koo G, Oh Y, Yoon M (2024). Conflict-aware compiler for hierarchical register file on GPUs. Journal of Systems Architecture, (103099). Online publication date: Feb-2024. https://doi.org/10.1016/j.sysarc.2024.103099

Inci A, Isgenc M, Marculescu D (2023). Efficient Deep Learning Using Non-volatile Memory Technology in GPU Architectures. Embedded Machine Learning for Cyber-Physical, IoT, and Edge Computing, (225-252). Online publication date: 1-Oct-2023. https://doi.org/10.1007/978-3-031-19568-6_8

Inci A, Isgenc M, Marculescu D (2022). DeepNVM++: Cross-Layer Modeling and Optimization Framework of Nonvolatile Memories for Deep Learning. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 41:10, (3426-3437). Online publication date: Oct-2022. https://doi.org/10.1109/TCAD.2021.3127148

Yue H, Wei X, Tan J, Jiang N, Qiu M (2022). Eff-ECC: Protecting GPGPUs Register File With a Unified Energy-Efficient ECC Mechanism. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 41:7, (2080-2093). Online publication date: Jul-2022. https://doi.org/10.1109/TCAD.2021.3104529

Oh Y, Jeong I, Ro W, Yoon M (2022). CASH-RF: A Compiler-Assisted Hierarchical Register File in GPUs. IEEE Embedded Systems Letters, 14:4, (187-190). Online publication date: Dec-2022. https://doi.org/10.1109/LES.2022.3163749

Ze T, Jun Z, Xianglong R, Feihu F, Yue C (2022). Survey of Shared Register File design for Unified Shader Array in GPUs. 2022 IEEE 9th International Conference on Cyber Security and Cloud Computing (CSCloud)/2022 IEEE 8th International Conference on Edge Computing and Scalable Cloud (EdgeCom), (201-206). Online publication date: Jun-2022. https://doi.org/10.1109/CSCloud-EdgeCom54986.2022.00042

Jeong I, Oh Y, Ro W, Yoon M (2022). TEA-RC: Thread Context-Aware Register Cache for GPUs. IEEE Access, 10, (82049-82062). Online publication date: 2022. https://doi.org/10.1109/ACCESS.2022.3196149

Inci A, Meric Isgenc M, Marculescu D (2020). DeepNVM: A Framework for Modeling and Analysis of Non-Volatile Memory Technologies for Deep Learning Applications. 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE), (1295-1298). Online publication date: Mar-2020. https://doi.org/10.23919/DATE48585.2020.9116263

Deng Q, Zhang Y, Zhao Z, Zhang S, Zhang M, Yang J (2020). FRF: Toward Warp-Scheduler Friendly STT-RAM/SRAM Fine-Grained Hybrid GPGPU Register File Design. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 39:10, (2396-2409). Online publication date: Oct-2020. https://doi.org/10.1109/TCAD.2019.2946808
