DOI: 10.1145/2744769.2744785
research-article

A STT-RAM-based low-power hybrid register file for GPGPUs

Published: 07 June 2015

Abstract

General-purpose graphics processing units (GPGPUs) are now widely used to accelerate computation in many applications. To store the contexts of thousands of concurrent threads, a GPU employs a large register file built from static random-access memory (SRAM). Because SRAM has high leakage power, the register file accounts for 20% to 40% of total GPU power consumption. Hybrid memory systems, which combine SRAM with emerging non-volatile memory (NVM), have therefore been applied to GPU register file design. Although they show strong potential to alleviate the GPU power problem, existing hybrid memory solutions might not exploit the intrinsic features of the GPU register file. By leveraging warp scheduling on the GPU, this paper proposes a hybrid register architecture that consists of an NVM-based register file and mixed SRAM-based write buffers with a warp-aware write-back strategy. Simulation results show that the design eliminates 64% of write accesses to NVM and reduces register file power by 66% on average, with only 4.2% performance degradation. With power gating applied, register file power is further reduced to 25% of that of the SRAM counterpart on average.
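
To put the abstract's figures in context, the following is a minimal first-order sketch of what the reported register file savings would imply for total GPU power. It is not taken from the paper: it assumes that the 20% to 40% register file share and the reported reductions refer to the same baseline, that no other component's power changes, and that the 4.2% performance degradation does not affect the estimate. The constant and function names are hypothetical.

```python
# First-order estimate only; every assumption here is ours, not the paper's.

# Figures quoted in the abstract.
RF_SHARE_OF_GPU_POWER = (0.20, 0.40)  # register file share of total GPU power
RF_POWER_REDUCTION = 0.66             # average register file power reduction
RF_POWER_AFTER_GATING = 0.25          # register file power relative to SRAM, with power gating


def gpu_power_saving(rf_share: float, rf_remaining: float) -> float:
    """Fraction of total GPU power saved, assuming only the register file changes."""
    return rf_share * (1.0 - rf_remaining)


def summarize() -> dict:
    """Return (low, high) total-GPU savings for both configurations."""
    remaining = 1.0 - RF_POWER_REDUCTION  # register file power left after a 66% cut
    return {
        "hybrid": tuple(
            gpu_power_saving(share, remaining) for share in RF_SHARE_OF_GPU_POWER
        ),
        "hybrid_gated": tuple(
            gpu_power_saving(share, RF_POWER_AFTER_GATING)
            for share in RF_SHARE_OF_GPU_POWER
        ),
    }


if __name__ == "__main__":
    for name, (low, high) in summarize().items():
        print(f"{name}: {low:.1%} to {high:.1%} of total GPU power saved")
```

Under these assumptions the estimate comes to roughly 13% to 26% of total GPU power for the hybrid design, and 15% to 30% with power gating. The paper's own evaluation should be consulted for measured whole-GPU results.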

Published In

DAC '15: Proceedings of the 52nd Annual Design Automation Conference
June 2015
1204 pages
ISBN: 9781450335201
DOI: 10.1145/2744769
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. general-purpose graphics processing unit (GPGPU)
  2. hybrid register file
  3. power
  4. spin-transfer torque random-access memory (STT-RAM)

Qualifiers

  • Research-article

Funding Sources

  • The Importation and Development of High-Caliber Talents Project of Beijing Municipal Institutions
  • 973 project
  • National Natural Science Foundation of China
  • Tsinghua University Initiative Scientific Research Program
  • Brain Inspired Computing Research, Tsinghua University

Conference

DAC '15
DAC '15: The 52nd Annual Design Automation Conference 2015
June 7 - 11, 2015
San Francisco, California

Acceptance Rates

Overall Acceptance Rate 1,770 of 5,499 submissions, 32%
