DOI: 10.1145/2540708.2540717

A locality-aware memory hierarchy for energy-efficient GPU architectures

Published: 07 December 2013

Abstract

As GPUs' compute capabilities grow, their memory hierarchy increasingly becomes a bottleneck. Current GPU memory hierarchies use coarse-grained memory accesses to exploit spatial locality, maximize peak bandwidth, simplify control, and reduce cache metadata storage. These coarse-grained memory accesses, however, are a poor match for emerging GPU applications with irregular control flow and memory access patterns. Meanwhile, the massive multi-threading of GPUs and the simplicity of their cache hierarchies make CPU-specific memory system enhancements ineffective for improving the performance of irregular GPU applications. We design and evaluate a locality-aware memory hierarchy for throughput processors, such as GPUs. Our proposed design retains the advantages of coarse-grained accesses for spatially and temporally local programs while permitting selective fine-grained access to memory. By adaptively adjusting the access granularity, the design reduces memory bandwidth and energy for data with low spatial/temporal locality without forfeiting the control-overhead and prefetching benefits of coarse-grained accesses for data with high spatial locality. As such, our locality-aware memory hierarchy improves GPU performance, energy efficiency, and memory throughput for a wide range of applications.
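
The paper describes a hardware mechanism inside the GPU memory hierarchy; as a rough illustration only, the C++ sketch below models one way an adaptive-granularity decision could work, not the authors' actual design. The per-region saturating counter, the 4 KB region size, the 32 B sector / 128 B line split, and the thresholds are all illustrative assumptions.

```cpp
// Minimal sketch (not the authors' design): a hypothetical per-region
// spatial-locality predictor that chooses between a coarse-grained
// 128 B line fetch and a fine-grained 32 B sector fetch on a cache miss.
#include <cstdint>
#include <iostream>
#include <unordered_map>

constexpr uint32_t kCoarseBytes = 128;  // full cache line
constexpr uint32_t kFineBytes   = 32;   // one sector of the line

// Hypothetical saturating counter per region, tracking how many sectors of
// recently fetched lines were actually referenced before eviction.
struct RegionPredictor {
    std::unordered_map<uint64_t, int> score;  // region -> saturating counter

    // Region = 4 KB page of the address space (assumption for illustration).
    static uint64_t regionOf(uint64_t addr) { return addr >> 12; }

    // Called at line eviction with the number of 32 B sectors that were touched.
    void update(uint64_t addr, int sectorsTouched) {
        int &s = score[regionOf(addr)];
        s += (sectorsTouched >= 3) ? 1 : -1;  // 3 of 4 sectors ~ high spatial locality
        if (s > 3)  s = 3;
        if (s < -3) s = -3;
    }

    // On a miss, predict the fetch granularity for this address.
    uint32_t predictGranularity(uint64_t addr) const {
        auto it = score.find(regionOf(addr));
        bool coarse = (it == score.end()) || (it->second >= 0);  // default to coarse
        return coarse ? kCoarseBytes : kFineBytes;
    }
};

int main() {
    RegionPredictor pred;
    uint64_t a = 0x1000, b = 0x20000;

    // Region containing 'a' shows poor spatial locality: only 1 sector used per line.
    for (int i = 0; i < 4; ++i) pred.update(a, /*sectorsTouched=*/1);
    // Region containing 'b' shows good spatial locality: all 4 sectors used.
    for (int i = 0; i < 4; ++i) pred.update(b, /*sectorsTouched=*/4);

    std::cout << "fetch for a: " << pred.predictGranularity(a) << " bytes\n";  // 32
    std::cout << "fetch for b: " << pred.predictGranularity(b) << " bytes\n";  // 128
    return 0;
}
```

In the actual hardware, such a decision would be made per memory request by the cache and memory controller; the software model above only illustrates the prediction logic behind choosing a coarse- or fine-grained access.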


    Information

    Published In

    MICRO-46: Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
    December 2013
    498 pages
    ISBN:9781450326384
    DOI:10.1145/2540708
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. GPU
    2. SIMD
    3. SIMT
    4. adaptive granularity memory
    5. fine-grained memory access
    6. irregular memory access patterns

    Qualifiers

    • Research-article

    Conference

    MICRO-46

    Acceptance Rates

    MICRO-46 Paper Acceptance Rate: 39 of 239 submissions, 16%
    Overall Acceptance Rate: 484 of 2,242 submissions, 22%
