DOI: 10.1145/2830772.2830830
Research article | Public access

WarpPool: sharing requests with inter-warp coalescing for throughput processors

Published: 05 December 2015

Abstract

Although graphics processing units (GPUs) are capable of high compute throughput, their memory systems need to supply the arithmetic pipelines with data at a sufficient rate to avoid stalls. For benchmarks that have divergent access patterns or cause the L1 cache to run out of resources, the link between the GPU's load/store unit and the L1 cache becomes a bottleneck in the memory system, leading to low utilization of compute resources. While current GPU memory systems are able to coalesce requests between threads in the same warp, we identify a form of spatial locality between threads in multiple warps. We use this locality, which is overlooked in current systems, to merge requests being sent to the L1 cache. This relieves the bottleneck between the load/store unit and the cache, and provides an opportunity to prioritize requests to minimize cache thrashing. Our implementation, WarpPool, yields a 38% speedup on memory throughput-limited kernels by increasing throughput to the L1 by 8% and reducing the number of L1 misses by 23%. We also demonstrate that WarpPool can improve GPU programmability by achieving high performance without the need to optimize workloads' memory access patterns. A Verilog implementation, including place-and-route, shows that WarpPool requires 1.0% added GPU area and 0.8% added power.
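
The inter-warp spatial locality described above can be made concrete with a small example. The following is a minimal CUDA sketch, not taken from the paper: a naive matrix multiply launched with 32x8 thread blocks, in which all eight warps of a block read the same 128-byte line of B on every loop iteration, so the L1 receives several redundant requests that an inter-warp coalescer could merge into one. The kernel name, block shape, and launch parameters are illustrative assumptions.

// Illustrative sketch (not from the paper): naive matrix multiply, 32x8 blocks.
// Each warp holds one value of threadIdx.y, so for B[k*n + col] the 32 threads
// of a warp cover one aligned 128-byte line, and all 8 warps of a block request
// the SAME line at each k -- redundant requests that inter-warp coalescing
// could merge before they reach the L1.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void matmul_naive(const float *A, const float *B, float *C, int n) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // varies across a warp
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // constant within a warp
    if (row >= n || col >= n) return;

    float acc = 0.0f;
    for (int k = 0; k < n; ++k) {
        // A[row*n + k]: one word, broadcast to the whole warp.
        // B[k*n + col]: one 128-byte line per warp; identical across the
        // block's 8 warps when they reach the same k.
        acc += A[row * n + k] * B[k * n + col];
    }
    C[row * n + col] = acc;
}

int main() {
    const int n = 256;
    size_t bytes = size_t(n) * n * sizeof(float);
    float *A, *B, *C;
    cudaMallocManaged(&A, bytes);
    cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C, bytes);
    for (int i = 0; i < n * n; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    dim3 block(32, 8);                 // each warp spans the full x dimension
    dim3 grid(n / block.x, n / block.y);
    matmul_naive<<<grid, block>>>(A, B, C, n);
    cudaDeviceSynchronize();

    printf("C[0] = %.1f (expected %.1f)\n", C[0], 2.0f * n);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}

A tiled version that stages B in shared memory removes this redundancy in software; the abstract's programmability claim is that WarpPool makes such manual restructuring of memory access patterns less necessary.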





    Published In

    MICRO-48: Proceedings of the 48th International Symposium on Microarchitecture
    December 2015
    787 pages
    ISBN: 9781450340342
    DOI: 10.1145/2830772

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. GPGPU
    2. memory coalescing
    3. memory divergence

    Acceptance Rates

    MICRO-48 Paper Acceptance Rate 61 of 283 submissions, 22%;
    Overall Acceptance Rate 484 of 2,242 submissions, 22%
