DOI: 10.1145/2830772.2830830
Research article | Public access

WarpPool: sharing requests with inter-warp coalescing for throughput processors

Published: 05 December 2015

Abstract

Although graphics processing units (GPUs) are capable of high compute throughput, their memory systems need to supply the arithmetic pipelines with data at a sufficient rate to avoid stalls. For benchmarks that have divergent access patterns or cause the L1 cache to run out of resources, the link between the GPU's load/store unit and the L1 cache becomes a bottleneck in the memory system, leading to low utilization of compute resources. While current GPU memory systems are able to coalesce requests between threads in the same warp, we identify a form of spatial locality between threads in multiple warps. We use this locality, which is overlooked in current systems, to merge requests being sent to the L1 cache. This relieves the bottleneck between the load/store unit and the cache, and provides an opportunity to prioritize requests to minimize cache thrashing. Our implementation, WarpPool, yields a 38% speedup on memory throughput-limited kernels by increasing throughput to the L1 by 8% and reducing the number of L1 misses by 23%. We also demonstrate that WarpPool can improve GPU programmability by achieving high performance without the need to optimize workloads' memory access patterns. A Verilog implementation, including place-and-route, shows that WarpPool requires 1.0% added GPU area and 0.8% added power.
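
The inter-warp spatial locality described above can be made concrete with a small example. The following is a minimal CUDA sketch, not taken from the paper: a naive matrix multiply launched with 32x8 thread blocks, in which all eight warps of a block read the same 128-byte line of B on every loop iteration, so the L1 receives several redundant requests that an inter-warp coalescer could merge into one. The kernel name, block shape, and launch parameters are illustrative assumptions.

// Illustrative sketch (not from the paper): naive matrix multiply, 32x8 blocks.
// Each warp holds one value of threadIdx.y, so for B[k*n + col] the 32 threads
// of a warp cover one aligned 128-byte line, and all 8 warps of a block request
// the SAME line at each k -- redundant requests that inter-warp coalescing
// could merge before they reach the L1.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void matmul_naive(const float *A, const float *B, float *C, int n) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // varies across a warp
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // constant within a warp
    if (row >= n || col >= n) return;

    float acc = 0.0f;
    for (int k = 0; k < n; ++k) {
        // A[row*n + k]: one word, broadcast to the whole warp.
        // B[k*n + col]: one 128-byte line per warp; identical across the
        // block's 8 warps when they reach the same k.
        acc += A[row * n + k] * B[k * n + col];
    }
    C[row * n + col] = acc;
}

int main() {
    const int n = 256;
    size_t bytes = size_t(n) * n * sizeof(float);
    float *A, *B, *C;
    cudaMallocManaged(&A, bytes);
    cudaMallocManaged(&B, bytes);
    cudaMallocManaged(&C, bytes);
    for (int i = 0; i < n * n; ++i) { A[i] = 1.0f; B[i] = 2.0f; }

    dim3 block(32, 8);                 // each warp spans the full x dimension
    dim3 grid(n / block.x, n / block.y);
    matmul_naive<<<grid, block>>>(A, B, C, n);
    cudaDeviceSynchronize();

    printf("C[0] = %.1f (expected %.1f)\n", C[0], 2.0f * n);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}

A tiled version that stages B in shared memory removes this redundancy in software; the abstract's programmability claim is that WarpPool makes such manual restructuring of memory access patterns less necessary.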





    Published In

    MICRO-48: Proceedings of the 48th International Symposium on Microarchitecture
    December 2015
    787 pages
    ISBN: 9781450340342
    DOI: 10.1145/2830772

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. GPGPU
    2. memory coalescing
    3. memory divergence

    Acceptance Rates

    MICRO-48 Paper Acceptance Rate 61 of 283 submissions, 22%;
    Overall Acceptance Rate 484 of 2,242 submissions, 22%
