Dynamic warp subdivision for integrated branch and memory divergence tolerance

Authors:

Kevin Skadron

ISCA '10: Proceedings of the 37th annual international symposium on Computer architecture

Pages 235 - 246

https://doi.org/10.1145/1815961.1815992

Published: 19 June 2010

Abstract

SIMD organizations amortize the area and power of fetch, decode, and issue logic across multiple processing units in order to maximize throughput for a given area and power budget. However, throughput is reduced when a set of threads operating in lockstep (a warp) is stalled by long-latency memory accesses. The resulting idle cycles are extremely costly. Multi-threading can hide latencies by interleaving the execution of multiple warps, but deep multi-threading using many warps dramatically increases the cost of the register files (multi-threading depth × SIMD width), and cache contention can make performance worse. Instead, intra-warp latency hiding should be exploited first. This allows threads that are ready but stalled by SIMD restrictions to use these idle cycles, and it reduces the need for multi-threading among warps. This paper introduces dynamic warp subdivision (DWS), which allows a single warp to occupy more than one slot in the scheduler without requiring extra register file space. Independent scheduling entities allow divergent branch paths to interleave their execution, and allow threads that hit in the cache to run ahead. The result is improved latency hiding and memory-level parallelism (MLP). We evaluate the technique on a coherent cache hierarchy with private L1 caches and a shared L2 cache. With an area overhead of less than 1%, experiments with eight data-parallel benchmarks show that our technique improves performance by 1.7X on average.
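
As a rough illustration of the idea in the abstract, the following Python sketch models a scheduler entry as a program counter plus a bit mask of member threads. At a load, it splits a warp into a run-ahead part (threads that hit) and a waiting part (threads that miss). All names here (`WarpSplit`, `subdivide_on_load`, `reconverge`) are hypothetical, and the policy is a toy assumption: the abstract does not say when the hardware subdivides or re-merges warps, so this should not be read as the paper's mechanism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WarpSplit:
    """One scheduler entry: a program counter and a thread mask.

    Only the mask and PC are stored per split; threads keep using the
    registers of their original warp, so no register state is copied.
    """
    pc: int
    mask: int  # bit i set => thread i belongs to this split


def subdivide_on_load(split: WarpSplit, hit_mask: int) -> list[WarpSplit]:
    """Toy policy (assumption): split a warp at a load by cache outcome.

    Threads that hit advance past the load and may run ahead;
    threads that miss stay at the load until their data returns.
    """
    hits = split.mask & hit_mask
    misses = split.mask & ~hit_mask
    result = []
    if hits:
        result.append(WarpSplit(split.pc + 1, hits))
    if misses:
        result.append(WarpSplit(split.pc, misses))
    return result


def reconverge(splits: list[WarpSplit]) -> list[WarpSplit]:
    """Merge splits that have reached the same PC back into one entry."""
    by_pc: dict[int, int] = {}
    for s in splits:
        by_pc[s.pc] = by_pc.get(s.pc, 0) | s.mask
    return [WarpSplit(pc, mask) for pc, mask in sorted(by_pc.items())]


# Example: an 8-wide warp at PC 10 where only the low four threads hit.
warp = WarpSplit(pc=10, mask=0b11111111)
entries = subdivide_on_load(warp, hit_mask=0b00001111)
```

In this sketch, `entries` holds two independently schedulable splits of the same warp, which mirrors the abstract's point that one warp can occupy more than one scheduler slot. A real design would also need rules for when splitting is worthwhile and where splits are allowed to re-merge.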

Index Terms

1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Single instruction, multiple data
      2. Systolic arrays

Published In

ISCA '10: Proceedings of the 37th annual international symposium on Computer architecture

June 2010

520 pages

ISBN:9781450300537

DOI:10.1145/1815961

General Chair: André Seznec (INRIA Rennes)

Program Chairs: Uri Weiser (Technion), Ronny Ronen (Intel)

ACM SIGARCH Computer Architecture News Volume 38, Issue 3
ISCA '10
June 2010
508 pages
ISSN:0163-5964
DOI:10.1145/1816038

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

In-Cooperation

IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ISCA '10

Sponsor:

SIGARCH

ISCA '10: The 37th Annual International Symposium on Computer Architecture

June 19 - 23, 2010

Saint-Malo, France

Acceptance Rates

Overall Acceptance Rate 543 of 3,203 submissions, 17%

Bibliometrics

Total citations: 218
Total downloads: 1,278
Downloads (last 12 months): 82
Downloads (last 6 weeks): 16

Reflects downloads up to 25 Dec 2024.