skip to main content
10.1145/1815961.1815992acmconferencesArticle/Chapter ViewAbstractPublication PagesiscaConference Proceedingsconference-collections
research-article

Dynamic warp subdivision for integrated branch and memory divergence tolerance

Published: 19 June 2010 Publication History

Abstract

SIMD organizations amortize the area and power of fetch, decode, and issue logic across multiple processing units in order to maximize throughput for a given area and power budget. However, throughput is reduced when a set of threads operating in lockstep (a warp) are stalled due to long latency memory accesses. The resulting idle cycles are extremely costly. Multi-threading can hide latencies by interleaving the execution of multiple warps, but deep multi-threading using many warps dramatically increases the cost of the register files (multi-threading depth x SIMD width), and cache contention can make performance worse. Instead, intra-warp latency hiding should first be exploited. This allows threads that are ready but stalled by SIMD restrictions to use these idle cycles and reduces the need for multi-threading among warps. This paper introduces dynamic warp subdivision (DWS), which allows a single warp to occupy more than one slot in the scheduler without requiring extra register file space. Independent scheduling entities allow divergent branch paths to interleave their execution, and allow threads that hit to run ahead. The result is improved latency hiding and memory level parallelism (MLP). We evaluate the technique on a coherent cache hierarchy with private L1 caches and a shared L2 cache. With an area overhead of less than 1%, experiments with eight data-parallel benchmarks show our technique improves performance on average by 1.7X.

References

[1]
NVIDIA's next generation CUDA compute architecture: Fermi. NVIDIA Corporation, 2009.
[2]
ATI. Radeon 9700 Pro. http://mirror.ati.com/products/pc/radeon9700pro, 2002.
[3]
N. L. Binkert, R. G. Dreslinski, L. R. Hsu, K. T. Lim, A. G. Saidi, and S. K. Reinhardt. The M5 simulator: Modeling networked systems. IEEE Micro, 26(4), 2006.
[4]
S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, and K. Skadron. A performance study of general purpose applications on graphics processors using CUDA. JPDC, 2008.
[5]
S. Choi and D. Yeung. Learning-based SMT processor resource distribution via hill-climbing. In ISCA, 2006.
[6]
J. D. Collins, D. M. Tullsen, H. Wang, and J. P. Shen. Dynamic speculative precomputation. In MICRO 34, 2001.
[7]
Intel Corporation. Intel AVX: New frontiers in performance improvements and energy efficiency, 2009.
[8]
NVIDIA Corporation. GeForce GTX 280 specifications. 2008.
[9]
R. Espasa, F. Ardanaz, J. Emer, S. Felix, J. Gago, R. Gramunt, I. Hern, T. Juan, G. Lowney, M Mattina, and A. Seznec. Tarantula: A vector extension to the Alpha architecture. In ISCA, 2002.
[10]
W. W. L. Fung, I. Sham, G. Yuan, and T. M. Aamodt. Dynamic warp formation and scheduling for efficient GPU control flow. In MICRO, 2007.
[11]
M. Gschwind. Chip multiprocessing and the Cell Broadband Engine. In CF, 2006.
[12]
U. J. Kapasi, J. Dally, W, S. Rixner, P. R. Mattson, J. D. Owens, and B. Khailany. Efficient conditional operations for data-parallel architectures. In MICRO 33, 2000.
[13]
C. Kozyrakis. A media-enhanced vector architecture for embedded memory systems. Technical report, University of California, Berkeley, 1999.
[14]
R. Krashinsky, C. Batten, M. Hampton, S. Gerding, B. Pharris, J. Casper, and K. Asanovic. The Vector-Thread architecture. In ISCA, 2004.
[15]
R. A. Lorie and H. R. Strong. Method for conditional branch execution in SIMD vector processors. US Patent 4,435,758, 1984.
[16]
J. Meng, J. W. Sheaffer, and K. Skadron. Exploiting inter-thread temporal locality for chip multithreading. In IPDPS, 2010.
[17]
J. Meng and K. Skadron. Avoiding cache thrashing due to private data placement in last-level cache for manycore scaling. In ICCD, 2007.
[18]
J. Meng, D. Tarjan, and K. Skadron. Dynamic warp subdivision for integrated branch and memory divergence tolerance: Extended results. U.Va. Tech. Report CS-2010-5, 2010.
[19]
R. Narayanan, B. Ozisikyilmaz, J. Zambreno, G. Memik, and A. Choudhary. Minebench: A benchmark suite for data mining workloads. IISWC, 2006.
[20]
Y. Nishikawa, M. Koibuchi, M. Yoshimi, K. Miura, and H. Amano. Performance improvement methodology for ClearSpeed's CSX600. In ICPP, 2007.
[21]
S. E. Orcutt. Implementation of permutation functions in illiac iv-type computers. IEEE Trans. Comput., 25(9), 1976.
[22]
M. K. Qureshi, D. N. Lynch, O. Mutlu, and Y. N. Patt. A case for MLP-aware cache replacement. In ISCA, 2006.
[23]
T. Ramirez, A. Pajuelo, O.J. Santana, and M. Valero. Runahead threads to improve SMT performance. HPCA, 2008.
[24]
S. Rivoire, R. Schultz, T. Okuda, and C. Kozyrakis. Vector lane threading. In ICPP, 2006.
[25]
S. Rixner, W. J. Dally, U. J. Kapasi, B. Khailany, A. López-Lagunas, P. R. Mattson, and J. D. Owens. A bandwidth-efficient architecture for media processing. In MICRO 31, 1998.
[26]
R. M. Russell. The CRAY-1 computer system. Commun. ACM, 21(1), 1978.
[27]
L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, and P. Hanrahan. Larrabee: a many-core x86 architecture for visual computing. ACM Trans. Graph., 27(3), 2008.
[28]
Y. Takahashi. A mechanism for SIMD execution of SPMD programs. In HPC-ASIA, 1997.
[29]
D. Talla and L. K. John. Cost-effective hardware acceleration of multimedia applications. In ICCD, 2001.
[30]
D. Tarjan, J. Meng, and K. Skadron. Increasing memory miss tolerance for SIMD cores. In SC, 2009.
[31]
D. M. Tullsen and J. A. Brown. Handling long-latency loads in a simultaneous multithreading processor. In MICRO 34, 2001.
[32]
S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 programs: Characterization and methodological considerations. ISCA, 1995.

Cited By

View all

Index Terms

  1. Dynamic warp subdivision for integrated branch and memory divergence tolerance

      Recommendations

      Comments

      Information & Contributors

      Information

      Published In

      cover image ACM Conferences
      ISCA '10: Proceedings of the 37th annual international symposium on Computer architecture
      June 2010
      520 pages
      ISBN:9781450300537
      DOI:10.1145/1815961
      • cover image ACM SIGARCH Computer Architecture News
        ACM SIGARCH Computer Architecture News  Volume 38, Issue 3
        ISCA '10
        June 2010
        508 pages
        ISSN:0163-5964
        DOI:10.1145/1816038
        Issue’s Table of Contents
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Sponsors

      In-Cooperation

      • IEEE CS

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 19 June 2010

      Permissions

      Request permissions for this article.

      Check for updates

      Author Tags

      1. branch divergence
      2. cache
      3. latency hiding
      4. memory divergence
      5. simd
      6. warp

      Qualifiers

      • Research-article

      Conference

      ISCA '10
      Sponsor:

      Acceptance Rates

      Overall Acceptance Rate 543 of 3,203 submissions, 17%

      Upcoming Conference

      ISCA '25

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)82
      • Downloads (Last 6 weeks)16
      Reflects downloads up to 25 Dec 2024

      Other Metrics

      Citations

      Cited By

      View all

      View Options

      Login options

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media