Extracting SIMD Parallelism from Recursive Task-Parallel Programs

Published: 26 December 2019

Abstract

The pursuit of computational efficiency has led to the proliferation of throughput-oriented hardware, from GPUs to increasingly wide vector units on commodity processors and accelerators. This hardware is designed to execute data-parallel computations in a vectorized manner efficiently. However, many algorithms are more naturally expressed as divide-and-conquer, recursive, task-parallel computations. In the absence of data parallelism, it seems that such algorithms are not well suited to throughput-oriented architectures. This article presents a set of novel code transformations that expose the data parallelism latent in recursive, task-parallel programs. These transformations facilitate straightforward vectorization of task-parallel programs on commodity hardware. We also present scheduling policies that maintain high utilization of vector resources while limiting space usage. Across several task-parallel benchmarks, we demonstrate both efficient vector resource utilization and substantial speedup on chips using Intel’s SSE4.2 vector units, as well as accelerators using Intel’s AVX512 units. We then show through rigorous sampling that, in practice, our vectorization techniques are effective for a much larger class of programs.
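
As a rough illustration of the idea, and not the article's actual transformation, the sketch below rewrites a recursive, task-parallel Fibonacci so that independent tasks are gathered into blocks. Every task in a block runs the same code on different arguments, which is the shape a vector unit can exploit. The names `fib_recursive`, `fib_blocked`, and `max_block` are hypothetical, and the depth-first order over blocks is an assumption made here to keep pending work small; it is not one of the article's scheduling policies.

```python
def fib_recursive(n):
    """Plain recursive form: the two calls are independent tasks."""
    if n < 2:
        return n
    return fib_recursive(n - 1) + fib_recursive(n - 2)


def fib_blocked(n, max_block=8):
    """Same computation, executed one block of tasks at a time.

    max_block (hypothetical parameter) caps how many tasks a block holds,
    standing in for a multiple of the vector width.
    """
    total = 0
    pending = [[n]]  # stack of task blocks; a block is a list of arguments
    while pending:
        block = pending.pop()  # most recent block first (depth-first over blocks)

        # Base cases: all tasks in the block apply the same test and reduction.
        total += sum(t for t in block if t < 2)

        # Recursive cases: each remaining task spawns two child tasks.
        children = []
        for t in block:
            if t >= 2:
                children.append(t - 1)
                children.append(t - 2)

        # Re-pack the children into blocks of at most max_block tasks.
        for i in range(0, len(children), max_block):
            pending.append(children[i:i + max_block])
    return total


if __name__ == "__main__":
    print(fib_blocked(10), fib_recursive(10))
```

In this sketch the loops over `block` are the data-parallel part: in a compiled language they would be the candidates for SIMD execution, while the choice of which block to run next determines how much space the pending tasks occupy.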

    Published In

    ACM Transactions on Parallel Computing  Volume 6, Issue 4
    December 2019
    188 pages
    ISSN:2329-4949
    EISSN:2329-4957
    DOI:10.1145/3372747
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 26 December 2019
    Accepted: 01 September 2019
    Revised: 01 September 2019
    Received: 01 July 2015

    Author Tags

    1. Recursive programs
    2. task parallelism
    3. vectorization

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • NSF
    • Battelle for DOE
    • U.S. Department of Energy's (DOE) Office of Science, Office of Advanced Scientific Computing Research, under DOE Early Career
