Low-Span Parallel Algorithms for the Binary-Forking Model
Pages 22 - 34
Abstract
The binary-forking model is a parallel computation model, formally defined by Blelloch et al., in which a thread can fork a concurrent child thread, recursively and asynchronously. The model incurs a cost of Θ(log n) to spawn or synchronize n tasks or threads. The binary-forking model realistically captures the performance of parallel algorithms implemented using modern multithreaded programming languages on multicore shared-memory machines. In contrast, the widely studied theoretical PRAM model does not consider the cost of spawning and synchronizing threads, and as a result, algorithms achieving optimal performance bounds in the PRAM model may not be optimal in the binary-forking model. Often, algorithms need to be redesigned to achieve optimal performance bounds in the binary-forking model, and the non-constant synchronization cost makes that redesign challenging.
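The Θ(log n) spawning cost can be illustrated with a small sequential simulation. The sketch below is not code from the paper; the function name `fork_tree_cost` and the halving strategy are assumptions chosen for illustration. It counts how many forks are performed in total and how many fork steps lie on the longest path when n tasks are launched using only two-way forks.

```python
def fork_tree_cost(n):
    """Return (forks, depth) for launching n tasks using only binary forks.

    Illustrative model, not code from the paper: each call that still has
    more than one task to launch performs one fork, handing half of the
    tasks to a child while the parent keeps the rest.
    """
    if n <= 1:
        return 0, 0  # a single task needs no fork
    left = n // 2
    right = n - left
    left_forks, left_depth = fork_tree_cost(left)
    right_forks, right_depth = fork_tree_cost(right)
    # Forks add up (work); the two halves run concurrently, so only the
    # deeper one counts toward the critical path (span).
    return 1 + left_forks + right_forks, 1 + max(left_depth, right_depth)


if __name__ == "__main__":
    for n in (1, 2, 8, 1000):
        forks, depth = fork_tree_cost(n)
        print(n, forks, depth)
```

Under this halving strategy the total number of forks is n - 1, while the longest chain of forks has length ⌈log2 n⌉, which is the logarithmic term that the PRAM model leaves out.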
In this paper, we show that in the binary-forking model we can achieve optimal or near-optimal span with negligible or no asymptotic blowup in work for comparison-based sorting, Strassen's matrix multiplication (MM), and the Fast Fourier Transform (FFT).
Our major results are as follows: (1) A randomized comparison-based sorting algorithm with optimal O(log n) span and O(n log n) work, both w.h.p. in n. (2) An optimal O(log n) span algorithm for Strassen's matrix multiplication (MM) with only a log log n-factor blow-up in work as well as a near-optimal O(log n log log log n) span algorithm with no asymptotic blow-up in work. (3) A near-optimal O(log n log log log n) span Fast Fourier Transform (FFT) algorithm with less than a log n-factor blow-up in work for all practical values of n (i.e., n ≤ 10^10,000).
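To get a feel for how small the log log log n factor in these bounds is, the following sketch evaluates the iterated logarithms at the upper end of the range mentioned above. It assumes base-2 logarithms, which the abstract does not specify, and the helper name `iterated_log2_factors` is hypothetical.

```python
import math


def iterated_log2_factors(decimal_exponent):
    """Return (log n, log log n, log log log n) for n = 10**decimal_exponent.

    Assumption: all logarithms are base 2. Working with the exponent
    avoids constructing the huge integer n itself.
    """
    log_n = decimal_exponent * math.log2(10)
    log_log_n = math.log2(log_n)
    log_log_log_n = math.log2(log_log_n)
    return log_n, log_log_n, log_log_log_n


if __name__ == "__main__":
    log_n, log_log_n, log_log_log_n = iterated_log2_factors(10_000)
    print(f"log n         = {log_n:.2f}")
    print(f"log log n     = {log_log_n:.2f}")
    print(f"log log log n = {log_log_log_n:.2f}")
```

For n = 10^10,000 this gives log n of about 33,219, log log n of about 15, and log log log n just under 4, so under the base-2 assumption the extra factor in the near-optimal span bounds stays below 4 across that whole range.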
Information
Published In
July 2021
463 pages
Copyright © 2021 ACM.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Sponsors
- SIGACT: ACM Special Interest Group on Algorithms and Computation Theory
- SIGARCH: ACM Special Interest Group on Computer Architecture
- EATCS: European Association for Theoretical Computer Science
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
Published: 06 July 2021
Qualifiers
- Research-article
Funding Sources
- NSF
- XSEDE
Conference
SPAA '21: 33rd ACM Symposium on Parallelism in Algorithms and Architectures
July 6 - 8, 2021
Virtual Event, USA
Acceptance Rates
Overall Acceptance Rate 447 of 1,461 submissions, 31%