DOI: 10.1145/3409964.3461802

Low-Span Parallel Algorithms for the Binary-Forking Model

Published: 06 July 2021

Abstract

The binary-forking model is a parallel computation model, formally defined by Blelloch et al., in which a thread can fork a concurrent child thread, recursively and asynchronously. The model incurs a cost of Θ(log n) to spawn or synchronize n tasks or threads. The binary-forking model realistically captures the performance of parallel algorithms implemented using modern multithreaded programming languages on multicore shared-memory machines. In contrast, the widely studied theoretical PRAM model does not consider the cost of spawning and synchronizing threads, and as a result, algorithms achieving optimal performance bounds in the PRAM model may not be optimal in the binary-forking model. Often, algorithms need to be redesigned to achieve optimal performance bounds in the binary-forking model, and the non-constant synchronization cost makes the task challenging.
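To illustrate the Θ(log n) spawning cost described above, the following is a minimal Python sketch that simulates binary forking rather than running real threads. The function names `spawn_span` and `spawn_work` are hypothetical and do not come from the paper. The sketch assumes that each fork costs one unit and that a thread splits its remaining tasks as evenly as possible.

```python
def spawn_span(n: int) -> int:
    """Longest chain of binary forks needed to start n leaf tasks."""
    if n <= 1:
        return 0  # a single task needs no fork
    left = n // 2
    right = n - left
    # One fork splits the tasks into two groups that proceed in parallel,
    # so the span is one fork plus the deeper of the two subtrees.
    return 1 + max(spawn_span(left), spawn_span(right))


def spawn_work(n: int) -> int:
    """Total number of forks executed to start n leaf tasks."""
    if n <= 1:
        return 0
    left = n // 2
    # Every fork is counted once, regardless of which thread performs it.
    return 1 + spawn_work(left) + spawn_work(n - left)
```

Under these assumptions, `spawn_span(n)` equals ⌈log₂ n⌉ and `spawn_work(n)` equals n − 1. For example, `spawn_span(1_000_000)` evaluates to 20, which is the kind of logarithmic overhead that the PRAM model does not account for.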
In this paper, we show that in the binary-forking model we can achieve optimal or near-optimal span with negligible or no asymptotic blowup in work for comparison-based sorting, Strassen's matrix multiplication (MM), and the Fast Fourier Transform (FFT).
Our major results are as follows: (1) A randomized comparison-based sorting algorithm with optimal O(log n) span and O(n log n) work, both w.h.p. in n. (2) An optimal O(log n) span algorithm for Strassen's matrix multiplication (MM) with only a log log n-factor blow-up in work, as well as a near-optimal O(log n log log log n) span algorithm with no asymptotic blow-up in work. (3) A near-optimal O(log n log log log n) span Fast Fourier Transform (FFT) algorithm with less than a log n-factor blow-up in work for all practical values of n (i.e., n ≤ 10^10,000).
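To give a sense of scale for the log log log n factor that appears in results (2) and (3), the sketch below evaluates it at the upper limit n = 10^10,000 quoted above. It assumes base-2 logarithms, which the abstract does not specify, and the helper name `triple_log2_of_power_of_ten` is hypothetical.

```python
import math


def triple_log2_of_power_of_ten(exponent: int) -> float:
    """Return log2(log2(log2(10**exponent))) without building the huge integer."""
    # log2(10**exponent) = exponent * log2(10)
    log_n = exponent * math.log2(10)
    return math.log2(math.log2(log_n))


# Assumption: base-2 logs; n = 10^10,000 is the limit stated in the abstract.
factor = triple_log2_of_power_of_ten(10_000)
print(f"log log log n at n = 10^10,000: {factor:.2f}")
```

Under this assumption the factor stays slightly below 4 even at n = 10^10,000, which is consistent with describing an O(log n log log log n) span as near-optimal relative to O(log n).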

References

[1]
Umut A Acar, Guy E Blelloch, and Robert D Blumofe. 2000. The data locality of work stealing. In Proceedings of the ACM Symposium on Parallel Algorithms and Architectures. 1--12.
[2]
Kunal Agrawal, Jeremy T Fineman, Kefu Lu, Brendan Sheridan, Jim Sukha, and Robert Utterback. 2014. Provably good scheduling for parallel programs that use data structures through implicit batching. In Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures. 84--95.
[3]
Naama Ben-David, Guy E Blelloch, Jeremy T Fineman, Phillip B Gibbons, Yan Gu, Charles McGuffey, and Julian Shun. 2016. Parallel algorithms for asymmetric read-write costs. In Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures. 145--156.
[4]
Guy E Blelloch, Rezaul Chowdhury, Phillip B Gibbons, Vijaya Ramachandran, Shimin Chen, and Michael Kozuch. 2008. Provably good multicore cache performance for divide-and-conquer algorithms. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms. 501--510.
[5]
Guy E Blelloch, Jeremy T Fineman, Phillip B Gibbons, and Harsha Vardhan Simhadri. 2011. Scheduling irregular parallel computations on hierarchical caches. In Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures. 355--366.
[6]
Guy E Blelloch, Jeremy T Fineman, Yan Gu, and Yihan Sun. 2020. Optimal parallel algorithms in the binary-forking model. In Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures. 89--102.
[7]
Guy E Blelloch and Phillip B Gibbons. 2004. Effectively sharing a cache among threads. In Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures. 235--244.
[8]
Guy E Blelloch, Phillip B Gibbons, and Harsha Vardhan Simhadri. 2010. Low depth cache-oblivious algorithms. In Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures. 189--199.
[9]
Robert D Blumofe and Charles E Leiserson. 1998. Space-efficient scheduling of multithreaded computations. SIAM J. Comput. 27, 1 (1998), 202--229.
[10]
Georg Bruun. 1978. z-transform DFT filters and FFT's. IEEE Transactions on Acoustics, Speech, and Signal Processing 26, 1 (1978), 56--63.
[11]
Richard Cole. 1988. Parallel merge sort. SIAM J. Comput. 17, 4 (1988), 770--785.
[12]
Richard Cole and Vijaya Ramachandran. 2017. Resource oblivious sorting on multicores. ACM Transactions on Parallel Computing 3, 4 (2017), 1--31.
[13]
James W Cooley and John W Tukey. 1965. An algorithm for the machine calculation of complex Fourier series. Math. Comp. 19, 90 (1965), 297--301.
[14]
Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. 2009. Introduction to Algorithms. MIT Press.
[15]
Rathish Das. 2021. Algorithmic Foundation of Parallel Paging and Scheduling under Memory Constraints. Ph.D. Dissertation. Stony Brook University.
[16]
Rathish Das, Shih-Yu Tsai, Sharmila Duppala, Jayson Lynch, Esther M Arkin, Rezaul Chowdhury, Joseph SB Mitchell, and Steven Skiena. 2019. Data races and the discrete resource-time tradeoff problem with resource reuse over paths. In Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures. 359--368.
[17]
Jack Dongarra and Francis Sullivan. 2000. Guest editors' introduction: The top 10 algorithms. IEEE Annals of the History of Computing 2, 01 (2000), 22--23.
[18]
Fork/Join (Oracle Java Documentation). http://docs.oracle.com/javase/tutorial/essential/concurrency/forkjoin.html.
[19]
W Donald Frazer and Archie C McKellar. 1970. Samplesort: A sampling approach to minimal storage tree sorting. J. ACM 17, 3 (1970), 496--507.
[20]
Matteo Frigo, Charles E Leiserson, and Keith H Randall. 1998. The implementation of the Cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. 212--223.
[21]
Phillip B Gibbons, Yossi Matias, and Vijaya Ramachandran. 1996. The queue-read queue-write asynchronous PRAM model. In Proceedings of the European Conference on Parallel Processing. Springer, 277--292.
[22]
Irving John Good. 1958. The interaction algorithm and practical Fourier analysis. Journal of the Royal Statistical Society: Series B (Methodological) 20, 2 (1958), 361--372.
[23]
Michael T Goodrich, Riko Jacob, and Nodari Sitchinava. 2021. Atomic power in forks: A super-logarithmic lower bound for implementing butterfly networks in the nonatomic binary fork-join model. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms. 2141--2153.
[24]
J. JaJa. 1997. An Introduction to Parallel Algorithms. Addison Wesley.
[25]
Steven G Johnson and Matteo Frigo. 2006. A modified split-radix FFT with fewer arithmetic operations. IEEE Transactions on Signal Processing 55, 1 (2006), 111--119.
[26]
Roland C Le Bail. 1972. Use of fast Fourier transforms for solving partial differential equations in physics. J. Comput. Phys. 9, 3 (1972), 440--465.
[27]
V. Y. Pan. 1978. Strassen's algorithm is not optimal trilinear technique of aggregating, uniting and canceling for constructing fast algorithms for matrix operations. In Proceedings of the Symposium on Foundations of Computer Science. 166--176.
[28]
Vijaya Ramachandran and Elaine Shi. 2021. Data oblivious algorithms for multicores. In Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures (To Appear).
[29]
Daniel N Rockmore. 2000. The FFT: An algorithm the whole family can use. Computing in Science & Engineering 2, 1 (2000), 60--64.
[30]
Volker Strassen. 1969. Gaussian elimination is not optimal. Numer. Math. 13, 4 (1969), 354--356.
[31]
Task Parallel Library (TPL). https://msdn.microsoft.com/en-us/library/dd460717.
[32]
Llewellyn H Thomas. 1963. Using a computer to solve problems in physics. Applications of Digital Computers (1963), 44--45.
[33]
Threading Building Blocks (TBB). https://www.threadingbuildingblocks.org.
[34]
Shmuel Winograd. 1978. On computing the discrete Fourier transform. Math. Comp. 32, 141 (1978), 175--199.

Published In

SPAA '21: Proceedings of the 33rd ACM Symposium on Parallelism in Algorithms and Architectures
July 2021
463 pages
ISBN:9781450380706
DOI:10.1145/3409964
General Chair: Kunal Agrawal
Program Chair: Yossi Azar

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. binary-forking model
  2. fast fourier transform (fft)
  3. parallel computation
  4. sorting
  5. strassen's matrix multiplication

Qualifiers

  • Research-article

Funding Sources

  • NSF
  • XSEDE

Conference

SPAA '21
Acceptance Rates

Overall Acceptance Rate 447 of 1,461 submissions, 31%
