research-article

Low-Span Parallel Algorithms for the Binary-Forking Model

Authors:

Rezaul Chowdhury,

Pramod Ganapathi,

Mohammad Mahdi Javanmard

SPAA '21: Proceedings of the 33rd ACM Symposium on Parallelism in Algorithms and Architectures

Pages 22 - 34

https://doi.org/10.1145/3409964.3461802

Published: 06 July 2021

Abstract

The binary-forking model is a parallel computation model, formally defined by Blelloch et al., in which a thread can fork a concurrent child thread, recursively and asynchronously. The model incurs a cost of Θ(log n) to spawn or synchronize n tasks or threads. The binary-forking model realistically captures the performance of parallel algorithms implemented using modern multithreaded programming languages on multicore shared-memory machines. In contrast, the widely studied theoretical PRAM model does not consider the cost of spawning and synchronizing threads, and as a result, algorithms achieving optimal performance bounds in the PRAM model may not be optimal in the binary-forking model. Often, algorithms need to be redesigned to achieve optimal performance bounds in the binary-forking model, and the non-constant synchronization cost makes the task challenging.
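
To make the Θ(log n) spawning cost concrete, the following is a minimal sketch, not taken from the paper, that counts the depth of the fork tree when each fork may create only one child thread. The function name `fork_depth` and the even-split strategy are assumptions made for illustration.

```python
def fork_depth(n: int) -> int:
    """Depth of a binary fork tree that starts n tasks (n >= 1).

    Illustrative assumption: each fork splits the remaining range of
    tasks as evenly as possible between the parent and the new child.
    """
    if n <= 1:
        return 0  # a single task needs no further forking
    left = n // 2
    right = n - left
    # The two halves proceed in parallel, so the depth is one fork
    # plus the deeper of the two subtrees.
    return 1 + max(fork_depth(left), fork_depth(right))


if __name__ == "__main__":
    for n in (1, 2, 8, 1000, 10**6):
        print(n, fork_depth(n))
```

With an even split the depth works out to ⌈log2 n⌉, so starting (or, symmetrically, joining) n tasks already costs logarithmic span before any useful work is done.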

In this paper, we show that in the binary-forking model we can achieve optimal or near-optimal span with negligible or no asymptotic blowup in work for comparison-based sorting, Strassen's matrix multiplication (MM), and the Fast Fourier Transform (FFT).

Our major results are as follows: (1) A randomized comparison-based sorting algorithm with optimal O(log n) span and O(n log n) work, both w.h.p. in n. (2) An optimal O(log n) span algorithm for Strassen's matrix multiplication (MM) with only a log log n-factor blow-up in work as well as a near-optimal O(log n log log log n) span algorithm with no asymptotic blow-up in work. (3) A near-optimal O(log n log log log n) span Fast Fourier Transform (FFT) algorithm with less than a log n-factor blow-up in work for all practical values of n (i.e., n ≤ 10^10,000).
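
As a rough sanity check on the size of the log log log n factor in the span bounds above, the sketch below evaluates it at the upper end of the stated practical range, n = 10^10,000. It assumes base-2 logarithms (the abstract does not fix a base), and the helper name `triple_log2_of_power_of_ten` is hypothetical.

```python
import math


def triple_log2_of_power_of_ten(exponent: int) -> float:
    """Return log2(log2(log2(10**exponent))).

    Works on the exponent directly so that 10**exponent never has to
    be converted to a float (it would overflow for large exponents).
    """
    log2_n = exponent * math.log2(10)   # log2(10**exponent)
    return math.log2(math.log2(log2_n))


if __name__ == "__main__":
    # Evaluates the triple logarithm at n = 10^10,000 (roughly 3.91).
    print(triple_log2_of_power_of_ten(10_000))
```

Under these assumptions the value is a little under 4, which suggests that the triple-log factor behaves like a small constant across the practical range.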

References

[1]

Umut A Acar, Guy E Blelloch, and Robert D Blumofe. 2000. The data locality of work stealing. In Proceedings of the ACM Symposium on Parallel Algorithms and Architectures. 1--12.

[2]

Kunal Agrawal, Jeremy T Fineman, Kefu Lu, Brendan Sheridan, Jim Sukha, and Robert Utterback. 2014. Provably good scheduling for parallel programs that use data structures through implicit batching. In Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures. 84--95.

[3]

Naama Ben-David, Guy E Blelloch, Jeremy T Fineman, Phillip B Gibbons, Yan Gu, Charles McGuffey, and Julian Shun. 2016. Parallel algorithms for asymmetric read-write costs. In Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures. 145--156.

[4]

Guy E Blelloch, Rezaul Chowdhury, Phillip B Gibbons, Vijaya Ramachandran, Shimin Chen, and Michael Kozuch. 2008. Provably good multicore cache performance for divide-and-conquer algorithms. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms. 501--510.

[5]

Guy E Blelloch, Jeremy T Fineman, Phillip B Gibbons, and Harsha Vardhan Simhadri. 2011. Scheduling irregular parallel computations on hierarchical caches. In Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures. 355--366.

[6]

Guy E Blelloch, Jeremy T Fineman, Yan Gu, and Yihan Sun. 2020. Optimal parallel algorithms in the binary-forking model. In Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures. 89--102.

[7]

Guy E Blelloch and Phillip B Gibbons. 2004. Effectively sharing a cache among threads. In Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures. 235--244.

[8]

Guy E Blelloch, Phillip B Gibbons, and Harsha Vardhan Simhadri. 2010. Low depth cache-oblivious algorithms. In Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures. 189--199.

[9]

Robert D Blumofe and Charles E Leiserson. 1998. Space-efficient scheduling of multithreaded computations. SIAM J. Comput. 27, 1 (1998), 202--229.

[10]

Georg Bruun. 1978. z-transform DFT filters and FFT's. IEEE Transactions on Acoustics, Speech, and Signal Processing 26, 1 (1978), 56--63.

[11]

Richard Cole. 1988. Parallel merge sort. SIAM J. Comput. 17, 4 (1988), 770--785.

[12]

Richard Cole and Vijaya Ramachandran. 2017. Resource oblivious sorting on multicores. ACM Transactions on Parallel Computing 3, 4 (2017), 1--31.

[13]

James W Cooley and John W Tukey. 1965. An algorithm for the machine calculation of complex Fourier series. Math. Comp. 19, 90 (1965), 297--301.

[14]

Thomas H Cormen, Charles E Leiserson, Ronald L Rivest, and Clifford Stein. 2009. Introduction to Algorithms. MIT Press.

[15]

Rathish Das. 2021. Algorithmic Foundation of Parallel Paging and Scheduling under Memory Constraints. Ph.D. Dissertation. Stony Brook University.

[16]

Rathish Das, Shih-Yu Tsai, Sharmila Duppala, Jayson Lynch, Esther M Arkin, Rezaul Chowdhury, Joseph SB Mitchell, and Steven Skiena. 2019. Data races and the discrete resource-time tradeoff problem with resource reuse over paths. In Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures. 359--368.

[17]

Jack Dongarra and Francis Sullivan. 2000. Guest editors' introduction: The top 10 algorithms. IEEE Annals of the History of Computing 2, 01 (2000), 22--23.

[18]

Fork/Join (Oracle Java Documentation). http://docs.oracle.com/javase/tutorial/essential/concurrency/forkjoin.html.

[19]

W Donald Frazer and Archie C McKellar. 1970. Samplesort: A sampling approach to minimal storage tree sorting. J. ACM 17, 3 (1970), 496--507.

[20]

Matteo Frigo, Charles E Leiserson, and Keith H Randall. 1998. The implementation of the Cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. 212--223.

[21]

Phillip B Gibbons, Yossi Matias, and Vijaya Ramachandran. 1996. The queue-read queue-write asynchronous PRAM model. In Proceedings of the European Conference on Parallel Processing. Springer, 277--292.

[22]

Irving John Good. 1958. The interaction algorithm and practical Fourier analysis. Journal of the Royal Statistical Society: Series B (Methodological) 20, 2 (1958), 361--372.

[23]

Michael T Goodrich, Riko Jacob, and Nodari Sitchinava. 2021. Atomic power in forks: A super-logarithmic lower bound for implementing butterfly networks in the nonatomic binary fork-join model. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms. 2141--2153.

[24]

J. JaJa. 1997. An Introduction to Parallel Algorithms. Addison Wesley.

[25]

Steven G Johnson and Matteo Frigo. 2006. A modified split-radix FFT with fewer arithmetic operations. IEEE Transactions on Signal Processing 55, 1 (2006), 111--119.

[26]

Roland C Le Bail. 1972. Use of fast Fourier transforms for solving partial differential equations in physics. J. Comput. Phys. 9, 3 (1972), 440--465.

[27]

V. Y. Pan. 1978. Strassen's algorithm is not optimal: Trilinear technique of aggregating, uniting and canceling for constructing fast algorithms for matrix operations. In Proceedings of the Symposium on Foundations of Computer Science. 166--176.

[28]

Vijaya Ramachandran and Elaine Shi. 2021. Data oblivious algorithms for multicores. In Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures (To Appear).

[29]

Daniel N Rockmore. 2000. The FFT: An algorithm the whole family can use. Computing in Science & Engineering 2, 1 (2000), 60--64.

[30]

Volker Strassen. 1969. Gaussian elimination is not optimal. Numer. Math. 13, 4 (1969), 354--356.

[31]

Task Parallel Library (TPL). https://msdn.microsoft.com/en-us/library/dd460717.

[32]

Llewellyn H Thomas. 1963. Using a computer to solve problems in physics. Applications of Digital Computers (1963), 44--45.

[33]

Threading Building Blocks (TBB). https://www.threadingbuildingblocks.org.

[34]

Shmuel Winograd. 1978. On computing the discrete Fourier transform. Math. Comp. 32, 141 (1978), 175--199.

Index Terms

Low-Span Parallel Algorithms for the Binary-Forking Model
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel algorithms
      1. Shared memory algorithms
2. Theory of computation
  1. Design and analysis of algorithms
    1. Algorithm design techniques
      1. Divide and conquer
    2. Parallel algorithms
      1. Shared memory algorithms

Information

Published In

SPAA '21: Proceedings of the 33rd ACM Symposium on Parallelism in Algorithms and Architectures

July 2021

463 pages

ISBN: 9781450380706

DOI: 10.1145/3409964

General Chair: Kunal Agrawal (Washington University in St. Louis, USA)

Program Chair: Yossi Azar (Tel Aviv University, Israel)

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGACT: ACM Special Interest Group on Algorithms and Computation Theory
SIGARCH: ACM Special Interest Group on Computer Architecture
EATCS: European Association for Theoretical Computer Science

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

NSF
XSEDE

Conference

SPAA '21

SPAA '21: 33rd ACM Symposium on Parallelism in Algorithms and Architectures

July 6 - 8, 2021

Virtual Event, USA

Acceptance Rates

Overall Acceptance Rate 447 of 1,461 submissions, 31%
