DOI: 10.1145/3409964.3461796

Pebbles, Graphs, and a Pinch of Combinatorics: Towards Tight I/O Lower Bounds for Statically Analyzable Programs

Published: 06 July 2021

Abstract

Determining I/O lower bounds is a crucial step in obtaining communication-efficient parallel algorithms, both across the memory hierarchy and between processors. Current approaches either study specific algorithms individually, disallow programmatic motifs such as recomputation, or produce asymptotic bounds that exclude important constants. We propose a novel approach for obtaining precise I/O lower bounds on a general class of programs, which we call Simple Overlap Access Programs (SOAP). SOAP analysis covers a wide variety of algorithms, from ubiquitous computational kernels to full scientific computing applications. Using the red-blue pebble game and combinatorial methods, we are able to bound the I/O of the SOAP-induced Computational Directed Acyclic Graph (CDAG), taking into account multiple statements, input/output reuse, and optimal tiling. To deal with programs that are outside of our representation (e.g., non-injective access functions), we describe methods to approximate them with SOAP. To demonstrate our method, we analyze 38 different applications, including kernels from the Polybench benchmark suite, deep learning operators, and --- for the first time --- applications in unstructured physics simulations, numerical weather prediction stencil compositions, and full deep neural networks. We derive tight I/O bounds for several linear algebra kernels, such as Cholesky decomposition, improving the existing reported bounds by a factor of two. For stencil applications, we improve the existing bounds by a factor of up to 14. We implement our method as an open-source tool, which can derive lower bounds directly from provided C code.
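
To make the model behind these bounds concrete, the following is a minimal, illustrative Python sketch and not the paper's open-source tool (which derives bounds directly from C code). It plays one legal, greedy strategy of Hong and Kung's red-blue pebble game on a small hypothetical CDAG and counts the I/O that this particular strategy incurs; the bounds in the paper instead hold for every legal strategy. The toy graph, the function names, and the LRU eviction policy are all assumptions made purely for illustration.

# Illustrative sketch -- NOT the authors' tool. It simulates one legal, greedy strategy
# of Hong and Kung's red-blue pebble game on a toy CDAG and counts the I/O it performs.
# A red pebble marks a value in fast memory (capacity S); a blue pebble marks a value in
# slow memory. Moving a pebble between colors (load or store) is the I/O being counted.

from collections import OrderedDict


def build_toy_cdag(n):
    """Toy CDAG for an elementwise product C[i] = A[i] * B[i] followed by a sum reduction."""
    inputs = [f"A{i}" for i in range(n)] + [f"B{i}" for i in range(n)]
    cdag = OrderedDict()                     # computed node -> list of its predecessors
    for i in range(n):
        cdag[f"C{i}"] = [f"A{i}", f"B{i}"]   # multiply
    prev = "C0"
    for i in range(1, n):
        cdag[f"S{i}"] = [prev, f"C{i}"]      # running sum
        prev = f"S{i}"
    return inputs, cdag, prev                # prev is the single output node


def greedy_pebbling_io(inputs, cdag, output, S):
    """Pebble the CDAG in the given topological order with S red pebbles, loading missing
    operands and evicting the least-recently-used red pebble when fast memory is full.
    Requires S >= 3 here, since every computed node has at most two predecessors."""
    red = OrderedDict()      # nodes holding a red pebble, kept in LRU order
    blue = set(inputs)       # nodes holding a blue pebble (inputs start in slow memory)
    io = 0

    def make_room():
        nonlocal io
        while len(red) > S - 1:                   # keep one slot free for the next pebble
            victim, _ = red.popitem(last=False)   # evict the least recently used node
            if victim not in blue:
                blue.add(victim)                  # store: red -> blue
                io += 1

    for v, preds in cdag.items():
        for p in preds:
            if p not in red:
                make_room()
                red[p] = True                     # load: blue -> red
                io += 1
            else:
                red.move_to_end(p)                # mark operand as recently used
        make_room()                               # both predecessors stay red for S >= 3
        red[v] = True                             # compute rule: place a red pebble on v
    if output not in blue:
        io += 1                                   # store the final output
    return io


if __name__ == "__main__":
    inputs, cdag, out = build_toy_cdag(n=8)
    for S in (3, 4, 8, 16):
        io = greedy_pebbling_io(inputs, cdag, out, S)
        print(f"S = {S:2d} red pebbles: this schedule performs {io} I/O operations")

Running the sketch for several fast-memory sizes S shows, qualitatively, the trade-off the paper quantifies exactly: with more red pebbles (a larger fast memory), the schedule spills fewer intermediate values and therefore performs fewer loads and stores.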


Published In

SPAA '21: Proceedings of the 33rd ACM Symposium on Parallelism in Algorithms and Architectures
July 2021
463 pages
ISBN:9781450380706
DOI:10.1145/3409964
  • General Chair: Kunal Agrawal
  • Program Chair: Yossi Azar

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 July 2021


Author Tags

  1. i/o complexity
  2. parallel scheduling model
  3. red-blue pebble game

Qualifiers

  • Research-article

Conference

SPAA '21

Acceptance Rates

Overall Acceptance Rate 447 of 1,461 submissions, 31%

Article Metrics

  • Downloads (Last 12 months): 46
  • Downloads (Last 6 weeks): 3
Reflects downloads up to 01 Jan 2025

Cited By

  • (2024) "Brief Announcement: Red-Blue Pebbling with Multiple Processors: Time, Communication and Memory Trade-offs." Proceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures, pp. 285-287. DOI: 10.1145/3626183.3660269. Online publication date: 17-Jun-2024.
  • (2024) "Parallel and Distributed Graph Neural Networks: An In-Depth Concurrency Analysis." IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 5, pp. 2584-2606. DOI: 10.1109/TPAMI.2023.3303431. Online publication date: May-2024.
  • (2023) "Demystifying Graph Databases: Analysis and Taxonomy of Data Organization, System Designs, and Graph Queries." ACM Computing Surveys, vol. 56, no. 2, pp. 1-40. DOI: 10.1145/3604932. Online publication date: 15-Sep-2023.
  • (2023) "High-Performance and Programmable Attentional Graph Neural Networks with Global Tensor Formulations." Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-16. DOI: 10.1145/3581784.3607067. Online publication date: 12-Nov-2023.
  • (2023) "Bridging Control-Centric and Data-Centric Optimization." Proceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization, pp. 173-185. DOI: 10.1145/3579990.3580018. Online publication date: 17-Feb-2023.
  • (2022) "Deinsum: Practically I/O Optimal Multi-Linear Algebra." SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-15. DOI: 10.1109/SC41404.2022.00030. Online publication date: Nov-2022.
  • (2022) "I/O-Optimal Cache-Oblivious Sparse Matrix-Sparse Matrix Multiplication." 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 36-46. DOI: 10.1109/IPDPS53621.2022.00013. Online publication date: May-2022.
