A systematic approach to improving data locality across Fourier transforms and linear algebra operations

Authors:

Doru Thom Popovici,

Andrew Canning,

John Shalf

ICS '21: Proceedings of the 35th ACM International Conference on Supercomputing

Pages 329 - 341

https://doi.org/10.1145/3447818.3460354

Published: 04 June 2021

Abstract

The performance of most scientific applications depends on efficient mathematical libraries. For example, the plane-wave-based Density Functional Theory approach for electronic structure calculations uses highly optimized libraries for Fourier transforms, dense linear algebra (orthogonalization), and sparse linear algebra (non-local projectors in real space). Although vendor-tuned libraries offer efficient implementations of each standalone mathematical kernel, partitioning the computation into sequentially invoked kernels inhibits cross-kernel optimizations that could improve data locality across memory-bound operations. In this work we show that, by expressing these kernels as operations on high-dimensional tensors, cross-kernel dataflow optimizations that span FFT, dense, and sparse linear algebra can be readily exposed and exploited. We outline a systematic way of merging the Fourier transforms with the linear algebra computations, improving data locality and reducing data movement to main memory. We show that this streaming/dataflow approach offers a 2x speedup on GPUs and 8x/12x speedups on CPUs over a baseline code that uses vendor-optimized libraries. Although we use Density Functional Theory to demonstrate the value of our approach, our methodology is broadly applicable to other applications that use Fourier transforms and linear algebra operations as building blocks.
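To make the idea of merging kernels concrete, the following is a minimal sketch, not the authors' implementation. It contrasts a staged pipeline (three separately invoked kernels, each sweeping the whole batch and materializing a full intermediate) with a fused pipeline in which each vector passes through the forward transform, a pointwise scaling, and the inverse transform before the next vector is touched. The function names `staged` and `fused`, the naive `dft` stand-in for a library FFT, and the pointwise scaling used as the linear algebra step are all assumptions made for illustration.

```python
import cmath

def dft(x, sign=-1):
    """Naive O(n^2) DFT, used here only as a stand-in for a library FFT call."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(sign * 2j * cmath.pi * m * k / n) for m in range(n))
        for k in range(n)
    ]

def idft(x):
    """Inverse DFT: conjugate-sign transform followed by 1/n normalization."""
    n = len(x)
    return [v / n for v in dft(x, sign=+1)]

def staged(batch, weights):
    """Hypothetical baseline: three kernels invoked one after another.
    Each stage reads and writes the full batch, so every intermediate
    makes a round trip through memory."""
    spectra = [dft(v) for v in batch]
    scaled = [[s * w for s, w in zip(spec, weights)] for spec in spectra]
    return [idft(spec) for spec in scaled]

def fused(batch, weights):
    """Hypothetical fused version: each vector goes through all three
    stages while its data is still local, and no batch-sized
    intermediate is ever materialized."""
    out = []
    for v in batch:
        spec = dft(v)
        spec = [s * w for s, w in zip(spec, weights)]
        out.append(idft(spec))
    return out

# Small illustrative batch: 4 vectors of length 8.
batch = [[complex(i + m, i - m) for m in range(8)] for i in range(4)]
weights = [1.0 / (1 + k) for k in range(8)]

a = staged(batch, weights)
b = fused(batch, weights)
max_diff = max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
```

Both versions perform the same arithmetic and produce the same result; only the loop structure differs. In pure Python the restructuring does not change performance in any meaningful way. The sketch only shows the shape of the transformation that, in compiled CPU or GPU code, keeps intermediates in cache or registers instead of main memory.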

Cited By

Ramakrishnaiah V, Beckmann B, Ehrett P, Van Oostrum R, Lowery K. (2024). Cache Cohort GPU Scheduling. Proceedings of the 16th Workshop on General Purpose Processing Using GPU, 19-25. Online publication date: 2-Mar-2024. https://dl.acm.org/doi/10.1145/3649411.3649415

Fox D, Monsalve Diaz J, Li X. (2023). A gem5 Implementation of the Sequential Codelet Model: Reducing Overhead and Expanding the Software Memory Interface. Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, 839-846. Online publication date: 12-Nov-2023. https://dl.acm.org/doi/10.1145/3624062.3624152

Index Terms

A systematic approach to improving data locality across Fourier transforms and linear algebra operations
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel algorithms
      1. Massively parallel algorithms
      2. Shared memory algorithms
2. Mathematics of computing
  1. Mathematical software
    1. Mathematical software performance

Information & Contributors

Information

Published In

ICS '21: Proceedings of the 35th ACM International Conference on Supercomputing

June 2021

506 pages

ISBN: 9781450383356

DOI: 10.1145/3447818

General Chairs: Huiyang Zhou (North Carolina State University), Jose Moreira (IBM Research)

Program Chairs: Frank Mueller (North Carolina State University), Yoav Etsion (Technion)

Copyright © 2021 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 June 2021

Qualifiers

Research-article

Funding Sources

U.S. Department of Energy
Oak Ridge National Laboratory
Lawrence Berkeley National Laboratory

Conference

ICS '21

Sponsor:

SIGARCH

ICS '21: 2021 International Conference on Supercomputing

June 14 - 17, 2021

Virtual Event, USA

Acceptance Rates

ICS '21 Paper Acceptance Rate 39 of 157 submissions, 25%;

Overall Acceptance Rate 629 of 2,180 submissions, 29%
