DOI: 10.1145/3447818.3460354
research-article
Open access

A systematic approach to improving data locality across Fourier transforms and linear algebra operations

Published: 04 June 2021

Abstract

The performance of most scientific applications depends on efficient mathematical libraries. For example, scientific applications like the plane wave based Density Functional Theory approach for electronic structure calculations use highly optimized libraries for Fourier transforms, dense linear algebra (orthogonalization) and sparse linear algebra (non-local projectors in real space). Although vendor-tuned libraries offer efficient implementations for each standalone mathematical kernel, the partitioning of those calls into sequentially invoked kernels inhibits cross-kernel optimizations that could improve data locality across memory bound operations. In this work we show that, by expressing these kernels as operations on high dimensional tensors, cross-kernel dataflow optimizations that span FFT, dense and sparse linear algebra can be readily exposed and exploited. We outline a systematic way of merging the Fourier transforms with the linear algebra computations, improving data locality and reducing data movement to main memory. We show that this streaming/dataflow approach offers 2x speedup on GPUs and 8x/12x speedup on CPUs compared to a baseline code that uses vendor-optimized libraries. Although we use Density Functional Theory to demonstrate the value of our approach, our methodology is broadly applicable to other applications that use Fourier transforms and linear algebra operations as building blocks.
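The locality argument in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the function names (`dft`, `unfused`, `fused`), the naive O(n^2) transform, and the weighted sum standing in for a linear algebra step are all assumptions chosen for illustration. The sketch contrasts two sequentially invoked "kernels" that communicate through a full intermediate array with a merged loop that consumes each spectrum as soon as it is produced.

```python
import cmath

def dft(x):
    """Naive O(n^2) discrete Fourier transform of a list of numbers."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def unfused(signals, weights):
    """Two separate passes: transform every signal, then combine them."""
    # Full intermediate: len(signals) x n values held before the second pass.
    spectra = [dft(s) for s in signals]
    n = len(signals[0])
    return [sum(w * spec[k] for w, spec in zip(weights, spectra))
            for k in range(n)]

def fused(signals, weights):
    """Merged loop: each spectrum is consumed right after it is produced."""
    n = len(signals[0])
    out = [0j] * n
    for w, s in zip(weights, signals):
        spec = dft(s)              # only one spectrum is live at a time
        for k in range(n):
            out[k] += w * spec[k]  # linear algebra step applied immediately
    return out

# Hypothetical input: 4 signals of length 8 and one weight per signal.
signals = [[complex(i + j, i - j) for j in range(8)] for i in range(4)]
weights = [0.5, -1.0, 2.0, 0.25]

a = unfused(signals, weights)
b = fused(signals, weights)
max_diff = max(abs(p - q) for p, q in zip(a, b))
print("results agree:", max_diff < 1e-9)
```

Both versions compute the same values. The difference is the size of the working set between the transform and the linear algebra step, which is the kind of cross-kernel data movement the abstract describes reducing. Real codes would rely on optimized FFT and BLAS kernels rather than the naive loops used here.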

Cited By

  • (2024) Cache Cohort GPU Scheduling. Proceedings of the 16th Workshop on General Purpose Processing Using GPU, 19-25. DOI: 10.1145/3649411.3649415. Online publication date: 2-Mar-2024
  • (2023) A gem5 Implementation of the Sequential Codelet Model: Reducing Overhead and Expanding the Software Memory Interface. Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, 839-846. DOI: 10.1145/3624062.3624152. Online publication date: 12-Nov-2023

Information & Contributors

Information

Published In

ICS '21: Proceedings of the 35th ACM International Conference on Supercomputing
June 2021
506 pages
ISBN: 9781450383356
DOI: 10.1145/3447818
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Fourier transforms
  2. high dimensional tensors
  3. linear algebra operations
  4. loop optimizations

Qualifiers

  • Research-article

Funding Sources

  • U.S. Department of Energy
  • Oak Ridge National Laboratory
  • Lawrence Berkeley National Laboratory

Conference

ICS '21

Acceptance Rates

ICS '21 Paper Acceptance Rate 39 of 157 submissions, 25%;
Overall Acceptance Rate 629 of 2,180 submissions, 29%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 151
  • Downloads (Last 6 weeks): 18
Reflects downloads up to 01 Jan 2025
