skip to main content
10.1145/3577193.3593708acmconferencesArticle/Chapter ViewAbstractPublication PagesicsConference Proceedingsconference-collections
research-article
Open access

Wafer-Scale Fast Fourier Transforms

Published: 21 June 2023 Publication History

Abstract

We have implemented fast Fourier transforms for one, two, and three-dimensional arrays on the Cerebras CS-2, a system whose memory and processing elements reside on a single silicon wafer. The wafer-scale engine (WSE) encompasses a two-dimensional mesh of roughly 850,000 processing elements (PEs) with fast local memory and equally fast nearest-neighbor interconnections.
Our wafer-scale FFT (wsFFT) parallelizes a n3 problem with up to n2 PEs. At this point, a PE processes only a single vector of the 3D domain (known as a pencil) per superstep, where each of the three supersteps performs FFT along one of the three axes of the input array. Between supersteps, wsFFT redistributes (transposes) the data to bring all elements of each one-dimensional pencil being transformed into the memory of a single PE. Each redistribution causes an all-to-all communication along one of the mesh dimensions. Given the level of parallelism, the size of the messages transmitted between pairs of PEs can be as small as a single word. In theory, a mesh is not ideal for all-to-all communication due to its limited bisection bandwidth. However, the mesh interconnecting PEs on the WSE lies entirely on-wafer and achieves nearly peak bandwidth even with tiny messages.
We analyze in detail computation and communication time, as well as the weak and strong scaling, using both FP16 and FP32 precision. With 32-bit arithmetic on the CS-2, we achieve 959 microseconds for 3D FFT of a 5123 complex input array using a 512 × 512 subgrid of the on-wafer PEs. This is the largest ever parallelization for this problem size and the first implementation that breaks the millisecond barrier.

References

[1]
Alan Ayala, Stanimire Tomov, Azzam Haidar, and Jack Dongarra. 2020. heFFTe: Highly efficient FFT for exascale. In International Conference on Computational Science. Springer, Amsterdam, 262--275.
[2]
Alan Ayala, Stanimire Tomov, Piotr Luszczek, Sébastien Cayrols, Gerald Ragghianti, and Jack Dongarra. 2020. Interim Report on Benchmarking FFT Libraries on High Performance Systems.
[3]
D.H. Bailey, E. Barszcz, J.T. Barton, D.S. Browning, R.L. Carter, L. Dagum, R.A. Fatoohi, P.O. Frederickson, T.A. Lasinski, R.S. Schreiber, H.D. Simon, V. Venkatakrishnan, and S.K. Weeratunga. 1991. The NAS Parallel Benchmarks. The International Journal of Supercomputing Applications 5, 3 (1991), 63--73.
[4]
Christian Bell, Dan Bonachea, Rajesh Nishtala, and Katherine Yelick. 2006. Optimizing bandwidth limited problems using one-sided communication and overlap. In Proceedings 20th IEEE International Parallel & Distributed Processing Symposium. IEEE, Rhodes, 10--pp.
[5]
James W Cooley and John W Tukey. 1965. An algorithm for the machine calculation of complex Fourier series. Mathematics of computation 19, 90 (1965), 297--301.
[6]
Lisandro Dalcin. 2019. MPI for Python. https://github.com/mpi4py/mpi4py-fft.
[7]
Lisandro Dalcin, Mikael Mortensen, and David E Keyes. 2019. Fast parallel multidimensional FFT using advanced MPI. J. Parallel and Distrib. Comput. 128 (2019), 137--150.
[8]
Hong Q Ding, Robert D Ferraro, and Donald B Gennery. 1995. A Portable 3D FFT Package for Distributed-Memory Parallel Architectures. In PPSC. SIAM, San Francisco, 70--71.
[9]
Ian T Foster and Patrick H Worley. 1997. Parallel algorithms for the spectral transform method. SIAM Journal on Scientific Computing 18, 3 (1997), 806--837.
[10]
M. Frigo and S.G. Johnson. 2005. The Design and Implementation of FFTW3. Proc. IEEE 93, 2 (2005), 216--231.
[11]
Anshul Gupta and Vipin Kumar. 1993. The scalability of FFT on parallel computers. IEEE Transactions on Parallel and Distributed Systems 4, 8 (1993), 922--932.
[12]
Salman Habib, Adrian Pope, Hal Finkel, Nicholas Frontiere, Katrin Heitmann, David Daniel, Patricia Fasel, Vitali Morozov, George Zagaris, Tom Peterka, Venkatram Vishwanath, Zarija Lukić, Saba Sehrish, and Wei keng Liao. 2016. HACC: Simulating sky surveys on state-of-the-art supercomputing architectures. New Astronomy 42 (jan 2016), 49--65.
[13]
Intel. 2021. Math Kernel Library (MKL). https://software.intel.com/mkl.
[14]
Rajkumar Kettimuthu and Sankara Muthukrishnan. 2005. A performance study of parallel FFT in clos and mesh networks. In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, PDPTA. 27--30.
[15]
LAMMPS. 2021. Molecular Dynamics Simulator. https://www.lammps.org.
[16]
Myoungkyu Lee, Nicholas Malaya, and Robert D Moser. 2013. Petascale direct numerical simulation of turbulent channel flow on up to 786k cores. In SC'13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE, 1--11.
[17]
Sean Lie. 2021. Multi-Million Core, Multi-Wafer AI Cluster. In 2021 IEEE Hot Chips 33 Symposium (HCS). IEEE Computer Society, 1--41.
[18]
Tianjian Lu, Yi-Fan Chen, Blake Hechtman, Tao Wang, and John Anderson. 2021. Large-scale discrete Fourier transform on TPUs. IEEE Access 9 (2021), 93422--93432.
[19]
Kapil K Mathur and S Lennart Johnsson. 1995. All-to-all communication on the connection machine CM-200. Scientific Programming 4, 4 (1995), 251--273.
[20]
Timothy Pickett Morgan. 2018. Peeling the Covers Off the Summit Supercomputer. https://www.nextplatform.com/2018/06/26/peeling-the-covers-off-the-summit-supercomputer/.
[21]
NumPy. 2022. Discrete Fourier Transform. https://numpy.org/doc/stable/reference/routines.fft.html.
[22]
Nvidia. 2021. CuFFT. https://developer.nvidia.com/cufft.
[23]
M. Orenes-Vera, E. Tureci, D. Wentzlaff, and M. Martonosi. 2023. Dalorex: A Data-Local Program Execution and Architecture for Memory-bound Applications. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE Computer Society, Los Alamitos, CA, USA, 718--730.
[24]
ORNL. 2018. Summit Supercomputer. https://www.ornl.gov/news/ornl-launches-summit-supercomputer.
[25]
Dmitry Pekurovsky. 2012. P3DFFT: A framework for parallel computations of Fourier transforms in three dimensions. SIAM Journal on Scientific Computing 34, 4 (2012), C192--C209.
[26]
Michael Pippig. 2013. PFFT: An extension of FFTW to massively parallel architectures. SIAM Journal on Scientific Computing 35, 3 (2013), C213--C236.
[27]
Steve Plimpton. 2018. fftMPI, a distributed-memory parallel FFT library. https://lammps.github.io/fftmpi/.
[28]
Kamil Rocki, Dirk Van Essendelft, Ilya Sharapov, Robert Schreiber, Michael Morrison, Vladimir Kibardin, Andrey Portnoy, Jean Francois Dietiker, Madhava Syamlal, and Michael James. 2020. Fast stencil-code computation on a wafer-scale processor. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1--14.
[29]
Paul N Swarztrauber. 1982. Vectorizing the FFTs. In Parallel computations. Elsevier, 51--83.
[30]
Daisuke Takahashi. 2009. An implementation of parallel 3-D FFT with 2-D decomposition on a massively parallel cluster of multi-core processors. In International Conference on Parallel Processing and Applied Mathematics. Springer, 606--614.
[31]
Daisuke Takahashi, Mitsuhisa Sato, and Taisuke Boku. 2003. An OpenMP implementation of parallel FFT and its performance on IA-64 processors. In International Workshop on OpenMP Applications and Tools. Springer, 99--108.
[32]
Charles Van Loan. 1992. Computational frameworks for the fast Fourier transform. SIAM.

Cited By

View all

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
ICS '23: Proceedings of the 37th ACM International Conference on Supercomputing
June 2023
505 pages
ISBN:9798400700569
DOI:10.1145/3577193
This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 June 2023

Check for updates

Author Tags

  1. FFT
  2. wafer-scale
  3. strong scaling
  4. massive parallelization
  5. explicit communication
  6. mesh
  7. network
  8. bandwidth
  9. programming models

Qualifiers

  • Research-article

Conference

ICS '23
Sponsor:

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)402
  • Downloads (Last 6 weeks)63
Reflects downloads up to 07 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2024)EA4RCA: Efficient AIE accelerator design framework for regular Communication-Avoiding AlgorithmACM Transactions on Architecture and Code Optimization10.1145/367801021:4(1-24)Online publication date: 15-Jul-2024
  • (2024)Near-Optimal Wafer-Scale ReduceProceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing10.1145/3625549.3658693(334-347)Online publication date: 3-Jun-2024
  • (2024)CereSZ: Enabling and Scaling Error-bounded Lossy Compression on Cerebras CS-2Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing10.1145/3625549.3658691(309-321)Online publication date: 3-Jun-2024
  • (2024)Performance evaluation and modelling of single-precision matrix multiplication on Cerebras CS-2SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1109/SCW63240.2024.00101(727-731)Online publication date: 17-Nov-2024
  • (2024)MuchiSim: A Simulation Framework for Design Exploration of Multi-Chip Manycore Systems2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)10.1109/ISPASS61541.2024.00015(48-60)Online publication date: 5-May-2024

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media