research-article

A massively parallel adaptive fast-multipole method on heterogeneous architectures

Authors:

Aparna Chandramowlishwaran,

Harper Langston,

Tuan-Anh Nguyen,

Aashay Shringarpure,

George Biros

SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis

Article No.: 58, Pages 1 - 12

https://doi.org/10.1145/1654059.1654118

Published: 14 November 2009

Abstract

We present new scalable algorithms and a new implementation of our kernel-independent fast multipole method (Ying et al. ACM/IEEE SC '03), in which we employ both distributed memory parallelism (via MPI) and shared memory/streaming parallelism (via GPU acceleration) to rapidly evaluate two-body non-oscillatory potentials. On traditional CPU-only systems, our implementation scales well up to 30 billion unknowns on 65K cores (AMD/CRAY-based Kraken system at NSF/NICS) for highly non-uniform point distributions. On GPU-enabled systems, we achieve a 30x speedup for problems of up to 256 million points on 256 GPUs (Lincoln at NSF/NCSA) over comparable CPU-only implementations.
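
For context, the quantity that a fast multipole method accelerates is an all-pairs sum of a two-body potential. The sketch below is not the authors' code. It assumes the 3D Laplace kernel 1/(4πr) as one example of a non-oscillatory kernel, and the name `direct_potential` is hypothetical. Direct summation costs O(N²) operations, which is the cost an FMM is designed to avoid.

```python
import math

def direct_potential(targets, sources, densities):
    """Direct O(N^2) evaluation of u(x_i) = sum_j q_j / (4*pi*|x_i - y_j|).

    Assumption: 3D Laplace kernel. Pairs with zero separation are skipped.
    """
    potentials = []
    for x in targets:
        total = 0.0
        for y, q in zip(sources, densities):
            r = math.dist(x, y)  # Euclidean distance (Python 3.8+)
            if r > 0.0:
                total += q / (4.0 * math.pi * r)
        potentials.append(total)
    return potentials

# Two sources, evaluated at one source location and at one separate point.
sources = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
densities = [1.0, 2.0]
targets = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
u = direct_potential(targets, sources, densities)
```

A routine of this kind is also a convenient reference for checking the accuracy of an approximate evaluation on small inputs.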

We achieve scalability to such extreme core counts by adopting a new approach to scalable MPI-based tree construction and partitioning, and a new reduction algorithm for the evaluation phase. For the sub-components of the evaluation phase (the direct and approximate interactions, the target evaluation, and the source-to-multipole translations), we use NVIDIA's CUDA framework for GPU acceleration to achieve excellent performance. Doing so requires carefully constructed data structure transformations, which we describe in the paper and whose cost we show is minor. Taken together, these components show promise for ultrascalable FMM in the petascale era and beyond.
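
The abstract does not spell out the partitioning scheme. One common approach in octree-based codes, shown here only as an illustration, is to order points along a Morton (Z-order) space-filling curve and hand each process a contiguous chunk of the sorted list. The sketch below is serial, assumes points in the unit cube, and uses hypothetical names (`morton_key`, `partition_by_morton`); a distributed code would replace the `sorted` call with a parallel sort.

```python
def morton_key(ix, iy, iz, bits=10):
    """Interleave the low `bits` bits of three integer grid coordinates."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (3 * b)
        key |= ((iy >> b) & 1) << (3 * b + 1)
        key |= ((iz >> b) & 1) << (3 * b + 2)
    return key

def partition_by_morton(points, num_ranks, bits=10):
    """Sort points in the unit cube by Morton key, then split into contiguous chunks."""
    cells = 1 << bits

    def key(p):
        # Clamp so that a coordinate equal to 1.0 maps to the last cell.
        ix, iy, iz = (min(int(c * cells), cells - 1) for c in p)
        return morton_key(ix, iy, iz, bits)

    ordered = sorted(points, key=key)
    n = len(ordered)
    # Chunk r holds sorted indices [r*n // num_ranks, (r+1)*n // num_ranks).
    return [ordered[r * n // num_ranks:(r + 1) * n // num_ranks]
            for r in range(num_ranks)]

points = [(0.9, 0.9, 0.9), (0.1, 0.1, 0.1), (0.6, 0.1, 0.1), (0.1, 0.6, 0.1)]
chunks = partition_by_morton(points, 2)
```

Because nearby points tend to receive nearby keys, contiguous chunks of the sorted list correspond to spatially compact regions, which is why this ordering is often used for load balancing non-uniform point sets.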

References

[1]

NVIDIA CUDA (Compute Unified Device Architecture): Programming Guide, Version 2.1, December 2008.

[2]

P. Ajmera, R. Goradia, S. Chandran, and S. Aluru, Fast, parallel, gpu-based construction of space filling curves and octrees, in I3D '08: Proceedings of the 2008 symposium on Interactive 3D graphics and games, ACM, 2008, pp. 1--1.

[3]

S. Balay, K. Buschelman, W. D. Gropp, D. Kaushik, M. Knepley, L. C. McInnes, B. F. Smith, and H. Zhang, PETSc home page, 2001. http://www.mcs.anl.gov/petsc.

[4]

H. Cheng, L. Greengard, and V. Rokhlin, A fast adaptive multipole algorithm in three dimensions, Journal of Computational Physics, 155 (1999), pp. 468--498.

[5]

A. Grama, A. Gupta, G. Karypis, and V. Kumar, An Introduction to Parallel Computing: Design and Analysis of Algorithms, Addison Wesley, second ed., 2003.

[6]

L. Greengard and V. Rokhlin, A fast algorithm for particle simulations, Journal of Computational Physics, 73 (1987), pp. 325--348.

[7]

N. A. Gumerov and R. Duraiswami, Fast multipole methods on graphics processors, Journal of Computational Physics, 227 (2008), pp. 8290--8313.

[8]

B. Hariharan and S. Aluru, Efficient parallel algorithms and software for compressed octrees with applications to hierarchical methods, Parallel Computing, 31 (2005), pp. 311--331.

[9]

J. JáJá, An Introduction to Parallel Algorithms, Addison Wesley, 1992.

[10]

J. Kurzak and B. M. Pettitt, Massively parallel implementation of a fast multipole method for distributed memory machines, Journal of Parallel and Distributed Computing, 65 (2005), pp. 870--881.

[11]

S. Ogata, T. J. Campbell, R. K. Kalia, A. Nakano, P. Vashishta, and S. Vemparala, Scalable and portable implementation of the fast multipole method on parallel computers, Computer Physics Communications, 153 (2003), pp. 445--461.

[12]

J. C. Phillips, J. E. Stone, and K. Schulten, Adapting a message-driven parallel application to GPU-accelerated clusters, in SC'08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing, 2008, pp. 1--9.

[13]

R. Sampath, S. S. Adavani, H. Sundar, I. Lashuk, and G. Biros, Dendro home page, 2008.

[14]

R. S. Sampath, S. S. Adavani, H. Sundar, I. Lashuk, and G. Biros, Dendro: parallel algorithms for multigrid and AMR methods on 2:1 balanced octrees, in SC '08: Proceedings of the 2008 ACM/IEEE conference on Supercomputing, Piscataway, NJ, USA, 2008, IEEE Press, pp. 1--12.

[15]

F. E. Sevilgen and S. Aluru, A unifying data structure for hierarchical methods, in Proceedings of Supercomputing, The SCxy Conference series, Portland, Oregon, November 1999, ACM/IEEE.

[16]

H. Sundar, R. S. Sampath, and G. Biros, Bottom-up construction and 2:1 balance refinement of linear octrees in parallel, SIAM Journal on Scientific Computing, 30 (2008), pp. 2675--2708.

[17]

S.-H. Teng, Provably good partitioning and load balancing algorithms for parallel adaptive N-body simulation, SIAM Journal on Scientific Computing, 19 (1998).

[18]

M. S. Warren and J. K. Salmon, A parallel hashed octtree N-body algorithm, in Proceedings of Supercomputing, The SCxy Conference series, Portland, Oregon, November 1993, ACM/IEEE.

[19]

L. Ying, G. Biros, H. Langston, and D. Zorin, KIFMM3D: The kernel-independent fast multipole (FMM) 3D code. GPL license.

[20]

L. Ying, G. Biros, and D. Zorin, A kernel-independent adaptive fast multipole method in two and three dimensions, Journal of Computational Physics, 196 (2004), pp. 591--626.

[21]

L. Ying, G. Biros, D. Zorin, and H. Langston, A new parallel kernel-independent fast multipole algorithm, in Proceedings of SC03, The SCxy Conference series, Phoenix, Arizona, November 2003, ACM/IEEE.

Published In

SC '09: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis

November 2009

778 pages

ISBN:9781605587448

DOI:10.1145/1654059

Conference Chair:
Wilfred Pinfold

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SC '09

SC '09: International Conference for High Performance Computing, Networking, Storage and Analysis

November 14 - 20, 2009

Portland, Oregon

Acceptance Rates

SC '09 Paper Acceptance Rate 59 of 261 submissions, 23%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Article Metrics

Total Citations: 61
Total Downloads: 161
Downloads (Last 12 months): 26
Downloads (Last 6 weeks): 2

Reflects downloads up to 06 Jan 2025

Cited By

Zulian P, Ben Bader S, Fourestey G, Krause R, Rossinelli D (2024) Data-centric workloads with MPI_Sort, Journal of Parallel and Distributed Computing, 10.1016/j.jpdc.2023.104833, (104833). Online publication date: Jan-2024
https://doi.org/10.1016/j.jpdc.2023.104833
Sengupta B, Lee Y, Araghizadeh M, Myong R, Lee H (2024) Comparative Analysis of Direct Method and Fast Multipole Method for Multirotor Wake Dynamics, International Journal of Aeronautical and Space Sciences, 10.1007/s42405-023-00699-w, 25:3, (789-808). Online publication date: 10-Jan-2024
https://doi.org/10.1007/s42405-023-00699-w
Kan Y, Kärtner F, Le Borne S, Zemke J (2023) A GPU-parallelized interpolation-based fast multipole method for the relativistic space-charge field calculation, Computer Physics Communications, 10.1016/j.cpc.2023.108825, (108825). Online publication date: Jun-2023
https://doi.org/10.1016/j.cpc.2023.108825
Watschinger R, Merta M, Of G, Zapletal J (2022) A Parallel Fast Multipole Method for a Space-Time Boundary Element Method for the Heat Equation, SIAM Journal on Scientific Computing, 10.1137/21M1430157, 44:4, (C320-C345). Online publication date: 1-Jan-2022
https://dl.acm.org/doi/10.1137/21M1430157
Hampl S, Waggoner M, Gallud Cidoncha X, Petro E, Lozano P (2022) Comparison of computational algorithms for simulating an electrospray plume with a n-body approach, Journal of Electric Propulsion, 10.1007/s44205-022-00015-w, 1:1. Online publication date: 7-Oct-2022
https://doi.org/10.1007/s44205-022-00015-w
Nöttgen H, Czappa F, Wolf F (2022) Accelerating Brain Simulations with the Fast Multipole Method, Euro-Par 2022: Parallel Processing, 10.1007/978-3-031-12597-3_24, (387-402). Online publication date: 22-Aug-2022
https://dl.acm.org/doi/10.1007/978-3-031-12597-3_24
Wyrzykowski R, Deelman E, Kohnke B, Kutzner C, Beckmann A, Lube G, Kabadshow I, Dachsel H, Grubmüller H (2021) A CUDA fast multipole method with highly efficient M2L far field evaluation, International Journal of High Performance Computing Applications, 10.1177/1094342020964857, 35:1, (97-117). Online publication date: 1-Jan-2021
https://dl.acm.org/doi/10.1177/1094342020964857
Bramas B (2020) TBFMM: A C++ generic and parallel fast multipole method library, Journal of Open Source Software, 10.21105/joss.02444, 5:56, (2444). Online publication date: Dec-2020
https://doi.org/10.21105/joss.02444
Lingg M, Hughey S, Dikbayir D, Shanker B, Aktulga H (2020) Exploring Task Parallelism for the Multilevel Fast Multipole Algorithm, 2020 IEEE 27th International Conference on High Performance Computing, Data, and Analytics (HiPC), 10.1109/HiPC50609.2020.00018, (41-50). Online publication date: Dec-2020
https://doi.org/10.1109/HiPC50609.2020.00018
Saunders W, Grant J, Müller E, Thompson I (2020) Fast electrostatic solvers for kinetic Monte Carlo simulations, Journal of Computational Physics, 10.1016/j.jcp.2020.109379, (109379). Online publication date: Mar-2020
https://doi.org/10.1016/j.jcp.2020.109379
