DOI: 10.5555/3433701.3433814

Pencil: a pipelined algorithm for distributed stencils

Published: 09 November 2020

Abstract

Stencil computations are at the core of various Computational Fluid Dynamics (CFD) applications and have been well-studied for several decades. They are typically highly memory-bound, and as a result numerous tiling algorithms have been proposed to improve their performance. Although efficient, most of these algorithms are designed for single iteration spaces on shared-memory machines. In CFD, however, we are confronted with multi-block structured grids composed of multiple connected iteration spaces distributed across many nodes.
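
To see why such kernels are memory-bound, consider a minimal 5-point Jacobi sweep in C. This is a generic textbook kernel, not code from the paper: each update performs roughly five loads and one store for only four additions and one multiply, so throughput is limited by memory traffic rather than arithmetic, which is exactly what cache tiling targets.

    /* Minimal 5-point Jacobi stencil sweep (generic illustration, not from
     * the paper). Each interior point becomes the average of its four
     * neighbors: ~5 loads and 1 store against 4 adds and 1 multiply, so the
     * loop is bound by memory bandwidth on modern CPUs. */
    void jacobi_sweep(int nx, int ny, const double *in, double *out)
    {
        for (int i = 1; i < nx - 1; i++)
            for (int j = 1; j < ny - 1; j++)
                out[i * ny + j] = 0.25 * (in[(i - 1) * ny + j] +
                                          in[(i + 1) * ny + j] +
                                          in[i * ny + j - 1]  +
                                          in[i * ny + j + 1]);
    }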
In this paper, we propose a pipelined stencil algorithm called Pencil for distributed-memory machines that applies to practical CFD problems spanning multiple iteration spaces. Based on an in-depth analysis of cache tiling on a single node, we first identify both the optimal combination of MPI and OpenMP for temporal tiling and the best tiling approach, which outperforms the state-of-the-art automatic parallelization tool Pluto by up to 1.92×. Then, we adopt DeepHalo to decouple the multiple connected iteration spaces so that temporal tiling can be applied to each space. Finally, we achieve overlap by pipelining the computation and communication without sacrificing the benefit of temporal cache tiling. Pencil is evaluated using 4 stencils across 6 numerical schemes on two distributed-memory machines with Omni-Path and InfiniBand networks. On the Omni-Path system, Pencil exhibits outstanding weak and strong scalability for up to 128 nodes and outperforms MPI+OpenMP Funneled with space tiling by 1.33--3.41× on a multi-block grid with 32 nodes.
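
The pipelining in the final step builds on the classic overlap pattern for distributed stencils, sketched below in C under simplifying assumptions: a 1D decomposition so ghost rows are contiguous, precomputed neighbor ranks up and down (MPI_PROC_NULL at domain edges, where ghost rows are assumed to hold boundary values), and illustrative helper names of our own. This is the textbook nonblocking halo exchange, not the Pencil algorithm itself, which additionally preserves temporal cache tiling across the overlap.

    /* Sketch of computation/communication overlap for a distributed stencil
     * (generic pattern, not the paper's implementation). Rows 0 and nx-1 of
     * `in` are ghost rows exchanged with the up/down neighbors. */
    #include <mpi.h>

    static void sweep_rows(int i0, int i1, int ny,
                           const double *in, double *out)
    {
        for (int i = i0; i < i1; i++)
            for (int j = 1; j < ny - 1; j++)
                out[i * ny + j] = 0.25 * (in[(i - 1) * ny + j] +
                                          in[(i + 1) * ny + j] +
                                          in[i * ny + j - 1]  +
                                          in[i * ny + j + 1]);
    }

    void timestep_overlapped(int nx, int ny, double *in, double *out,
                             int up, int down, MPI_Comm comm)
    {
        MPI_Request reqs[4];

        /* 1. Post nonblocking exchange of the top and bottom ghost rows. */
        MPI_Irecv(&in[0],             ny, MPI_DOUBLE, up,   0, comm, &reqs[0]);
        MPI_Irecv(&in[(nx - 1) * ny], ny, MPI_DOUBLE, down, 1, comm, &reqs[1]);
        MPI_Isend(&in[1 * ny],        ny, MPI_DOUBLE, up,   1, comm, &reqs[2]);
        MPI_Isend(&in[(nx - 2) * ny], ny, MPI_DOUBLE, down, 0, comm, &reqs[3]);

        /* 2. Update interior rows, which never touch ghost data, while the
         *    network makes progress on the exchange. */
        sweep_rows(2, nx - 2, ny, in, out);

        /* 3. Complete the exchange, then update the two boundary rows that
         *    depend on the freshly received ghost rows. */
        MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
        sweep_rows(1, 2, ny, in, out);
        sweep_rows(nx - 2, nx - 1, ny, in, out);
    }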



Published In

SC '20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2020
1454 pages
ISBN: 9781728199986

In-Cooperation

  • IEEE CS

Publisher

IEEE Press


Author Tags

  1. cache tiling
  2. computational fluid dynamics
  3. distributed-memory machines
  4. multiple connected iteration spaces
  5. pipelining
  6. stencils

Qualifiers

  • Research-article

Conference

SC '20

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%

Article Metrics

  • Total citations: 0
  • Total downloads: 159
  • Downloads (last 12 months): 8
  • Downloads (last 6 weeks): 0

Reflects downloads up to 23 Jan 2025
