- research-article, February 2024
ConvStencil: Transform Stencil Computation to Matrix Multiplication on Tensor Cores
- Yuetao Chen,
- Kun Li,
- Yuhao Wang,
- Donglin Bai,
- Lei Wang,
- Lingxiao Ma,
- Liang Yuan,
- Yunquan Zhang,
- Ting Cao,
- Mao Yang
PPoPP '24: Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, Pages 333–347
https://doi.org/10.1145/3627535.3638476
Tensor Core Unit (TCU) is increasingly integrated into modern high-performance processors to enhance matrix multiplication performance. However, constrained to its over-specification, its potential for improving other critical scientific operations like ...
- research-article, June 2023
SPARTA: Spatial Acceleration for Efficient and Scalable Horizontal Diffusion Weather Stencil Computation
- Gagandeep Singh,
- Alireza Khodamoradi,
- Kristof Denolf,
- Jack Lo,
- Juan Gomez-Luna,
- Joseph Melber,
- Andra Bisca,
- Henk Corporaal,
- Onur Mutlu
ICS '23: Proceedings of the 37th ACM International Conference on Supercomputing, Pages 463–476
https://doi.org/10.1145/3577193.3593719
Fast and accurate climate simulations and weather predictions are critical for understanding and preparing for the impact of climate change. Real-world climate and weather simulations involve the use of complex compound stencil kernels, which are ...
- research-article, January 2023
A Scalable Many-core Overlay Architecture on an HBM2-enabled Multi-Die FPGA
ACM Transactions on Reconfigurable Technology and Systems (TRETS), Volume 16, Issue 1, Article No.: 15, Pages 1–33
https://doi.org/10.1145/3547657
The overlay architecture enables to raise the abstraction level of hardware design and enhances hardware-accelerated applications’ portability. In FPGAs, there is a growing awareness of the overlay structure as typified by many-core architecture. It works ...
- research-article, November 2022
Scalable distributed high-order stencil computations
SC '22: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, Article No.: 30, Pages 1–13
Stencil computations lie at the heart of many scientific and industrial applications. Stencil algorithms pose several challenges on machines with cache based memory hierarchy, due to low re-use of memory accesses if special care is not taken to optimize ...
- research-article, June 2022
Toward accelerated stencil computation by adapting tensor core unit on GPU
ICS '22: Proceedings of the 36th ACM International Conference on Supercomputing, Article No.: 28, Pages 1–12
https://doi.org/10.1145/3524059.3532392
The Tensor Core Unit (TCU) has been increasingly adopted on modern high performance processors, specialized in boosting the performance of general matrix multiplication (GEMM). Due to its highly optimized hardware design, TCU can significantly ...
-
Temporal vectorization for stencils
SC '21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Article No.: 82, Pages 1–13
https://doi.org/10.1145/3458817.3476149
Stencil computations represent a very common class of nested loops in scientific and engineering applications. Exploiting vector units in modern CPUs is crucial to achieving peak performance. Previous vectorization approaches often consider the data ...
- research-article, August 2021
Enhancing the Scalability of Multi-FPGA Stencil Computations via Highly Optimized HDL Components
- Enrico Reggiani,
- Emanuele Del Sozzo,
- Davide Conficconi,
- Giuseppe Natale,
- Carlo Moroni,
- Marco D. Santambrogio
ACM Transactions on Reconfigurable Technology and Systems (TRETS), Volume 14, Issue 3, Article No.: 15, Pages 1–33
https://doi.org/10.1145/3461478
Stencil-based algorithms are a relevant class of computational kernels in high-performance systems, as they appear in a plethora of fields, from image processing to seismic simulations, from numerical methods to physical modeling. Among the various ...
- research-article, July 2021
Fast Stencil Computations using Fast Fourier Transforms
SPAA '21: Proceedings of the 33rd ACM Symposium on Parallelism in Algorithms and Architectures, Pages 8–21
https://doi.org/10.1145/3409964.3461803
Stencil computations are widely used to simulate the change of state of physical systems across a multidimensional grid over multiple timesteps. The state-of-the-art techniques in this area fall into three groups: cache-aware tiled looping algorithms, ...
- research-article, February 2020
High-level hardware feature extraction for GPU performance prediction of stencils
GPGPU '20: Proceedings of the 13th Annual Workshop on General Purpose Processing using Graphics Processing Unit, Pages 21–30
https://doi.org/10.1145/3366428.3380769
High-level functional programming abstractions have started to show promising results for HPC (High-Performance Computing). Approaches such as Lift, Futhark or Delite have shown that it is possible to have both, high-level abstractions and performance, ...
- research-article, November 2019
Automatically translating image processing libraries to halide
ACM Transactions on Graphics (TOG), Volume 38, Issue 6, Article No.: 204, Pages 1–13
https://doi.org/10.1145/3355089.3356549
This paper presents Dexter, a new tool that automatically translates image processing functions from a low-level general-purpose language to a high-level domain-specific language (DSL), allowing them to leverage cross-platform optimizations enabled by ...
- research-article, September 2019
PIMS: a lightweight processing-in-memory accelerator for stencil computations
MEMSYS '19: Proceedings of the International Symposium on Memory Systems, Pages 41–52
https://doi.org/10.1145/3357526.3357550
Stencil computation is a classic computational kernel present in many high-performance scientific applications, like image processing and partial differential equation solvers (PDE). A stencil computation sweeps over a multi-dimensional grid and ...
- research-article, August 2018
Communication-Avoiding for Dynamical Core of Atmospheric General Circulation Model
ICPP '18: Proceedings of the 47th International Conference on Parallel Processing, Article No.: 12, Pages 1–10
https://doi.org/10.1145/3225058.3225140
Dynamical core is one of the most time-consuming parts in the global atmospheric general circulation model, which is widely used for the numerical simulation of the dynamic evolution process of global atmosphere. Due to its complicated calculation ...
- research-article, May 2018
Extreme-scale realistic stencil computations on sunway taihulight with ten million cores
CCGrid '18: Proceedings of the 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, Pages 566–571
https://doi.org/10.1109/CCGRID.2018.00086
Stencil computation arises from a large variety of scientific and engineering applications and often plays a critical role in the performance of extreme-scale simulations. Due to the memory bound nature, it is a challenging task to optimize stencil ...
- poster, February 2018
An Optimal Microarchitecture for Stencil Computation with Data Reuse and Fine-Grained Parallelism: (Abstract Only)
FPGA '18: Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Page 286
https://doi.org/10.1145/3174243.3174964
Stencil computation is one of the most important kernels for many applications such as image processing, solving partial differential equations, and cellular automata. Nevertheless, implementing a high throughput stencil kernel is not trivial due to its ...
- research-article, November 2017
Tessellating stencils
SC '17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Article No.: 49, Pages 1–13
https://doi.org/10.1145/3126908.3126920
Stencil computations represent a very common class of nested loops in scientific and engineering applications. The exhaustively studied tiling is one of the most powerful transformation techniques to explore the data locality and parallelism. Unlike ...
- article, April 2017
TOAST: Automatic tiling for iterative stencil computations on GPUs
Concurrency and Computation: Practice & Experience (CCOMP), Volume 29, Issue 8, Page n/a
https://doi.org/10.1002/cpe.4053
The stencil pattern is important in many scientific and engineering domains, spurring great interest from researchers and industry. In recent years, various optimizations have been proposed for parallel stencil applications running on graphics ...
- research-article, October 2017
Scalable Failure Masking for Stencil Computations using Ghost Region Expansion and Cell to Rank Remapping
- Marc Gamell,
- Keita Teranishi,
- Hemanth Kolla,
- Jackson Mayo,
- Michael A. Heroux,
- Jacqueline Chen,
- Manish Parashar
SIAM Journal on Scientific Computing (SISC), Volume 39, Issue 5, Pages S347–S378
https://doi.org/10.1137/16M1081610
In order to achieve exascale systems, application resilience needs to be addressed. Some programming models, such as task-DAG (directed acyclic graphs) architectures, currently embed resilience features whereas traditional SPMD (single program, multiple ...
- research-article, November 2015
Reducing overhead in the Uintah framework to support short-lived tasks on GPU-heterogeneous architectures
WOLFHPC '15: Proceedings of the 5th International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing, Article No.: 4, Pages 1–8
https://doi.org/10.1145/2830018.2830023
The Uintah computational framework is used for the parallel solution of partial differential equations on adaptive mesh refinement grids using modern supercomputers. Uintah is structured with an application layer and a separate runtime system. The ...
- research-article, June 2015
Helium: lifting high-performance stencil kernels from stripped x86 binaries to halide DSL code
- Charith Mendis,
- Jeffrey Bosboom,
- Kevin Wu,
- Shoaib Kamil,
- Jonathan Ragan-Kelley,
- Sylvain Paris,
- Qin Zhao,
- Saman Amarasinghe
PLDI '15: Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, Pages 391–402
https://doi.org/10.1145/2737924.2737974
Highly optimized programs are prone to bit rot, where performance quickly becomes suboptimal in the face of new hardware and compiler techniques. In this paper we show how to automatically lift performance-critical stencil kernels from a stripped x86 ...
Also Published in:
ACM SIGPLAN Notices: Volume 50 Issue 6
- research-article, October 2014
Trace-Driven Memory Access Pattern Recognition in Computational Kernels
WOSC '14: Proceedings of the Second Workshop on Optimizing Stencil Computations, Pages 25–32
https://doi.org/10.1145/2686745.2686748
Classifying memory access patterns is paramount to the selection of the right set of optimizations and determination of the parallelization strategy. Static analyses suffer from ambiguities present in source code, which modern compilation techniques, ...