Keyword: graphics processing units : Search

research-article

Open Access

Taking One for the Team: Trading Overhead and Blocking for Optimal Critical-Section Granularity with a Shared GPU

RTNS '24: Proceedings of the 32nd International Conference on Real-Time Networks and Systems, Pages 94–104, https://doi.org/10.1145/3696355.3696368

Conventional wisdom maintains that the interval of time mutually exclusive access is granted to a shared resource (i.e., in a critical section) should be as short as possible. However, the arbitration of shared-resource accesses introduces overhead. As a ...

research-article

Open Access

Refining HPCToolkit for application performance analysis at exascale

International Journal of High Performance Computing Applications (SAGE-HPCA), Volume 38, Issue 6, Pages 612–632, https://doi.org/10.1177/10943420241277839

As part of the US Department of Energy’s Exascale Computing Project (ECP), Rice University has been refining its HPCToolkit performance tools to better support measurement and analysis of applications executing on exascale supercomputers. To efficiently ...

research-article

Open Access

Fast Kronecker Matrix-Matrix Multiplication on GPUs

PPoPP '24: Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, Pages 390–403, https://doi.org/10.1145/3627535.3638489

Kronecker Matrix-Matrix Multiplication (Kron-Matmul) is the multiplication of a matrix with the Kronecker Product of several smaller matrices. Kron-Matmul is a core operation for many scientific and machine learning computations. State-of-the-art Kron-...

research-article

Acceleration of a parallel BDDC solver by using graphics processing units on subdomains

International Journal of High Performance Computing Applications (SAGE-HPCA), Volume 37, Issue 2, Pages 151–164, https://doi.org/10.1177/10943420221136873

An approach to accelerating a parallel domain decomposition (DD) solver by graphics processing units (GPUs) is investigated. The solver is based on the Balancing Domain Decomposition Method by Constraints (BDDC), which is a nonoverlapping DD technique. ...

research-article

Open Access

PeleC: An adaptive mesh refinement solver for compressible reacting flows

International Journal of High Performance Computing Applications (SAGE-HPCA), Volume 37, Issue 2, Pages 115–131, https://doi.org/10.1177/10943420221121151

Reacting flow simulations for combustion applications require extensive computing capabilities. Leveraging the AMReX library, the Pele suite of combustion simulation tools targets the largest supercomputers available and future exascale machines. We ...

research-article

Open Access

Compressed basis GMRES on high-performance graphics processing units

International Journal of High Performance Computing Applications (SAGE-HPCA), Volume 37, Issue 2, Pages 82–100, https://doi.org/10.1177/10943420221115140

Krylov methods provide a fast and highly parallel numerical tool for the iterative solution of many large-scale sparse linear systems. To a large extent, the performance of practical realizations of these methods is constrained by the communication ...

research-article

High-Performance Implementation of the Identity-Based Signature Scheme in IEEE P1363 on GPU

ACM Transactions on Embedded Computing Systems (TECS), Volume 22, Issue 2, Article No.: 25, Pages 1–35, https://doi.org/10.1145/3564784

Identity-based cryptography is proposed to solve the complicated certificate management of traditional public-key cryptography. The pairing computation and high-level tower extension field arithmetic turn out to be the performance bottleneck of pairing-...

research-article

Designing Virtual Memory System of MCM GPUs

MICRO '22: Proceedings of the 55th Annual IEEE/ACM International Symposium on Microarchitecture, Pages 404–422, https://doi.org/10.1109/MICRO56248.2022.00036

Multi-Chip Module (MCM) designs have emerged as a key technique to scale up a GPU's compute capabilities in the face of slowing transistor technology. However, the disaggregated nature of MCM GPUs with many chiplets connected via in-package ...

research-article

GCoM: a detailed GPU core model for accurate analytical modeling of modern GPUs

ISCA '22: Proceedings of the 49th Annual International Symposium on Computer Architecture, Pages 424–436, https://doi.org/10.1145/3470496.3527384

Analytical models can greatly help computer architects perform orders of magnitude faster early-stage design space exploration than using cycle-level simulators. To facilitate rapid design space exploration for graphics processing units (GPUs), prior ...

research-article

Development of a hardware-accelerated simulation kernel for ultra-high vacuum with Nvidia RTX GPUs

International Journal of High Performance Computing Applications (SAGE-HPCA), Volume 36, Issue 2, Pages 141–152, https://doi.org/10.1177/10943420211056654

Molflow+ is a Monte Carlo (MC) simulation software for ultra-high vacuum, mainly used to simulate pressure in particle accelerators. In this article, we present and discuss the design choices arising in a new implementation of its ray-tracing–based ...

survey

Query Processing on Heterogeneous CPU/GPU Systems

ACM Computing Surveys (CSUR), Volume 55, Issue 1, Article No.: 11, Pages 1–38, https://doi.org/10.1145/3485126

Due to their high computational power and internal memory bandwidth, graphics processing units (GPUs) have been extensively studied by the database systems research community. A heterogeneous query processing system that employs CPUs and GPUs at the same ...

research-article

Optimisation of plagiarism detection using vector space model on CUDA architecture

International Journal of Innovative Computing and Applications (IJICA), Volume 13, Issue 4, Pages 232–244, https://doi.org/10.1504/ijica.2022.125675

Plagiarism is a rapidly rising issue among students during submission of assignments, reports and publications in universities and educational institutions, due to easy accessibility of abundant e-resources on the internet. Existing tools become ...

research-article

Open Access

Multi‐stream adaptive spatial‐temporal attention graph convolutional network for skeleton‐based action recognition

IET Computer Vision (CVI2), Volume 16, Issue 2, Pages 143–158, https://doi.org/10.1049/cvi2.12075

Skeleton‐based action recognition algorithms have been widely applied to human action recognition. Graph convolutional networks (GCNs) generalize convolutional neural networks (CNNs) to non‐Euclidean graphs and achieve significant performance in ...

research-article

Pointer-Based Divergence Analysis for OpenCL 2.0 Programs

ACM Transactions on Parallel Computing (TOPC), Volume 8, Issue 4, Article No.: 20, Pages 1–23, https://doi.org/10.1145/3470644

A modern GPU is designed with many large thread groups to achieve a high throughput and performance. Within these groups, the threads are grouped into fixed-size SIMD batches in which the same instruction is applied to vectors of data in a lockstep. This ...

research-article

Dual-side sparse tensor core

ISCA '21: Proceedings of the 48th Annual International Symposium on Computer Architecture, Pages 1083–1095, https://doi.org/10.1109/ISCA52012.2021.00088

Leveraging sparsity in deep neural network (DNN) models is promising for accelerating model inference. Yet existing GPUs can only leverage the sparsity from weights but not activations, which are dynamic, unpredictable, and hence challenging to exploit. ...

research-article

Public Access

Exploring AMD GPU Scheduling Details by Experimenting With “Worst Practices”

RTNS '21: Proceedings of the 29th International Conference on Real-Time Networks and Systems, Pages 24–34, https://doi.org/10.1145/3453417.3453432

Graphics processing units (GPUs) have been the target of a significant body of recent real-time research, but research is often hampered by the “black box” nature of GPU hardware and software. Now that one GPU manufacturer, AMD, has embraced an open-...

research-article

Implicit Hari–Zimmermann algorithm for the generalized SVD on the GPUs

International Journal of High Performance Computing Applications (SAGE-HPCA), Volume 35, Issue 2, Pages 170–205, https://doi.org/10.1177/1094342020972772

A parallel, blocked, one-sided Hari–Zimmermann algorithm for the generalized singular value decomposition (GSVD) of a real or a complex matrix pair (F, G) is here proposed, where F and G have the same number of columns, and are both of the ...

research-article

Sparse GPU kernels for deep learning

SC '20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Article No.: 17, Pages 1–14

Scientific workloads have traditionally exploited high levels of sparsity to accelerate computation and reduce memory requirements. While deep neural networks can be made sparse, achieving practical speedups on GPUs is difficult because these ...

research-article

Public Access

Model-Based Warp Overlapped Tiling for Image Processing Programs on GPUs

PACT '20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, Pages 317–328, https://doi.org/10.1145/3410463.3414649

Domain-specific languages that execute image processing pipelines on GPUs, such as Halide and Forma, operate by 1) dividing the image into overlapped tiles, and 2) fusing loops to improve memory locality. However, current approaches have limitations: 1) ...

research-article

ScoRD: a scoped race detector for GPUs

ISCA '20: Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture, Pages 1036–1049, https://doi.org/10.1109/ISCA45697.2020.00088

GPUs have emerged as a key computing platform for an ever-growing range of applications. Unlike traditional bulk-synchronous GPU programs, many emerging GPU-accelerated applications, such as graph processing, have irregular interaction among the ...
