Parallel computing methodologies

14 Results for: Book/Issue: DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference

research-article
November 2024
SARIS: Accelerating Stencil Computations on Energy-Efficient RISC-V Compute Clusters with Indirect Stream Registers
DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference, Article No.: 251, Pages 1–6, https://doi.org/10.1145/3649329.3658494

Stencil codes are performance-critical in many compute-intensive applications, but suffer from significant address calculation and irregular memory access overheads. This work presents SARIS, a general and highly flexible methodology for stencil ...
Total Citations: 0, Total Downloads: 74, Last 12 Months: 74, Last 6 weeks: 32

research-article
November 2024
Control Flow Divergence Optimization by Exploiting Tensor Cores
- Weiguang Pang,
- Xu Jiang,
- Songran Liu,
- Lei Qiao,
- Kexue Fu,
- Longxiang Gao,
- Wang Yi
DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference, Article No.: 216, Pages 1–6, https://doi.org/10.1145/3649329.3658462

Kernels are scheduled on Graphics Processing Units (GPUs) in the granularity of GPU warp, which is a bunch of threads that must be scheduled together. When executing kernels with conditional branches, the threads within a warp may execute different ...
Total Citations: 0, Total Downloads: 82, Last 12 Months: 82, Last 6 weeks: 36

research-article
November 2024
A Software-Hardware Co-design Solution for 3D Inner Structure Reconstruction
DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference, Article No.: 281, Pages 1–6, https://doi.org/10.1145/3649329.3656522

Volume imaging (3D model with inner structure) is widely applied to various areas, such as medical diagnosis and archaeology. Especially during the COVID-19 pandemic, there is a great demand for lung CT. However, it is quite time-consuming to generate a ...
Total Citations: 0, Total Downloads: 41, Last 12 Months: 41, Last 6 weeks: 23

research-article
November 2024
A Combined Content Addressable Memory and In-Memory Processing Approach for k-Clique Counting Acceleration
- Xidi Ma,
- Weichen Zhang,
- Xueyan Wang,
- Tianyang Yu,
- Bi Wu,
- Gang Qu,
- Weisheng Zhao
DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference, Article No.: 257, Pages 1–6, https://doi.org/10.1145/3649329.3656513

k-Clique counting problem plays an important role in graph mining which has seen a growing number of applications. However, current k-Clique counting accelerators cannot meet the performance requirement mainly because they struggle with high data ...
Total Citations: 0, Total Downloads: 68, Last 12 Months: 68, Last 6 weeks: 39

research-article
November 2024
PT-Map: Efficient Program Transformation Optimization for CGRA Mapping
DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference, Article No.: 14, Pages 1–6, https://doi.org/10.1145/3649329.3656257

Coarse-Grained Reconfigurable Array (CGRA) is a parallel architecture providing high energy efficiency and spatial-temporal re-configurability. Beyond loop scheduling for throughput optimization, program transformation is also crucial in CGRA mapping to ...
Total Citations: 0, Total Downloads: 86, Last 12 Months: 86, Last 6 weeks: 29

research-article
November 2024
G-kway: Multilevel GPU-Accelerated k-way Graph Partitioner
DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference, Article No.: 105, Pages 1–6, https://doi.org/10.1145/3649329.3656238

Graph partitioning is important for the design of many CAD algorithms. However, as the graph size continues to grow, graph partitioning becomes increasingly time-consuming. To overcome these challenges, we propose G-kway, an efficient multilevel GPU-...
Total Citations: 3, Total Downloads: 69, Last 12 Months: 69, Last 6 weeks: 37

research-article
Open Access
November 2024
HiLight: A Comprehensive Framework for High-Performance and Lightweight Scalability in Surface Code Communication
DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference, Article No.: 15, Pages 1–6, https://doi.org/10.1145/3649329.3655980

In pursuing fault-tolerant quantum computing (FTQC), the surface code (SC) serves as a key quantum error correction protocol. The double-defect mode of the SC enables long-range two-qubit communication via braiding. However, intersecting braiding paths ...
Total Citations: 0, Total Downloads: 62, Last 12 Months: 62, Last 6 weeks: 28

research-article
Open Access
November 2024
Partitioned Scheduling and Parallelism Assignment for Real-Time DNN Inference Tasks on Multi-TPU
DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference, Article No.: 333, Pages 1–6, https://doi.org/10.1145/3649329.3655979

Pipelining on Edge Tensor Processing Units (TPUs) optimizes the deep neural network (DNN) inference by breaking it down into multiple stages processed concurrently on multiple accelerators. Such DNN inference tasks can be modeled as sporadic non-...
Total Citations: 0, Total Downloads: 107, Last 12 Months: 107, Last 6 weeks: 59

research-article
Open Access
November 2024
PHD: Parallel Huffman Decoder on FPGA for Extreme Performance and Energy Efficiency
DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference, Article No.: 6, Pages 1–6, https://doi.org/10.1145/3649329.3655967

Huffman decoding is crucial in data compression, and the self-synchronization-based parallel decoding algorithm enables subsequence-level parallelism. This paper introduces PHD, the first accelerator designed for self-synchronization-based parallel ...
Total Citations: 0, Total Downloads: 137, Last 12 Months: 137, Last 6 weeks: 80

research-article
Open Access
November 2024
MoNDE: Mixture of Near-Data Experts for Large-Scale Sparse Models
DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference, Article No.: 221, Pages 1–6, https://doi.org/10.1145/3649329.3655951

Mixture-of-Experts (MoE) large language models (LLM) have memory requirements that often exceed the GPU memory capacity, requiring costly parameter movement from secondary memories to the GPU for expert computation. In this work, we present Mixture of ...
Total Citations: 0, Total Downloads: 148, Last 12 Months: 148, Last 6 weeks: 95

research-article
November 2024
SpaHet: A Software/Hardware Co-design for Accelerating Heterogeneous-Sparsity based Sparse Matrix Multiplication
- Haoqin Huang,
- Pengcheng Yao,
- Zhaozeng An,
- Yufei Sun,
- Ao Hu,
- Peng Xu,
- Long Zheng,
- Xiaofei Liao,
- Hai Jin
DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference, Article No.: 202, Pages 1–6, https://doi.org/10.1145/3649329.3655944

Sparse general matrix-matrix multiplication is widely used in data mining applications. Its irregular memory access patterns limit the performance of general-purpose processors, thus motivating many FPGA-based hardware innovations in recent years. ...
Total Citations: 0, Total Downloads: 83, Last 12 Months: 83, Last 6 weeks: 40

research-article
November 2024
SMILE: LLC-based Shared Memory Expansion to Improve GPU Thread Level Parallelism
DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference, Article No.: 45, Pages 1–6, https://doi.org/10.1145/3649329.3655906

While designed for massive parallelism, GPUs are frequently suffering from low thread occupancy and limited data throughput, which are typically attributed to constrained on-chip resources, such as shared memory and register file. To alleviate the ...
Total Citations: 0, Total Downloads: 84, Last 12 Months: 84, Last 6 weeks: 51

research-article
Open Access
November 2024
Dyn-Bitpool: A Two-sided Sparse CIM Accelerator Featuring a Balanced Workload Scheme and High CIM Macro Utilization
DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference, Article No.: 35, Pages 1–6, https://doi.org/10.1145/3649329.3655690

Computing-in-memory (CIM), a promising computing paradigm, has demonstrated great energy-efficiency by integrating computing units into memory. However, previous research on CIM has rarely utilized sparsity in activation and weight concurrently. Moreover,...
Total Citations: 0, Total Downloads: 99, Last 12 Months: 99, Last 6 weeks: 69

research-article
Open Access
November 2024
GSPO: A Graph Substitution and Parallelization Joint Optimization Framework for DNN Inference
- Zheng Xu,
- Xu Dai,
- Shaojun Wei,
- Shouyi Yin,
- Yang Hu
DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference, Article No.: 214, Pages 1–6, https://doi.org/10.1145/3649329.3655683

This work proposes GSPO, an automatic unified framework that jointly applies graph substitution and parallelization for DNN inference. GSPO uses a joint optimization computation graph (JOCG) to represent graph substitution and parallelization at the ...
Total Citations: 0, Total Downloads: 84, Last 12 Months: 84, Last 6 weeks: 47
