- research-article, November 2024
SARIS: Accelerating Stencil Computations on Energy-Efficient RISC-V Compute Clusters with Indirect Stream Registers
DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference, Article No.: 251, Pages 1–6, https://doi.org/10.1145/3649329.3658494
Stencil codes are performance-critical in many compute-intensive applications, but suffer from significant address calculation and irregular memory access overheads. This work presents SARIS, a general and highly flexible methodology for stencil ...
- research-article, November 2024
Control Flow Divergence Optimization by Exploiting Tensor Cores
DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference, Article No.: 216, Pages 1–6, https://doi.org/10.1145/3649329.3658462
Kernels are scheduled on Graphics Processing Units (GPUs) in the granularity of GPU warp, which is a bunch of threads that must be scheduled together. When executing kernels with conditional branches, the threads within a warp may execute different ...
- research-article, November 2024
A Software-Hardware Co-design Solution for 3D Inner Structure Reconstruction
DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference, Article No.: 281, Pages 1–6, https://doi.org/10.1145/3649329.3656522
Volume imaging (3D model with inner structure) is widely applied to various areas, such as medical diagnosis and archaeology. Especially during the COVID-19 pandemic, there is a great demand for lung CT. However, it is quite time-consuming to generate a ...
- research-article, November 2024
A Combined Content Addressable Memory and In-Memory Processing Approach for k-Clique Counting Acceleration
DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference, Article No.: 257, Pages 1–6, https://doi.org/10.1145/3649329.3656513
k-Clique counting problem plays an important role in graph mining which has seen a growing number of applications. However, current k-Clique counting accelerators cannot meet the performance requirement mainly because they struggle with high data ...
- research-article, November 2024
PT-Map: Efficient Program Transformation Optimization for CGRA Mapping
DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference, Article No.: 14, Pages 1–6, https://doi.org/10.1145/3649329.3656257
Coarse-Grained Reconfigurable Array (CGRA) is a parallel architecture providing high energy efficiency and spatial-temporal re-configurability. Beyond loop scheduling for throughput optimization, program transformation is also crucial in CGRA mapping to ...
- research-article, November 2024
G-kway: Multilevel GPU-Accelerated k-way Graph Partitioner
DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference, Article No.: 105, Pages 1–6, https://doi.org/10.1145/3649329.3656238
Graph partitioning is important for the design of many CAD algorithms. However, as the graph size continues to grow, graph partitioning becomes increasingly time-consuming. To overcome these challenges, we propose G-kway, an efficient multilevel GPU-...
- research-article, November 2024
HiLight: A Comprehensive Framework for High-Performance and Lightweight Scalability in Surface Code Communication
DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference, Article No.: 15, Pages 1–6, https://doi.org/10.1145/3649329.3655980
In pursuing fault-tolerant quantum computing (FTQC), the surface code (SC) serves as a key quantum error correction protocol. The double-defect mode of the SC enables long-range two-qubit communication via braiding. However, intersecting braiding paths ...
- research-article, November 2024
Partitioned Scheduling and Parallelism Assignment for Real-Time DNN Inference Tasks on Multi-TPU
DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference, Article No.: 333, Pages 1–6, https://doi.org/10.1145/3649329.3655979
Pipelining on Edge Tensor Processing Units (TPUs) optimizes the deep neural network (DNN) inference by breaking it down into multiple stages processed concurrently on multiple accelerators. Such DNN inference tasks can be modeled as sporadic non-...
- research-article, November 2024
PHD: Parallel Huffman Decoder on FPGA for Extreme Performance and Energy Efficiency
DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference, Article No.: 6, Pages 1–6, https://doi.org/10.1145/3649329.3655967
Huffman decoding is crucial in data compression, and the self-synchronization-based parallel decoding algorithm enables subsequence-level parallelism. This paper introduces PHD, the first accelerator designed for self-synchronization-based parallel ...
- research-article, November 2024
MoNDE: Mixture of Near-Data Experts for Large-Scale Sparse Models
DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference, Article No.: 221, Pages 1–6, https://doi.org/10.1145/3649329.3655951
Mixture-of-Experts (MoE) large language models (LLM) have memory requirements that often exceed the GPU memory capacity, requiring costly parameter movement from secondary memories to the GPU for expert computation. In this work, we present Mixture of ...
- research-article, November 2024
SpaHet: A Software/Hardware Co-design for Accelerating Heterogeneous-Sparsity based Sparse Matrix Multiplication
DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference, Article No.: 202, Pages 1–6, https://doi.org/10.1145/3649329.3655944
Sparse general matrix-matrix multiplication is widely used in data mining applications. Its irregular memory access patterns limit the performance of general-purpose processors, thus motivating many FPGA-based hardware innovations in recent years. ...
- research-article, November 2024
SMILE: LLC-based Shared Memory Expansion to Improve GPU Thread Level Parallelism
DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference, Article No.: 45, Pages 1–6, https://doi.org/10.1145/3649329.3655906
While designed for massive parallelism, GPUs are frequently suffering from low thread occupancy and limited data throughput, which are typically attributed to constrained on-chip resources, such as shared memory and register file. To alleviate the ...
- research-article, November 2024
Dyn-Bitpool: A Two-sided Sparse CIM Accelerator Featuring a Balanced Workload Scheme and High CIM Macro Utilization
DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference, Article No.: 35, Pages 1–6, https://doi.org/10.1145/3649329.3655690
Computing-in-memory (CIM), a promising computing paradigm, has demonstrated great energy-efficiency by integrating computing units into memory. However, previous research on CIM has rarely utilized sparsity in activation and weight concurrently. Moreover,...
- research-article, November 2024
GSPO: A Graph Substitution and Parallelization Joint Optimization Framework for DNN Inference
DAC '24: Proceedings of the 61st ACM/IEEE Design Automation Conference, Article No.: 214, Pages 1–6, https://doi.org/10.1145/3649329.3655683
This work proposes GSPO, an automatic unified framework that jointly applies graph substitution and parallelization for DNN inference. GSPO uses a joint optimization computation graph (JOCG) to represent graph substitution and parallelization at the ...