- poster, November 2024
Quantifying the Direct Overhead of Virtual Function Calls on Massively Parallel Architectures
PACT '19: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, Pages 496–497, https://doi.org/10.1109/PACT.2019.00063
Programmable accelerators aim to provide the flexibility of traditional CPUs, with greatly improved performance and energy-efficiency. Arguably, the greatest impediment to the widespread adoption of programmable accelerators, like GPUs, is the software ...
- poster, November 2024
Exploiting Multi-Level Task Dependencies to Prune Redundant Work in Relax-Ordered Task-Parallel Algorithms
PACT '19: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, Pages 494–495, https://doi.org/10.1109/PACT.2019.00062
Work-efficient task-parallel algorithms enforce ordering between tasks using queuing primitives. Such algorithms offer limited parallelism due to queuing constraints that result in data movement and synchronization bottlenecks. Speculatively relaxing ...
- poster, November 2024
A Collaborative Multi-factor Scheduler for Asymmetric Multicore Processors
PACT '19: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, Pages 486–487, https://doi.org/10.1109/PACT.2019.00058
Asymmetric multicore processors (AMP) are necessary for extracting performance in an era of limited power budget and dark silicon. We have efficient symmetric schedulers, efficient asymmetric schedulers for single-threaded workloads, and efficient ...
- poster, November 2024
CogR: Exploiting Program Structures for Machine-Learning Based Runtime Solutions
PACT '19: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, Pages 484–485, https://doi.org/10.1109/PACT.2019.00057
We propose CogR, a machine-learning based runtime solution, that enables efficient and dynamic resource scheduling and performance optimization for high-level programming interfaces on heterogeneous systems. CogR tightly combines the structural ...
- poster, November 2024
Automatic Parallelization Targeting Asynchronous Task-Based Runtimes
PACT '19: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, Pages 464–465, https://doi.org/10.1109/PACT.2019.00047
In a post-Moore world, asynchronous task-based parallelism has become a popular paradigm for parallel programming. Auto-parallelizing compilers are also an active area of research, promising improved developer productivity and application performance. ...
- research-article, November 2024
A Methodology for Characterizing Sparse Datasets and Its Application to SIMD Performance Prediction
PACT '19: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, Pages 444–455, https://doi.org/10.1109/PACT.2019.00042
Irregular computations are commonly seen in many scientific and engineering domains that use unstructured meshes or sparse matrices. The performance of an irregular application is very dependent upon the dataset. This paper poses the following question: "...
- research-article, November 2024
Accelerating DCA++ (Dynamical Cluster Approximation) Scientific Application on the Summit supercomputer
- Giovanni Balduzzi,
- Arghya Chatterjee,
- Ying Wai Li,
- Peter W. Doak,
- Urs Haehner,
- Ed F. D'Azevedo,
- Thomas A. Maier,
- Thomas Schulthess
PACT '19: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, Pages 432–443, https://doi.org/10.1109/PACT.2019.00041
Optimizing scientific applications on today's accelerator-based high performance computing systems can be challenging, especially when multiple GPUs and CPUs with heterogeneous memories and persistent non-volatile memories are present. An example is ...
- research-article, November 2024
Generating Portable High-Performance Code via Multi-Dimensional Homomorphisms
PACT '19: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, Pages 353–368, https://doi.org/10.1109/PACT.2019.00035
We address a key challenge in programming high-performance applications: achieving portable performance, i.e., the same source code achieves a consistent, high level of performance over the variety of modern parallel processors, including multi-core CPU ...
EDGE: Event-Driven GPU Execution
PACT '19: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, Pages 336–352, https://doi.org/10.1109/PACT.2019.00034
GPUs are known to benefit structured applications with ample parallelism, such as deep learning in a datacenter. Recently, GPUs have shown promise for irregular streaming network tasks. However, the GPU's co-processor dependence on a CPU for task ...
- research-article, November 2024
Adaptive Task Aggregation for High-Performance Sparse Solvers on GPUs
PACT '19: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, Pages 323–335, https://doi.org/10.1109/PACT.2019.00033
Sparse solvers are heavily used in computational fluid dynamics (CFD), computer-aided design (CAD), and other important application domains. These solvers remain challenging to execute on massively parallel architectures, due to the sequential ...
- research-article, November 2024
Analyzing and Leveraging Remote-core Bandwidth for Enhanced Performance in GPUs
PACT '19: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, Pages 257–270, https://doi.org/10.1109/PACT.2019.00028
Bandwidth achieved from local/shared caches and memory is a major performance determinant in Graphics Processing Units (GPUs). These existing sources of bandwidth are often not enough for optimal GPU performance. Therefore, to enhance the performance ...
- research-article, November 2024
Achieving scalability in a k-NN multi-GPU network service with Centaur
PACT '19: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, Pages 244–256, https://doi.org/10.1109/PACT.2019.00027
Centaur is a GPU-centric architecture for building a low-latency approximate k-Nearest-Neighbors network server. We implement a multi-GPU distributed data flow runtime which enables efficient and scalable network request processing on GPUs. The runtime ...
- research-article, November 2024
Unfair Scheduling Patterns in NUMA Architectures
PACT '19: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, Pages 205–218, https://doi.org/10.1109/PACT.2019.00024
Lock-free algorithms are typically designed and analyzed with adversarial scheduling in mind. However, on real hardware, lock-free algorithms perform much better than the adversarial assumption predicts, suggesting that adversarial scheduling is ...
- research-article, November 2024
Forgive-TM: Supporting Lazy Conflict Detection In Eager Hardware Transactional Memory
PACT '19: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, Pages 192–204, https://doi.org/10.1109/PACT.2019.00023
Commercial hardware transactional memory (TM) systems commonly use coherence messages to detect data conflicts. When a core inside a transaction receives a coherence request for data, it uses this information to determine whether there was a data ...
- research-article, November 2024
Fast Parallel Equivalence Relations in a Datalog Compiler
PACT '19: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, Pages 82–96, https://doi.org/10.1109/PACT.2019.00015
Modern parallelizing Datalog compilers are employed in industrial applications such as networking and static program analysis. These applications regularly reason about equivalences, e.g., computing bitcoin user groups, fast points-to analyses, and ...
BOLT: Optimizing OpenMP Parallel Regions with User-Level Threads
PACT '19: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, Pages 29–42, https://doi.org/10.1109/PACT.2019.00011
OpenMP is widely used by a number of applications, computational libraries, and runtime systems. As a result, multiple levels of the software stack use OpenMP independently of one another, often leading to nested parallel regions. Although exploiting ...
- research-article, November 2024
Gluon-Async: A Bulk-Asynchronous System for Distributed and Heterogeneous Graph Analytics
- Roshan Dathathri,
- Gurbinder Gill,
- Loc Hoang,
- Vishwesh Jatala,
- Keshav Pingali,
- V. Krishna Nandivada,
- Hoang-Vu Dang,
- Marc Snir
PACT '19: Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, Pages 15–28, https://doi.org/10.1109/PACT.2019.00010
Distributed graph analytics systems for CPUs, like D-Galois and Gemini, and for GPUs, like D-IrGL and Lux, use a bulk-synchronous parallel (BSP) programming and execution model. BSP permits bulk-communication and uses large messages which are supported ...