- Research article, June 2015
A Nested Partitioning Algorithm for Adaptive Meshes on Heterogeneous Clusters
ICS '15: Proceedings of the 29th ACM on International Conference on Supercomputing, pages 319–328. https://doi.org/10.1145/2751205.2751246
In the era of the accelerator, load balancing strategies that are well-understood for traditional homogeneous supercomputers must be re-worked in order to address the problem of distributing work across heterogeneous hardware such that neither the CPU ...
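The paper's nested partitioning algorithm is more involved, but the underlying problem is easy to sketch. The following minimal illustration (all names hypothetical, not taken from the paper) splits a range of mesh cells across devices in proportion to each device's measured throughput, so that neither the CPU nor the accelerator is left idle:

```cpp
#include <cstddef>
#include <vector>

struct Partition { std::size_t begin, end; };

// Hypothetical sketch: assign contiguous cell ranges to devices in
// proportion to their measured throughput (cells per second).
std::vector<Partition> partitionByThroughput(std::size_t nCells,
                                             const std::vector<double>& throughput) {
    double total = 0.0;
    for (double t : throughput) total += t;

    std::vector<Partition> parts;
    std::size_t start = 0;
    for (std::size_t i = 0; i < throughput.size(); ++i) {
        // The last device takes the remainder to avoid rounding gaps.
        std::size_t count = (i + 1 == throughput.size())
            ? nCells - start
            : static_cast<std::size_t>(nCells * (throughput[i] / total));
        parts.push_back({start, start + count});
        start += count;
    }
    return parts;
}
```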
- Research article, June 2015
Automatic Energy Efficient Parallelization of Uniform Dependence Computations
ICS '15: Proceedings of the 29th ACM on International Conference on Supercomputing, pages 373–382. https://doi.org/10.1145/2751205.2751245
Energy is now a critical concern in all aspects of computing. We address a class of programs that includes the so-called "stencil computations" that have already been optimized for speed. We target the energy expended in dynamic memory accesses, since ...
- Research article, June 2015
PeerWave: Exploiting Wavefront Parallelism on GPUs with Peer-SM Synchronization
ICS '15: Proceedings of the 29th ACM on International Conference on Supercomputing, pages 25–35. https://doi.org/10.1145/2751205.2751243
Nested loops with regular iteration dependencies span a large class of applications ranging from string matching to linear system solvers. Wavefront parallelism is a well-known technique to enable concurrent processing of such applications and is widely ...
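Wavefront parallelism itself is straightforward to sketch. Assuming a toy 2D update where each cell depends only on its left and upper neighbors (this shows the classic technique, not PeerWave's peer-SM scheme), all cells on one anti-diagonal are independent and can be processed concurrently:

```cpp
#include <algorithm>
#include <vector>

// Illustrative wavefront traversal: cells with the same anti-diagonal
// index d = i + j have no mutual dependencies, so each diagonal is a
// parallel step; the implicit barrier after the omp for separates
// successive diagonals. Compile with -fopenmp.
void wavefront(std::vector<std::vector<double>>& a, int n) {
    for (int d = 2; d <= 2 * (n - 1); ++d) {   // sweep diagonals in order
        int lo = std::max(1, d - (n - 1));
        int hi = std::min(n - 1, d - 1);
        #pragma omp parallel for
        for (int i = lo; i <= hi; ++i) {
            int j = d - i;
            a[i][j] = 0.5 * (a[i - 1][j] + a[i][j - 1]);  // toy update
        }
    }
}
```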
- Research article, June 2015
Composing Algorithmic Skeletons to Express High-Performance Scientific Applications
ICS '15: Proceedings of the 29th ACM on International Conference on Supercomputing, pages 415–424. https://doi.org/10.1145/2751205.2751241
Algorithmic skeletons are high-level representations for parallel programs that hide the underlying parallelism details from program specification. These skeletons are defined in terms of higher-order functions that can be composed to build larger ...
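As a loose illustration of the idea (not the paper's skeleton library), map and reduce skeletons can be written as higher-order functions and composed into a larger computation, with the iteration strategy hidden behind the skeleton:

```cpp
#include <iostream>
#include <numeric>
#include <vector>

// Hypothetical skeletons: the caller specifies only the per-element and
// combining functions; how the traversal is executed stays encapsulated.
template <typename T, typename F>
std::vector<T> mapSkel(const std::vector<T>& in, F f) {
    std::vector<T> out;
    out.reserve(in.size());
    for (const T& x : in) out.push_back(f(x));
    return out;
}

template <typename T, typename F>
T reduceSkel(const std::vector<T>& in, T init, F f) {
    return std::accumulate(in.begin(), in.end(), init, f);
}

int main() {
    std::vector<double> v{1, 2, 3, 4};
    // Composition: sum of squares, expressed as reduce(map(...)).
    double s = reduceSkel(mapSkel(v, [](double x) { return x * x; }),
                          0.0, [](double a, double b) { return a + b; });
    std::cout << s << "\n";  // prints 30
}
```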
- Research article, June 2015
DaCache: Memory Divergence-Aware GPU Cache Management
ICS '15: Proceedings of the 29th ACM on International Conference on Supercomputing, pages 89–98. https://doi.org/10.1145/2751205.2751239
The lock-step execution model of GPUs requires a warp to have the data blocks for all its threads before execution. However, there is a lack of salient cache mechanisms that can recognize the need to manage GPU cache blocks at the warp level for ...
- Research article, June 2015
Unique Worker model for OpenMP
ICS '15: Proceedings of the 29th ACM on International Conference on Supercomputing, pages 47–56. https://doi.org/10.1145/2751205.2751238
In OpenMP, because of the underlying efficient 'team of workers' model, each worker is given a chunk of tasks (iterations of a parallel-for-loop, or sections in a parallel-sections block), and a barrier construct is used to synchronize the workers (not ...
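For context, the conventional 'team of workers' model the abstract refers to looks roughly like the standard OpenMP loop below (this shows the baseline model, not the paper's Unique Worker variant):

```cpp
#include <cstdio>
#include <omp.h>  // compile with -fopenmp

int main() {
    const int n = 16;
    #pragma omp parallel
    {
        // Each worker in the team receives a chunk of the iterations.
        #pragma omp for schedule(static)
        for (int i = 0; i < n; ++i)
            std::printf("iter %d on worker %d\n", i, omp_get_thread_num());
        // Implicit barrier here: no worker proceeds until all chunks finish.
    }
}
```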
- Research article, June 2015
Criticality-Aware Dynamic Task Scheduling for Heterogeneous Architectures
ICS '15: Proceedings of the 29th ACM on International Conference on Supercomputing, pages 329–338. https://doi.org/10.1145/2751205.2751235
Current and future parallel programming models need to be portable and efficient when moving to heterogeneous multi-core systems. OmpSs is a task-based programming model with dependency tracking and dynamic scheduling. This paper describes the OmpSs ...
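OmpSs uses its own in/out clauses, but the dependency-tracking idea can be approximated with standard OpenMP 4.0 `depend` clauses; the minimal sketch below is an illustration of that style, not the paper's scheduler:

```cpp
#include <cstdio>

// The runtime tracks data dependencies between tasks and dynamically
// schedules whichever tasks are ready. Compile with -fopenmp.
int main() {
    int a = 0, b = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a)
        a = 1;                      // producer task

        #pragma omp task depend(in: a) depend(out: b)
        b = a + 1;                  // runs only after 'a' has been written

        #pragma omp taskwait
        std::printf("b = %d\n", b); // prints 2
    }
}
```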
- Research article, June 2015
STAPL-RTS: An Application Driven Runtime System
ICS '15: Proceedings of the 29th ACM on International Conference on Supercomputing, pages 425–434. https://doi.org/10.1145/2751205.2751233
Modern HPC systems are growing in complexity, as they move towards deeper memory hierarchies and increasing use of computational heterogeneity via GPUs or other accelerators. When developing applications for these platforms, programmers are faced with ...
- Research article, June 2015
Fine-Grained Synchronizations and Dataflow Programming on GPUs
ICS '15: Proceedings of the 29th ACM on International Conference on Supercomputing, pages 109–118. https://doi.org/10.1145/2751205.2751232
The last decade has witnessed the rapid emergence of many-core platforms, especially graphics processing units (GPUs). With the exponential growth of cores in GPUs, utilizing them efficiently becomes a challenge. The data-parallel programming ...
- Research article, June 2015
Parameterized Diamond Tiling for Stencil Computations with Chapel parallel iterators
- Ian J. Bertolacci,
- Catherine Olschanowsky,
- Ben Harshbarger,
- Bradford L. Chamberlain,
- David G. Wonnacott,
- Michelle Mills Strout
ICS '15: Proceedings of the 29th ACM on International Conference on Supercomputing, pages 197–206. https://doi.org/10.1145/2751205.2751226
Stencil computations figure prominently in the core kernels of many scientific computations, such as partial differential equation solvers. Parallel scaling of stencil computations can be significantly improved on multicore processors using advanced ...
- Research article, June 2015
PaCMap: Topology Mapping of Unstructured Communication Patterns onto Non-contiguous Allocations
ICS '15: Proceedings of the 29th ACM on International Conference on Supercomputing, pages 37–46. https://doi.org/10.1145/2751205.2751225
In high performance computing (HPC), applications usually have many parallel tasks running on multiple machine nodes. As these tasks intensively communicate with each other, the communication overhead has a significant impact on an application's ...
- Research article, June 2015
Active Access: A Mechanism for High-Performance Distributed Data-Centric Computations
ICS '15: Proceedings of the 29th ACM on International Conference on Supercomputing, pages 155–164. https://doi.org/10.1145/2751205.2751219
Remote memory access (RMA) is an emerging high-performance programming model that uses RDMA hardware directly. Yet, accessing remote memories cannot invoke activities at the target, which complicates implementation and limits performance of data-centric ...
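The limitation the abstract describes is visible in plain MPI one-sided communication, sketched below for illustration (run with two ranks): MPI_Put deposits data into remote memory through RDMA-style hardware, but no code executes at the target.

```cpp
#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Each rank exposes one int through an RMA window.
    int buf = 0;
    MPI_Win win;
    MPI_Win_create(&buf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        int val = 42;
        // Write into rank 1's window; rank 1's CPU is not involved and
        // cannot react to the access (no "activity at the target").
        MPI_Put(&val, 1, MPI_INT, /*target=*/1, /*disp=*/0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);

    if (rank == 1) std::printf("received %d\n", buf);
    MPI_Win_free(&win);
    MPI_Finalize();
}
```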
- Research article, June 2015
Automatic Parallelization of Kernels in Shared-Memory Multi-GPU Nodes
ICS '15: Proceedings of the 29th ACM on International Conference on Supercomputing, pages 3–13. https://doi.org/10.1145/2751205.2751218
In this paper we present AMGE, a programming framework and runtime system that transparently decomposes GPU kernels and executes them on multiple GPUs in parallel. AMGE exploits the remote memory access capability in modern GPUs to ensure that data can ...
- Research article, June 2015
FAST: A Fast Stencil Autotuning Framework Based On An Optimal-solution Space Model
ICS '15: Proceedings of the 29th ACM on International Conference on Supercomputing, pages 187–196. https://doi.org/10.1145/2751205.2751214
Stencil computations comprise an important class of kernels in many scientific computing applications. As the diversity of both architectures and programming models grows, autotuning is emerging as a critical strategy for achieving portable performance ...
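FAST's contribution is the model that prunes the search; the naive baseline such autotuners improve on is easy to sketch: time the kernel once per candidate tile size and keep the fastest (the kernel callable and candidate list here are hypothetical):

```cpp
#include <chrono>
#include <functional>
#include <vector>

// Exhaustive-search baseline: run the (assumed) tile-parameterized kernel
// for each candidate and return the tile size with the lowest runtime.
int bruteForceTune(const std::vector<int>& candidates,
                   const std::function<void(int)>& kernel) {
    int best = candidates.front();
    double bestMs = 1e30;
    for (int tile : candidates) {
        auto t0 = std::chrono::steady_clock::now();
        kernel(tile);                              // one timed trial run
        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        if (ms < bestMs) { bestMs = ms; best = tile; }
    }
    return best;
}
```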
- Research article, June 2015
SemCache++: Semantics-Aware Caching for Efficient Multi-GPU Offloading
ICS '15: Proceedings of the 29th ACM on International Conference on Supercomputing, pages 79–88. https://doi.org/10.1145/2751205.2751210
Offloading computations to multiple GPUs is not an easy task. It requires decomposing data, distributing computations and handling communication manually. GPU drop-in libraries (which require no program rewrite) have made it easy to offload computations ...
- Invited talk, June 2015
Streaming Task Parallelism
ICS '15: Proceedings of the 29th ACM on International Conference on Supercomputing, page 1. https://doi.org/10.1145/2751205.2751208
Stream computing is often associated with regular, data-intensive applications, and more specifically with the family of cyclo-static data-flow models. The term also refers to bulk-synchronous data parallelism on SIMD architectures. Both interpretations ...