Proceeding Downloads
Autotuning Wavefront Applications for Multicore Multi-GPU Hybrid Architectures
Manual tuning of applications for heterogeneous parallel systems is tedious and complex. Optimizations are often not portable, and the whole process must be repeated when moving to a new system, or sometimes even to a different problem size. Pattern-...
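For readers unfamiliar with the pattern: a wavefront computation sweeps a grid along anti-diagonals, since cells on the same anti-diagonal are mutually independent once the previous diagonal is done. A minimal OpenMP sketch of that structure (illustrative only, not the paper's framework; grid size and stencil are placeholders):

```c
/* Minimal wavefront sweep over an N x N grid: cells on the same
 * anti-diagonal have no mutual dependencies, so each diagonal can be
 * processed in parallel after the previous one completes. */
#include <stdio.h>

#define N 1024
static double grid[N][N];

int main(void) {
    for (int k = 0; k < N; k++)            /* seed boundary values */
        grid[0][k] = grid[k][0] = 1.0;

    /* Diagonal d runs from 2 (cell (1,1)) to 2*(N-1). */
    for (int d = 2; d <= 2 * (N - 1); d++) {
        int lo = d - (N - 1) > 1 ? d - (N - 1) : 1;
        int hi = d - 1 < N - 1 ? d - 1 : N - 1;
        #pragma omp parallel for
        for (int i = lo; i <= hi; i++) {
            int j = d - i;
            grid[i][j] = grid[i - 1][j] + grid[i][j - 1];  /* stencil update */
        }
    }
    printf("%f\n", grid[N - 1][N - 1]);
    return 0;
}
```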
Reduction Operations in Parallel Loops for GPGPUs
Manycore accelerators offer the potential to significantly improve the performance of scientific applications when compute-intensive portions of programs are offloaded to them. Directive-based programming models such as OpenACC and OpenMP are ...
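As an illustration of the directive style these models provide, a scalar sum reduction in an offloaded loop can look as follows (OpenACC shown; the OpenMP 4.x analogue is `#pragma omp target teams distribute parallel for reduction(+:sum)`; the array name and size are placeholders):

```c
/* A scalar sum reduction inside an offloaded parallel loop. The
 * reduction clause tells the compiler to combine per-thread partial
 * sums, which is the operation the paper's code generation targets. */
#include <stdio.h>

int main(void) {
    const int n = 1 << 20;
    static float a[1 << 20];
    for (int i = 0; i < n; i++) a[i] = 1.0f;

    float sum = 0.0f;
    #pragma acc parallel loop reduction(+:sum) copyin(a[0:n])
    for (int i = 0; i < n; i++)
        sum += a[i];

    printf("sum = %f\n", sum);   /* 1048576.0 */
    return 0;
}
```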
Self-Configuration and Self-Optimization Autonomic Skeletons using Events
This paper presents a novel way to introduce self-configuration and self-optimization autonomic characteristics to algorithmic skeletons using event-driven programming techniques. Based on an algorithmic skeleton language, we show that the use of events ...
Programming a Multicore Architecture without Coherency and Atomic Operations
It is hard to reason about the state of a multicore system-on-chip because memory operations take multiple cycles to complete: cores communicate via an interconnect such as a network-on-chip. To simplify programming, atomicity is required, by ...
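For context, the primitive such a platform lacks is the hardware-backed atomic read-modify-write that conventional locks rely on. The familiar C11 test-and-set spinlock below is exactly the kind of idiom that cannot be expressed without coherency and atomics, which is what motivates a software alternative:

```c
/* A standard C11 test-and-set spinlock. On hardware without coherent
 * memory and atomic read-modify-write, this idiom is unavailable and
 * mutual exclusion must be built another way (e.g., message passing
 * over the network-on-chip). */
#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

void acquire(void) {
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;   /* spin until the flag was previously clear */
}

void release(void) {
    atomic_flag_clear_explicit(&lock, memory_order_release);
}
```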
Vectorizing Unstructured Mesh Computations for Many-core Architectures
Achieving optimal performance on the latest multi-core and many-core architectures depends more and more on making efficient use of the hardware's vector processing capabilities. While auto-vectorizing compilers do not require the use of vector ...
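The main obstacle the title alludes to is indirect (gather) addressing through the mesh connectivity arrays. A minimal sketch of such a loop, with hypothetical array names, annotated for compiler vectorization via OpenMP `simd`:

```c
/* Indirect (gather) access is the core obstacle in vectorizing
 * unstructured-mesh loops: node data is reached through an index map.
 * "omp simd" asks the compiler to vectorize anyway; on AVX2/AVX-512
 * hardware the indexed load becomes a vector gather instruction. */
#include <stdio.h>

#define NEDGES 1000
#define NNODES 600

int main(void) {
    static int    edge_node[NEDGES];   /* mesh connectivity: edge -> node */
    static double node_val[NNODES], edge_flux[NEDGES];

    for (int e = 0; e < NEDGES; e++) edge_node[e] = (e * 7) % NNODES;
    for (int n = 0; n < NNODES; n++) node_val[n] = 1.0;

    #pragma omp simd
    for (int e = 0; e < NEDGES; e++)
        edge_flux[e] = 0.5 * node_val[edge_node[e]];   /* gather */

    printf("%f\n", edge_flux[0]);
    return 0;
}
```

Indirect writes (scatters) are harder still, since distinct edges may update the same node; that is where conflict detection or coloring schemes come in.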
Compiling Fresh Breeze Codelets
This paper presents the design of a compiler for parallel programs expressed as collections of codelets -- blocks of statements intended for distribution over the processing cores of a massively parallel computer system. Because interactions of codelets ...
A Framework for Multiplatform HPC Applications
This paper proposes a framework for building multi-platform applications in Java for High Performance Computing (HPC). It allows HPC developers to write their programs in Java while dynamically translating parts of the programs into C programs using MPI or ...
A Novel CPU-GPU Cooperative Implementation of A Parallel Two-List Algorithm for the Subset-Sum Problem
The subset-sum problem is a well-known NP-complete decision problem. Many parallel algorithms have been developed to solve the problem within a reasonable computation time, and some of them have been implemented on a GPU. However, the GPU ...
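The two-list algorithm referenced here is the classic Horowitz-Sahni meet-in-the-middle scheme. A sequential sketch (the weights and target below are made-up example data) conveys the structure the paper maps onto CPU and GPU:

```c
/* The two-list (Horowitz-Sahni) scheme: enumerate all subset sums of
 * each half of the input, sort both lists, then scan them in opposite
 * directions looking for a pair that sums to the target.
 * O(2^(n/2)) time instead of O(2^n). */
#include <stdio.h>
#include <stdlib.h>

static int cmp(const void *a, const void *b) {
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

/* Fill sums[0 .. 2^k - 1] with every subset sum of w[0..k-1]. */
static void enumerate(const long *w, int k, long *sums) {
    for (long m = 0; m < (1L << k); m++) {
        long s = 0;
        for (int i = 0; i < k; i++)
            if (m & (1L << i)) s += w[i];
        sums[m] = s;
    }
}

int main(void) {
    long w[] = {267, 493, 869, 961, 1000, 1153, 1246, 1598};
    int n = 8, h = n / 2;
    long target = 2381;                 /* = 267 + 961 + 1153 */

    long A[16], B[16];
    enumerate(w, h, A);                 /* sums of first half  */
    enumerate(w + h, n - h, B);         /* sums of second half */
    qsort(A, 1 << h, sizeof *A, cmp);
    qsort(B, 1 << (n - h), sizeof *B, cmp);

    /* Opposite-direction scan of the two sorted lists. */
    for (int i = 0, j = (1 << (n - h)) - 1; i < (1 << h) && j >= 0; ) {
        long s = A[i] + B[j];
        if (s == target) { printf("subset with sum %ld exists\n", target); return 0; }
        if (s < target) i++; else j--;
    }
    printf("no subset sums to %ld\n", target);
    return 0;
}
```

The generation and sorting of the two lists are the data-parallel phases that lend themselves to a GPU; the scan is where CPU-GPU cooperation becomes interesting.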
Dynamic Partitioning-based JPEG Decompression on Heterogeneous Multicore Architectures
With the emergence of social networks and improvements in computational photography, billions of JPEG images are shared and viewed on a daily basis. Desktops, tablets and smartphones constitute the vast majority of hardware platforms used for displaying ...
Fast Longest Common Subsequence with General Integer Scoring Support on GPUs
Graphics Processing Units (GPUs) have been gaining popularity among high-performance users. Certain classes of algorithms benefit greatly from the massive parallelism of GPUs. One such class of algorithms is longest common subsequence (LCS). Combined ...
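The underlying dynamic program, which the paper generalizes from unit match scores to arbitrary integer scoring, is the textbook LCS recurrence. A minimal CPU sketch for reference; GPU versions parallelize across the anti-diagonals of the table, whose cells are mutually independent:

```c
/* Textbook LCS dynamic program:
 *   L[i][j] = L[i-1][j-1] + 1              if x[i-1] == y[j-1]
 *           = max(L[i-1][j], L[i][j-1])    otherwise.
 * The paper replaces the unit match score with general integers. */
#include <stdio.h>
#include <string.h>

#define MAXN 128

int lcs(const char *x, const char *y) {
    int m = strlen(x), n = strlen(y);
    static int L[MAXN + 1][MAXN + 1];   /* row/column 0 stay zero */
    for (int i = 1; i <= m; i++)
        for (int j = 1; j <= n; j++)
            L[i][j] = (x[i - 1] == y[j - 1])
                    ? L[i - 1][j - 1] + 1
                    : (L[i - 1][j] > L[i][j - 1] ? L[i - 1][j] : L[i][j - 1]);
    return L[m][n];
}

int main(void) {
    printf("%d\n", lcs("GATTACA", "TACGTA"));   /* length of the LCS */
    return 0;
}
```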
Efficient Parallel Implementations of Multiple Sequence Alignment using BSP/CGM Model
Multiple sequence alignment is a fundamental tool in bioinformatics, widely used for predicting protein structure and function, reconstructing phylogeny and several other biological sequence analyses. Because it is an NP-hard problem, several studies ...
Work Stealing Strategies For Multi-Core Parallel Branch-and-Bound Algorithm Using Factorial Number System
Many real-world problems in different industrial and economic fields are permutation combinatorial optimization problems. Solving large instances of these problems, such as the flowshop problem, to optimality is a challenge for multi-core computing. ...
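The factorial number system mentioned in the title encodes each permutation as a single integer, so any chunk of the search space can be described as a plain integer interval and handed to a thief. A small decoding sketch (illustrative, not the paper's implementation; valid for small n):

```c
/* The factorial number system maps an integer k in [0, n!) to the k-th
 * lexicographic permutation of n items, letting a branch-and-bound
 * search describe work units as integer intervals. */
#include <stdio.h>

/* Decode k into the k-th lexicographic permutation of 0..n-1 (n <= 20). */
void kth_permutation(long k, int n, int *perm) {
    int avail[32], len = n;
    for (int i = 0; i < n; i++) avail[i] = i;

    long fact = 1;
    for (int i = 2; i < n; i++) fact *= i;   /* (n-1)! */

    for (int i = 0; i < n; i++) {
        int d = (int)(k / fact);             /* next factoradic digit */
        k %= fact;
        perm[i] = avail[d];
        for (int j = d; j < len - 1; j++)    /* remove chosen element */
            avail[j] = avail[j + 1];
        len--;
        if (i < n - 1) fact /= (n - 1 - i);
    }
}

int main(void) {
    int p[4];
    kth_permutation(13, 4, p);     /* 13 = 2*3! + 0*2! + 1*1! + 0*0! */
    for (int i = 0; i < 4; i++) printf("%d ", p[i]);
    printf("\n");                  /* prints: 2 0 3 1 */
    return 0;
}
```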
Palirria: Accurate On-line Parallelism Estimation for Adaptive Work-Stealing
We present Palirria, a self-adapting work-stealing scheduling method for nested fork/join parallelism that can be used to estimate the number of utilizable workers and self-adapt accordingly. The estimation mechanism is optimized for accuracy, ...
DWS: Demand-aware Work-Stealing in Multi-programmed Multi-core Architectures
Traditional work-stealing schedulers perform poorly in multi-programmed multi-core architectures, because all the programs tend to use all the cores and thus incur serious core contention. To mitigate this problem, this paper proposes a Demand-aware Work-...
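Both this paper and Palirria above refine the same baseline: each worker pops tasks from its own deque and steals from a victim's deque when its own runs dry. A deliberately simplified, mutex-based sketch of that baseline (real schedulers use lock-free Chase-Lev deques and per-thread random number generators; all names here are illustrative):

```c
/* Minimal work-stealing core: pop from the bottom of your own deque,
 * steal from the top of a victim's. Tasks are just integers summed
 * into a global counter so the result is checkable. Build with -pthread. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define NWORKERS 4
#define NTASKS   1000

typedef struct {
    int tasks[NTASKS];
    int top, bottom;                 /* steal at top, pop at bottom */
    pthread_mutex_t lock;
} deque_t;

static deque_t dq[NWORKERS];
static atomic_int  remaining = NTASKS;
static atomic_long result;

static int take(deque_t *d, int *out, int from_top) {
    int ok = 0;
    pthread_mutex_lock(&d->lock);
    if (d->top < d->bottom) {
        *out = from_top ? d->tasks[d->top++] : d->tasks[--d->bottom];
        ok = 1;
    }
    pthread_mutex_unlock(&d->lock);
    return ok;
}

static void *worker(void *arg) {
    int id = (int)(long)arg, task;
    while (atomic_load(&remaining) > 0) {
        if (take(&dq[id], &task, 0) ||                 /* own deque first */
            take(&dq[rand() % NWORKERS], &task, 1)) {  /* then steal */
            atomic_fetch_add(&result, task);
            atomic_fetch_sub(&remaining, 1);
        }
    }
    return NULL;
}

int main(void) {
    for (int i = 0; i < NWORKERS; i++)
        pthread_mutex_init(&dq[i].lock, NULL);
    for (int t = 0; t < NTASKS; t++)       /* seed all work on worker 0 */
        dq[0].tasks[dq[0].bottom++] = t;

    pthread_t th[NWORKERS];
    for (long i = 0; i < NWORKERS; i++)
        pthread_create(&th[i], NULL, worker, (void *)i);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(th[i], NULL);

    printf("sum = %ld (expect %d)\n", (long)atomic_load(&result),
           NTASKS * (NTASKS - 1) / 2);
    return 0;
}
```

Palirria's contribution sits in deciding how many workers such a pool should have; DWS's sits in choosing victims and core counts per program so that co-running programs stop fighting over all the cores.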
Reachability Analysis of Cost-Reward Timed Automata for Energy Efficiency Scheduling
As the ongoing scaling of semiconductor technology causes a severe increase in on-chip power density in microprocessors, power management is urgently required at every level of computer system design. In this paper, we describe an ...
Recommendations
Synergistic execution of stream programs on multicores with accelerators
LCTES '09: Proceedings of the 2009 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
The StreamIt programming model has been proposed to exploit parallelism in streaming applications on general purpose multicore architectures. The StreamIt graphs describe task, data and pipeline parallelism which can be exploited on accelerators such as ...
Manycores in the future
HPCC'07: Proceedings of the Third international conference on High Performance Computing and Communications
The change from single core to multicore processors is expected to continue, taking us to manycore chips (64 processors) and beyond. Cores are more numerous, but not faster. They also may be less reliable. Chip-level parallelism raises important ...