- poster, September 2010
Automatic vector instruction selection for dynamic compilation
PACT '10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques, Pages 573–574, https://doi.org/10.1145/1854273.1854358
Accelerating program performance via short SIMD vector units is very common in modern processors, as evidenced by the use of SSE, MMX, and AltiVec SIMD instructions in multimedia, scientific, and embedded applications. To take full advantage of the ...
- poster, September 2010
A software-SVM-based transactional memory for multicore accelerator architectures with local memory
PACT '10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques, Pages 567–568, https://doi.org/10.1145/1854273.1854355
We propose a software transactional memory (STM) for heterogeneous multicores with small local memory. The heterogeneous multicore architecture consists of a general-purpose processor element (GPE) and multiple accelerator processor elements (APEs). The ...
- poster, September 2010
DMATiler: revisiting loop tiling for direct memory access
PACT '10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques, Pages 559–560, https://doi.org/10.1145/1854273.1854351
In this paper we present the design and implementation of a DMATiler which combines compiler analysis and runtime management to optimize local memory performance. In traditional cache model based loop tiling optimizations, the compiler approximates ...
- poster, September 2010
An integer programming framework for optimizing shared memory use on GPUs
PACT '10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques, Pages 553–554, https://doi.org/10.1145/1854273.1854348
General purpose computing using GPUs is becoming increasingly popular, because of GPU's extremely favorable performance/price ratio. Like standard processors, GPUs also have a memory hierarchy, which must be carefully optimized for in order to achieve ...
- poster, September 2010
Analyzing cache performance bottlenecks of STM applications and addressing them with compiler's help
PACT '10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques, Pages 547–548, https://doi.org/10.1145/1854273.1854345
Software transactional memory (STM) is a promising programming paradigm for shared memory multithreaded programs as an alternative to traditional lock based synchronization. However adoption of STM in mainstream software has been quite low due to its ...
- poster, September 2010
Improving speculative loop parallelization via selective squash and speculation reuse
PACT '10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques, Pages 543–544, https://doi.org/10.1145/1854273.1854343
Speculative parallelization is a powerful technique to parallelize loops with irregular data dependencies. In this poster, we present a value-based selective squash protocol and an optimistic speculation reuse technique that leverages an extended notion ...
- poster, September 2010
Ordered and unordered algorithms for parallel breadth first search
PACT '10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques, Pages 539–540, https://doi.org/10.1145/1854273.1854341
We describe and evaluate ordered and unordered algorithms for shared-memory parallel breadth-first search. The unordered algorithm is based on viewing breadth-first search as a fixpoint computation, and in general, it may perform more work than the ...
- poster, September 2010
Believe it or not!: mult-core CPUs can match GPU performance for a FLOP-intensive application!
PACT '10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques, Pages 537–538, https://doi.org/10.1145/1854273.1854340
In this paper, we evaluate performance of a real-world image processing application that uses a cross-correlation algorithm to compare a given image with a reference one. We implement this algorithm on a nVidia GTX 285 GPU using CUDA, and also ...
- research-article, September 2010
Tiled-MapReduce: optimizing resource usages of data-parallel applications on multicore with tiling
PACT '10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques, Pages 523–534, https://doi.org/10.1145/1854273.1854337
The prevalence of chip multiprocessor opens opportunities of running data-parallel applications originally in clusters on a single machine with many cores. MapReduce, a simple and elegant programming model to program large scale clusters, has recently ...
- research-article, September 2010
Compiler-assisted data distribution for chip multiprocessors
PACT '10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques, Pages 501–512, https://doi.org/10.1145/1854273.1854335
Data access latency, a limiting factor in the performance of chip multiprocessors, grows significantly with the number of cores in non-uniform cache architectures with distributed cache banks. To mitigate this effect, it is necessary to leverage the ...
- research-article, September 2010
Using memory mapping to support cactus stacks in work-stealing runtime systems
PACT '10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques, Pages 411–420, https://doi.org/10.1145/1854273.1854324
Many multithreaded concurrency platforms that use a work-stealing runtime system incorporate a "cactus stack," wherein a function's accesses to stack variables properly respect the function's calling ancestry, even when many of the functions operate in ...
- research-article, September 2010
AM++: a generalized active message framework
PACT '10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques, Pages 401–410, https://doi.org/10.1145/1854273.1854323
Active messages have proven to be an effective approach for certain communication problems in high performance computing. Many MPI implementations, as well as runtimes for Partitioned Global Address Space languages, use active messages in their low-...
- research-article, September 2010
The Paralax infrastructure: automatic parallelization with a helping hand
PACT '10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques, Pages 389–400, https://doi.org/10.1145/1854273.1854322
Speeding up sequential programs on multicores is a challenging problem that is in urgent need of a solution. Automatic parallelization of irregular pointer-intensive codes, exemplified by the SPECint codes, is a very hard problem. This paper shows that, ...
- research-article, September 2010
Semi-automatic extraction and exploitation of hierarchical pipeline parallelism using profiling information
PACT '10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques, Pages 377–388, https://doi.org/10.1145/1854273.1854321
In recent years multi-core computer systems have left the realm of high-performance computing and virtually all of today's desktop computers and embedded computing systems are equipped with several processing cores. Still, no single parallel programming ...
- research-article, September 2010
An empirical characterization of stream programs and its implications for language and compiler design
PACT '10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques, Pages 365–376, https://doi.org/10.1145/1854273.1854319
Stream programs represent an important class of high-performance computations. Defined by their regular processing of sequences of data, stream programs appear most commonly in the context of audio, video, and digital signal processing, though also in ...
- research-article, September 2010
Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems
PACT '10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques, Pages 353–364, https://doi.org/10.1145/1854273.1854318
Ocelot is a dynamic compilation framework designed to map the explicitly data parallel execution model used by NVIDIA CUDA applications onto diverse multithreaded platforms. Ocelot includes a dynamic binary translator from Parallel Thread eXecution ISA (...
- research-article, September 2010
A model for fusion and code motion in an automatic parallelizing compiler
PACT '10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques, Pages 343–352, https://doi.org/10.1145/1854273.1854317
Loop fusion has been studied extensively, but in a manner isolated from other transformations. This was mainly due to the lack of a powerful intermediate representation for application of compositions of high-level transformations. Fusion presents ...
- research-article, September 2010
Partitioning streaming parallelism for multi-cores: a machine learning based approach
PACT '10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques, Pages 307–318, https://doi.org/10.1145/1854273.1854313
Stream based languages are a popular approach to expressing parallelism in modern applications. The efficient mapping of streaming parallelism to multi-core processors is, however, highly dependent on the program and underlying architecture. We address ...
- research-article, September 2010
Efficient sequential consistency using conditional fences
PACT '10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques, Pages 295–306, https://doi.org/10.1145/1854273.1854312
Among the various memory consistency models, the sequential consistency (SC) model, in which memory operations appear to take place in the order specified by the program, is the most intuitive and enables programmers to reason about their parallel ...
- research-article, September 2010
Discovering and understanding performance bottlenecks in transactional applications
PACT '10: Proceedings of the 19th international conference on Parallel architectures and compilation techniques, Pages 285–294, https://doi.org/10.1145/1854273.1854311
Many researchers have developed applications using transactional memory (TM) with the purpose of benchmarking different implementations, and studying whether or not TM is easy to use. However, comparatively little has been done to provide general-purpose ...