A new model for integrated nested task and data parallel programming
High Performance Fortran (HPF) has emerged as a standard language for data parallel computing. However, a wide variety of scientific applications are best programmed by a combination of task and data parallelism. Therefore, a good model of task ...
High performance Fortran for highly irregular problems
We present a general data parallel formulation for highly irregular problems in High Performance Fortran (HPF). Our formulation consists of (1) a method for linearizing irregular data structures, (2) a data parallel implementation (in HPF) of graph ...
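The linearization step the abstract mentions is, in spirit, the standard flattening of an irregular structure into dense arrays. A minimal sketch in Python (the paper itself works in HPF, and the function name here is illustrative, not the paper's): an adjacency-list graph packed into a compressed (CSR-style) pair of flat arrays that array-oriented data-parallel operations can then traverse.

```python
def linearize_graph(adj):
    """Flatten an adjacency-list graph into two dense arrays:
    `offsets[v]:offsets[v+1]` indexes vertex v's neighbors in `edges`.
    This CSR-style layout is one common way to make an irregular
    structure amenable to data-parallel array operations."""
    offsets, edges = [0], []
    for neighbors in adj:
        edges.extend(neighbors)       # append this vertex's edge list
        offsets.append(len(edges))    # record where the next vertex starts
    return offsets, edges

# Example: vertex 0 -> {1, 2}, vertex 1 -> {2}, vertex 2 -> {}
offsets, edges = linearize_graph([[1, 2], [2], []])
```
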
Space-efficient implementation of nested parallelism
Many of today's high level parallel languages support dynamic, fine-grained parallelism. These languages allow the user to expose all the parallelism in the program, which is typically of a much higher degree than the number of processors. Hence an ...
Dynamic pointer alignment: tiling and communication optimizations for parallel pointer-based computations
Loop tiling and communication optimization, such as message pipelining and aggregation, can achieve optimized and robust memory performance by proactively managing storage and data movement. In this paper, we generalize these techniques to pointer-based ...
Compiler and software distributed shared memory support for irregular applications
We investigate the use of a software distributed shared memory (DSM) layer to support irregular computations on distributed memory machines. Software DSM supports irregular computation through demand fetching of data in response to memory access faults. ...
Space and time efficient execution of parallel irregular computations
Solving problems of large sizes is an important goal for parallel machines with multiple CPU and memory resources. In this paper, issues of efficient execution of overhead-sensitive parallel irregular computation under memory constraints are addressed. ...
The interaction of parallel programming constructs and coherence protocols
Some of the most common parallel programming idioms include locks, barriers, and reduction operations. The interaction of these programming idioms with the multiprocessor's coherence protocol has a significant impact on performance. In addition, the ...
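The three idioms the abstract names can be shown together in a few lines. A hedged sketch in Python threads (the paper studies these idioms on multiprocessor coherence protocols, not in Python; the function and variable names are illustrative): each worker computes a private partial sum (reduction), serializes its update to the shared total (lock), and waits for all workers before proceeding (barrier).

```python
import threading

def parallel_sum(values, n_threads=4):
    """Sum `values` with n_threads workers, illustrating three common
    parallel idioms: per-thread reduction, a lock-protected shared
    update, and a barrier closing the combine phase."""
    total = [0]                              # shared accumulator
    lock = threading.Lock()
    barrier = threading.Barrier(n_threads)

    def worker(chunk):
        local = sum(chunk)                   # reduction: private partial sum
        with lock:                           # lock: serialize the shared update
            total[0] += local
        barrier.wait()                       # barrier: all threads finish the phase

    step = (len(values) + n_threads - 1) // n_threads
    threads = [threading.Thread(target=worker,
                                args=(values[i * step:(i + 1) * step],))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

The paper's point is that each of these idioms stresses the coherence protocol differently (e.g., a contended lock word ping-pongs between caches), which this sketch does not model.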
Ace: linguistic mechanisms for customizable protocols
Customizing the protocols used to manage accesses to different data structures within an application can improve the performance of shared-memory programs substantially [10, 21]. Existing systems for using customizable protocols are, however, hard to ...
Tradeoffs between false sharing and aggregation in software distributed shared memory
Software Distributed Shared Memory (DSM) systems based on virtual memory techniques traditionally use the hardware page as the consistency unit. The large size of the hardware page is considered to be a performance bottleneck because of the implied ...
Optimizing communication in HPF programs on fine-grain distributed shared memory
Unlike compiler-generated message-passing code, the coherence mechanisms in shared-memory systems work equally well for regular and irregular programs. In many programs, however, compile-time information about data accesses would permit data to be ...
Effective fine-grain synchronization for automatically parallelized programs using optimistic synchronization primitives
As shared-memory multiprocessors become the dominant commodity source of computation, parallelizing compilers must support mainstream computations that manipulate irregular, pointer-based data structures such as lists, trees, and graphs. Our experience ...
Experiences with non-numeric applications on multithreaded architectures
Distributed-memory machines have proved successful for many challenging numerical programs that can be split into largely independent computation-intensive subtasks requiring little data exchange (although the amount of exchanged data may be large). ...
Automatic placement of communications in mesh-partitioning parallelization
We present a tool for mesh-partitioning parallelization of numerical programs working iteratively on an unstructured mesh. This conventional method splits a mesh into sub-meshes, adding some overlap on the boundaries of the sub-meshes. The program is ...
Parallel breadth-first BDD construction
With the increasing complexity of protocol and circuit designs, formal verification has become an important research area and binary decision diagrams (BDDs) have been shown to be a powerful tool in formal verification. This paper presents a parallel ...
Experience with efficient array data flow analysis for array privatization
Array data flow analysis is known to be crucial to the success of array privatization, one of the most important techniques for program parallelization. It is clear that array data flow analysis should be performed interprocedurally and symbolically, ...
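The loop pattern that array privatization targets can be made concrete with a small example (a sketch of the general idea, not the paper's analysis; names are illustrative). A temporary array is privatizable when every iteration fully writes it before reading it, so no values flow between iterations and each parallel iteration may use its own copy.

```python
def scale_and_fold(a):
    """Each i-iteration writes all of `tmp` before reading it, so `tmp`
    carries no cross-iteration dependence: a parallelizing compiler can
    privatize it, giving every thread its own copy."""
    n, m = len(a), len(a[0])
    b = [[0] * m for _ in range(n)]
    tmp = [0] * m                              # shared buffer; privatization target
    for i in range(n):
        for j in range(m):                     # definition phase: tmp fully written
            tmp[j] = a[i][j] * 2
        for j in range(m):                     # use phase: reads only this iteration's writes
            b[i][j] = tmp[j] + tmp[m - 1 - j]
    return b
```

Proving the "written before read" property in real codes is exactly where the interprocedural, symbolic array data flow analysis of the paper comes in.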
Compiling dynamic mappings with array copies
Array remappings are useful to many applications on distributed memory parallel machines. They are available in High Performance Fortran, a Fortran-based data-parallel language. This paper describes techniques to handle dynamic mappings through simple ...
Compilation of parallel multimedia computations—extending retiming theory and Amdahl's law
Multimedia applications (also called multimedia systems) operate on datastreams, which are periodic sequences of data elements, called datasets. A large class of multimedia applications is described by the macro-dataflow graph model, with nodes ...
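Amdahl's law, which the title says the paper extends, is worth recalling: if a fraction p of the work parallelizes perfectly over n processors, overall speedup is 1 / ((1 - p) + p/n). A one-liner makes the classic form concrete (the paper's extension to periodic datastream pipelines is not reproduced here):

```python
def amdahl_speedup(p, n):
    """Classic Amdahl's law: speedup on n processors when a
    fraction p of the execution time is parallelizable."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with 90% parallel work, 10 processors give only ~5.26x.
```
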
Relaxed consistency and coherence granularity in DSM systems: a performance evaluation
- Yuanyuan Zhou,
- Liviu Iftode,
- Jaswinder Pal Singh,
- Kai Li,
- Brian R. Toonen,
- Ioannis Schoinas,
- Mark D. Hill,
- David A. Wood
During the past few years, two main approaches have been taken to improve the performance of software shared memory implementations: relaxing consistency models and providing fine-grained access control. Their performance tradeoffs, however, were not well ...
Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code
An elementary, machine-independent, recursive algorithm for matrix multiplication C+=A*B provides implicit blocking at every level of the memory hierarchy and tests out faster than classically optimized code, tracking hand-coded BLAS3 routines. Proof of ...
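The recursive scheme the abstract describes can be sketched in a few lines: split C += A*B into quadrant sub-products and recurse, so the working set shrinks to fit each cache level without any explicit, machine-specific block size. This Python sketch (index-offset arguments and the base case at n = 1 are illustrative; a practical version would bottom out at a larger block and use a tuned kernel) shows the divide-and-conquer structure:

```python
def matmul_add(A, B, C, n, ar=0, ac=0, br=0, bc=0, cr=0, cc=0):
    """Recursively accumulate C += A * B on the n x n submatrices starting
    at the given (row, col) offsets; n must be a power of two.  The
    recursion implicitly blocks for every level of the memory hierarchy."""
    if n == 1:                                # base case: scalar multiply-add
        C[cr][cc] += A[ar][ac] * B[br][bc]
        return
    h = n // 2
    # C_ij += A_ik * B_kj over all quadrant combinations (8 sub-products).
    for i in (0, h):
        for j in (0, h):
            for k in (0, h):
                matmul_add(A, B, C, h,
                           ar + i, ac + k,    # A quadrant (i, k)
                           br + k, bc + j,    # B quadrant (k, j)
                           cr + i, cc + j)    # C quadrant (i, j)
```
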
Application restructuring and performance portability on shared virtual memory and hardware-coherent multiprocessors
The performance portability of parallel programs across a wide range of emerging coherent shared address space systems is not well understood. Programs that run well on efficient, hardware cache-coherent systems often do not perform well on less optimal ...
Performance implications of communication mechanisms in all-software global address space systems
Global addressing of shared data simplifies parallel programming and complements message passing models commonly found in distributed memory machines. A number of programming systems have been designed that synthesize global addressing purely in ...
Shared-memory performance profiling
This paper describes a new approach to finding performance bottlenecks in shared-memory parallel programs and its embodiment in the Paradyn Parallel Performance Tools running with the Blizzard fine-grain distributed shared memory system. This approach ...
Improving parallel shear-warp volume rendering on shared address space multiprocessors
This paper presents a new parallel volume rendering algorithm and implementation, based on shear warp factorization, for shared address space multiprocessors. Starting from an existing parallel shear-warp renderer, we use increasingly detailed ...
An effective garbage collection strategy for parallel programming languages on large scale distributed-memory machines
This paper describes the design and implementation of a garbage collection scheme on large-scale distributed-memory computers and reports various experimental results. The collector is based on the conservative GC library by Boehm & Weiser. Each ...
LoPC: modeling contention in parallel algorithms
Parallel algorithm designers need computational models that take first order system costs into account, but are also simple enough to use in practice. This paper introduces the LoPC model, which is inspired by the LogP model but accounts for contention ...