A new model for integrated nested task and data parallel programming

High Performance Fortran (HPF) has emerged as a standard language for data parallel computing. However, a wide variety of scientific applications are best programmed by a combination of task and data parallelism. Therefore, a good model of task ...

High performance Fortran for highly irregular problems

We present a general data parallel formulation for highly irregular problems in High Performance Fortran (HPF). Our formulation consists of (1) a method for linearizing irregular data structures, (2) a data parallel implementation (in HPF) of graph ...

Space-efficient implementation of nested parallelism

Many of today's high level parallel languages support dynamic, fine-grained parallelism. These languages allow the user to expose all the parallelism in the program, which is typically of a much higher degree than the number of processors. Hence an ...
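The style of dynamic, fine-grained nested parallelism described above can be sketched as follows (a Python illustration, not the paper's code): a divide-and-conquer reduction exposes a task tree whose logical parallelism far exceeds the processor count, leaving scheduling and space management to the runtime.

```python
from concurrent.futures import ThreadPoolExecutor

def nested_sum(data, depth=0, max_depth=3):
    """Divide-and-conquer sum that spawns a subtask per half at each
    level, exposing many more logical tasks than physical processors."""
    if depth >= max_depth or len(data) <= 1:
        return sum(data)  # sequential base case
    mid = len(data) // 2
    # Each recursive level spawns its own fine-grained subtasks.
    with ThreadPoolExecutor(max_workers=2) as pool:
        left = pool.submit(nested_sum, data[:mid], depth + 1, max_depth)
        right = pool.submit(nested_sum, data[mid:], depth + 1, max_depth)
        return left.result() + right.result()
```

With `max_depth=3` this creates up to 8 leaf tasks regardless of core count; a space-efficient scheduler must bound the memory held by such excess tasks.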

Dynamic pointer alignment: tiling and communication optimizations for parallel pointer-based computations

Loop tiling and communication optimization, such as message pipelining and aggregation, can achieve optimized and robust memory performance by proactively managing storage and data movement. In this paper, we generalize these techniques to pointer-based ...

Compiler and software distributed shared memory support for irregular applications

We investigate the use of a software distributed shared memory (DSM) layer to support irregular computations on distributed memory machines. Software DSM supports irregular computation through demand fetching of data in response to memory access faults. ...

Space and time efficient execution of parallel irregular computations

Solving problems of large sizes is an important goal for parallel machines with multiple CPU and memory resources. In this paper, issues of efficient execution of overhead-sensitive parallel irregular computation under memory constraints are addressed. ...

The interaction of parallel programming constructs and coherence protocols

Some of the most common parallel programming idioms include locks, barriers, and reduction operations. The interaction of these programming idioms with the multiprocessor's coherence protocol has a significant impact on performance. In addition, the ...
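The three idioms named above can be shown together in one routine (an illustrative Python sketch, not from the paper): each thread computes a private partial sum, a barrier separates the compute phase from the combine phase, and a lock serializes the final reduction.

```python
import threading

def parallel_reduce(values, n_threads=4):
    """Sum `values` using the lock / barrier / reduction idioms."""
    total = [0]
    lock = threading.Lock()
    barrier = threading.Barrier(n_threads)
    partials = [0] * n_threads

    def worker(tid):
        chunk = values[tid::n_threads]  # strided slice of the input
        partials[tid] = sum(chunk)      # local work, no sharing
        barrier.wait()                  # barrier: all partials ready
        with lock:                      # lock-protected reduction
            total[0] += partials[tid]

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

Under an invalidation-based coherence protocol, the lock and the `total` cell ping-pong between caches in the combine phase, which is exactly the idiom/protocol interaction the paper studies.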

Ace: linguistic mechanisms for customizable protocols

Customizing the protocols used to manage accesses to different data structures within an application can improve the performance of shared-memory programs substantially [10, 21]. Existing systems for using customizable protocols are, however, hard to ...

Tradeoffs between false sharing and aggregation in software distributed shared memory

Software Distributed Shared Memory (DSM) systems based on virtual memory techniques traditionally use the hardware page as the consistency unit. The large size of the hardware page is considered to be a performance bottleneck because of the implied ...

Optimizing communication in HPF programs on fine-grain distributed shared memory

Unlike compiler-generated message-passing code, the coherence mechanisms in shared-memory systems work equally well for regular and irregular programs. In many programs, however, compile-time information about data accesses would permit data to be ...

Effective fine-grain synchronization for automatically parallelized programs using optimistic synchronization primitives

As shared-memory multiprocessors become the dominant commodity source of computation, parallelizing compilers must support mainstream computations that manipulate irregular, pointer-based data structures such as lists, trees, and graphs. Our experience ...
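As an illustrative sketch of the optimistic style (not the paper's actual primitives), the read-validate-retry pattern can be written around a compare-and-swap; here the CAS is emulated with a tiny commit lock, whereas on real hardware it would be a single atomic instruction.

```python
import threading

class OptimisticCell:
    """Optimistic update: read without locking, commit only if the
    value is unchanged since the read, and retry on conflict."""
    def __init__(self, value=0):
        self._value = value
        self._commit_lock = threading.Lock()  # stands in for a hardware CAS

    def read(self):
        return self._value  # unsynchronized read

    def compare_and_swap(self, expected, new):
        with self._commit_lock:
            if self._value == expected:
                self._value = new
                return True   # commit succeeded
            return False      # someone else won; caller retries

def optimistic_increment(cell):
    while True:                   # optimistic retry loop
        snapshot = cell.read()    # read without holding a lock
        if cell.compare_and_swap(snapshot, snapshot + 1):
            return
```

Under low contention the fast path touches no lock at all, which is what makes the optimistic primitives attractive for fine-grained updates to pointer-based structures.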

Experiences with non-numeric applications on multithreaded architectures

Distributed-memory machines have proved successful for many challenging numerical programs that can be split into largely independent computation-intensive subtasks requiring little data exchange (although the amount of exchanged data may be large). ...

Automatic placement of communications in mesh-partitioning parallelization

We present a tool for mesh-partitioning parallelization of numerical programs working iteratively on an unstructured mesh. This conventional method splits a mesh into sub-meshes, adding some overlap on the boundaries of the sub-meshes. The program is ...

Parallel breadth-first BDD construction

With the increasing complexity of protocol and circuit designs, formal verification has become an important research area and binary decision diagrams (BDDs) have been shown to be a powerful tool in formal verification. This paper presents a parallel ...

Experience with efficient array data flow analysis for array privatization

Array data flow analysis is known to be crucial to the success of array privatization, one of the most important techniques for program parallelization. It is clear that array data flow analysis should be performed interprocedurally and symbolically, ...
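A toy example (in Python; the names and code are hypothetical, not from the paper) of the property array data flow analysis must establish: the scratch array `t` is completely written before it is read in every iteration, so no value flows between iterations and each thread may receive a private copy.

```python
from concurrent.futures import ThreadPoolExecutor

def smooth(rows):
    """Sequential loop. Data flow analysis can prove `t` privatizable:
    every element of `t` is defined before any use in each iteration."""
    out = []
    t = [0.0] * len(rows[0])          # shared scratch array
    for row in rows:
        for j in range(len(row)):     # t is written first ...
            t[j] = row[j] * 0.5
        out.append(sum(t))            # ... then read: no cross-iteration flow
    return out

def smooth_parallel(rows):
    """Parallel version after privatization: each task builds its own t."""
    def body(row):
        t = [x * 0.5 for x in row]    # privatized copy of the scratch array
        return sum(t)
    with ThreadPoolExecutor() as pool:
        return list(pool.map(body, rows))
```

Without privatization the writes to the shared `t` would be a loop-carried output dependence and block parallelization entirely.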

Compiling dynamic mappings with array copies

Array remappings are useful to many applications on distributed memory parallel machines. They are available in High Performance Fortran, a Fortran-based data-parallel language. This paper describes techniques to handle dynamic mappings through simple ...

Compilation of parallel multimedia computations—extending retiming theory and Amdahl's law

Multimedia applications (also called multimedia systems) operate on datastreams, which are periodic sequences of data elements, called datasets. A large class of multimedia applications is described by the macro-dataflow graph model, with nodes ...

Relaxed consistency and coherence granularity in DSM systems: a performance evaluation

During the past few years, two main approaches have been taken to improve the performance of software shared memory implementations: relaxing consistency models and providing fine-grained access control. Their performance tradeoffs, however, were not well ...

Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code

An elementary, machine-independent, recursive algorithm for matrix multiplication C+=A*B provides implicit blocking at every level of the memory hierarchy and tests out faster than classically optimized code, tracking hand-coded BLAS3 routines. Proof of ...
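The recursive idea can be sketched as follows (a Python illustration assuming square matrices of power-of-two order; the paper's algorithm is in this spirit, but this is not its code): splitting C += A*B into quadrant subproblems re-blocks the computation at every recursion level, so working sets shrink to fit each cache level in turn with no explicit tile-size tuning.

```python
def matmul_add(A, B, C, n=None, ai=0, aj=0, bi=0, bj=0, ci=0, cj=0, cutoff=2):
    """C[ci:ci+n][cj:cj+n] += A[ai:ai+n][aj:aj+n] * B[bi:bi+n][bj:bj+n].
    Assumes square matrices whose order is a power of two."""
    if n is None:
        n = len(A)
    if n <= cutoff:
        # Base case: ordinary triple loop on a small, cache-resident block.
        for i in range(n):
            for k in range(n):
                a = A[ai + i][aj + k]
                for j in range(n):
                    C[ci + i][cj + j] += a * B[bi + k][bj + j]
        return
    h = n // 2
    # Recurse on the eight quadrant products, e.g. C11 += A11*B11 + A12*B21.
    for i2 in (0, 1):
        for j2 in (0, 1):
            for k2 in (0, 1):
                matmul_add(A, B, C, h,
                           ai + i2 * h, aj + k2 * h,
                           bi + k2 * h, bj + j2 * h,
                           ci + i2 * h, cj + j2 * h,
                           cutoff)
```

Because the recursion blocks implicitly at every scale, the same source code tracks cache-tuned kernels across machines with very different memory hierarchies.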

Application restructuring and performance portability on shared virtual memory and hardware-coherent multiprocessors

The performance portability of parallel programs across a wide range of emerging coherent shared address space systems is not well understood. Programs that run well on efficient, hardware cache-coherent systems often do not perform well on less optimal ...

Performance implications of communication mechanisms in all-software global address space systems

Global addressing of shared data simplifies parallel programming and complements message passing models commonly found in distributed memory machines. A number of programming systems have been designed that synthesize global addressing purely in ...

Shared-memory performance profiling

This paper describes a new approach to finding performance bottlenecks in shared-memory parallel programs and its embodiment in the Paradyn Parallel Performance Tools running with the Blizzard fine-grain distributed shared memory system. This approach ...

Improving parallel shear-warp volume rendering on shared address space multiprocessors

This paper presents a new parallel volume rendering algorithm and implementation, based on shear warp factorization, for shared address space multiprocessors. Starting from an existing parallel shear-warp renderer, we use increasingly detailed ...

An effective garbage collection strategy for parallel programming languages on large scale distributed-memory machines

This paper describes the design and implementation of a garbage collection scheme on large-scale distributed-memory computers and reports various experimental results. The collector is based on the conservative GC library by Boehm & Weiser. Each ...

LoPC: modeling contention in parallel algorithms

Parallel algorithm designers need computational models that take first order system costs into account, but are also simple enough to use in practice. This paper introduces the LoPC model, which is inspired by the LogP model but accounts for contention ...
