A new model for integrated nested task and data parallel programming
High Performance Fortran (HPF) has emerged as a standard language for data parallel computing. However, a wide variety of scientific applications are best programmed by a combination of task and data parallelism. Therefore, a good model of task ...
High performance Fortran for highly irregular problems
We present a general data parallel formulation for highly irregular problems in High Performance Fortran (HPF). Our formulation consists of (1) a method for linearizing irregular data structures, (2) a data parallel implementation (in HPF) of graph ...
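The linearization step the abstract mentions is, in spirit, the standard flattening of an irregular structure into dense arrays. A minimal sketch in Python (the paper itself works in HPF, and the function name here is illustrative, not the paper's): an adjacency-list graph packed into a compressed (CSR-style) pair of flat arrays that array-oriented data-parallel operations can then traverse.

```python
def linearize_graph(adj):
    """Flatten an adjacency-list graph into two dense arrays:
    `offsets[v]:offsets[v+1]` indexes vertex v's neighbors in `edges`.
    This CSR-style layout is one common way to make an irregular
    structure amenable to data-parallel array operations."""
    offsets, edges = [0], []
    for neighbors in adj:
        edges.extend(neighbors)       # append this vertex's edge list
        offsets.append(len(edges))    # record where the next vertex starts
    return offsets, edges

# Example: vertex 0 -> {1, 2}, vertex 1 -> {2}, vertex 2 -> {}
offsets, edges = linearize_graph([[1, 2], [2], []])
```
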
Space-efficient implementation of nested parallelism
Many of today's high level parallel languages support dynamic, fine-grained parallelism. These languages allow the user to expose all the parallelism in the program, which is typically of a much higher degree than the number of processors. Hence an ...
Dynamic pointer alignment: tiling and communication optimizations for parallel pointer-based computations
Loop tiling and communication optimization, such as message pipelining and aggregation, can achieve optimized and robust memory performance by proactively managing storage and data movement. In this paper, we generalize these techniques to pointer-based ...
Compiler and software distributed shared memory support for irregular applications
We investigate the use of a software distributed shared memory (DSM) layer to support irregular computations on distributed memory machines. Software DSM supports irregular computation through demand fetching of data in response to memory access faults. ...
Space and time efficient execution of parallel irregular computations
Solving problems of large sizes is an important goal for parallel machines with multiple CPU and memory resources. In this paper, issues of efficient execution of overhead-sensitive parallel irregular computation under memory constraints are addressed. ...
The interaction of parallel programming constructs and coherence protocols
Some of the most common parallel programming idioms include locks, barriers, and reduction operations. The interaction of these programming idioms with the multiprocessor's coherence protocol has a significant impact on performance. In addition, the ...
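The three idioms the abstract names can be shown together in a few lines. A hedged sketch in Python threads (the paper studies these idioms on multiprocessor coherence protocols, not in Python; the function and variable names are illustrative): each worker computes a private partial sum (reduction), serializes its update to the shared total (lock), and waits for all workers before proceeding (barrier).

```python
import threading

def parallel_sum(values, n_threads=4):
    """Sum `values` with n_threads workers, illustrating three common
    parallel idioms: per-thread reduction, a lock-protected shared
    update, and a barrier closing the combine phase."""
    total = [0]                              # shared accumulator
    lock = threading.Lock()
    barrier = threading.Barrier(n_threads)

    def worker(chunk):
        local = sum(chunk)                   # reduction: private partial sum
        with lock:                           # lock: serialize the shared update
            total[0] += local
        barrier.wait()                       # barrier: all threads finish the phase

    step = (len(values) + n_threads - 1) // n_threads
    threads = [threading.Thread(target=worker,
                                args=(values[i * step:(i + 1) * step],))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

The paper's point is that each of these idioms stresses the coherence protocol differently (e.g., a contended lock word ping-pongs between caches), which this sketch does not model.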
Ace: linguistic mechanisms for customizable protocols
Customizing the protocols used to manage accesses to different data structures within an application can improve the performance of shared-memory programs substantially [10, 21]. Existing systems for using customizable protocols are, however, hard to ...
Tradeoffs between false sharing and aggregation in software distributed shared memory
Software Distributed Shared Memory (DSM) systems based on virtual memory techniques traditionally use the hardware page as the consistency unit. The large size of the hardware page is considered to be a performance bottleneck because of the implied ...
Optimizing communication in HPF programs on fine-grain distributed shared memory
Unlike compiler-generated message-passing code, the coherence mechanisms in shared-memory systems work equally well for regular and irregular programs. In many programs, however, compile-time information about data accesses would permit data to be ...
Effective fine-grain synchronization for automatically parallelized programs using optimistic synchronization primitives
As shared-memory multiprocessors become the dominant commodity source of computation, parallelizing compilers must support mainstream computations that manipulate irregular, pointer-based data structures such as lists, trees, and graphs. Our experience ...
Experiences with non-numeric applications on multithreaded architectures
Distributed-memory machines have proved successful for many challenging numerical programs that can be split into largely independent computation-intensive subtasks requiring little data exchange (although the amount of exchanged data may be large). ...
Automatic placement of communications in mesh-partitioning parallelization
We present a tool for mesh-partitioning parallelization of numerical programs working iteratively on an unstructured mesh. This conventional method splits a mesh into sub-meshes, adding some overlap on the boundaries of the sub-meshes. The program is ...
Parallel breadth-first BDD construction
With the increasing complexity of protocol and circuit designs, formal verification has become an important research area and binary decision diagrams (BDDs) have been shown to be a powerful tool in formal verification. This paper presents a parallel ...
Experience with efficient array data flow analysis for array privatization
Array data flow analysis is known to be crucial to the success of array privatization, one of the most important techniques for program parallelization. It is clear that array data flow analysis should be performed interprocedurally and symbolically, ...
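The loop pattern that array privatization targets can be made concrete with a small example (a sketch of the general idea, not the paper's analysis; names are illustrative). A temporary array is privatizable when every iteration fully writes it before reading it, so no values flow between iterations and each parallel iteration may use its own copy.

```python
def scale_and_fold(a):
    """Each i-iteration writes all of `tmp` before reading it, so `tmp`
    carries no cross-iteration dependence: a parallelizing compiler can
    privatize it, giving every thread its own copy."""
    n, m = len(a), len(a[0])
    b = [[0] * m for _ in range(n)]
    tmp = [0] * m                              # shared buffer; privatization target
    for i in range(n):
        for j in range(m):                     # definition phase: tmp fully written
            tmp[j] = a[i][j] * 2
        for j in range(m):                     # use phase: reads only this iteration's writes
            b[i][j] = tmp[j] + tmp[m - 1 - j]
    return b
```

Proving the "written before read" property in real codes is exactly where the interprocedural, symbolic array data flow analysis of the paper comes in.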
Compiling dynamic mappings with array copies
Array remappings are useful to many applications on distributed memory parallel machines. They are available in High Performance Fortran, a Fortran-based data-parallel language. This paper describes techniques to handle dynamic mappings through simple ...
Compilation of parallel multimedia computations—extending retiming theory and Amdahl's law
Multimedia applications (also called multimedia systems) operate on datastreams, which are periodic sequences of data elements, called datasets. A large class of multimedia applications is described by the macro-dataflow graph model, with nodes ...
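Amdahl's law, which the title says the paper extends, is worth recalling: if a fraction p of the work parallelizes perfectly over n processors, overall speedup is 1 / ((1 - p) + p/n). A one-liner makes the classic form concrete (the paper's extension to periodic datastream pipelines is not reproduced here):

```python
def amdahl_speedup(p, n):
    """Classic Amdahl's law: speedup on n processors when a
    fraction p of the execution time is parallelizable."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with 90% parallel work, 10 processors give only ~5.26x.
```
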
Relaxed consistency and coherence granularity in DSM systems: a performance evaluation
- Yuanyuan Zhou,
- Liviu Iftode,
- Jaswinder Pal Singh,
- Kai Li,
- Brian R. Toonen,
- Ioannis Schoinas,
- Mark D. Hill,
- David A. Wood
During the past few years, two main approaches have been taken to improve the performance of software shared memory implementations: relaxing consistency models and providing fine-grained access control. Their performance tradeoffs, however, were not well ...
Auto-blocking matrix-multiplication or tracking BLAS3 performance from source code
An elementary, machine-independent, recursive algorithm for matrix multiplication C+=A*B provides implicit blocking at every level of the memory hierarchy and tests out faster than classically optimized code, tracking hand-coded BLAS3 routines. Proof of ...
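The recursive scheme the abstract describes can be sketched in a few lines: split C += A*B into quadrant sub-products and recurse, so the working set shrinks to fit each cache level without any explicit, machine-specific block size. This Python sketch (index-offset arguments and the base case at n = 1 are illustrative; a practical version would bottom out at a larger block and use a tuned kernel) shows the divide-and-conquer structure:

```python
def matmul_add(A, B, C, n, ar=0, ac=0, br=0, bc=0, cr=0, cc=0):
    """Recursively accumulate C += A * B on the n x n submatrices starting
    at the given (row, col) offsets; n must be a power of two.  The
    recursion implicitly blocks for every level of the memory hierarchy."""
    if n == 1:                                # base case: scalar multiply-add
        C[cr][cc] += A[ar][ac] * B[br][bc]
        return
    h = n // 2
    # C_ij += A_ik * B_kj over all quadrant combinations (8 sub-products).
    for i in (0, h):
        for j in (0, h):
            for k in (0, h):
                matmul_add(A, B, C, h,
                           ar + i, ac + k,    # A quadrant (i, k)
                           br + k, bc + j,    # B quadrant (k, j)
                           cr + i, cc + j)    # C quadrant (i, j)
```
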
Application restructuring and performance portability on shared virtual memory and hardware-coherent multiprocessors
The performance portability of parallel programs across a wide range of emerging coherent shared address space systems is not well understood. Programs that run well on efficient, hardware cache-coherent systems often do not perform well on less optimal ...
Performance implications of communication mechanisms in all-software global address space systems
Global addressing of shared data simplifies parallel programming and complements message passing models commonly found in distributed memory machines. A number of programming systems have been designed that synthesize global addressing purely in ...
Shared-memory performance profiling
This paper describes a new approach to finding performance bottlenecks in shared-memory parallel programs and its embodiment in the Paradyn Parallel Performance Tools running with the Blizzard fine-grain distributed shared memory system. This approach ...
Improving parallel shear-warp volume rendering on shared address space multiprocessors
This paper presents a new parallel volume rendering algorithm and implementation, based on shear warp factorization, for shared address space multiprocessors. Starting from an existing parallel shear-warp renderer, we use increasingly detailed ...
An effective garbage collection strategy for parallel programming languages on large scale distributed-memory machines
This paper describes the design and implementation of a garbage collection scheme on large-scale distributed-memory computers and reports various experimental results. The collector is based on the conservative GC library by Boehm & Weiser. Each ...
LoPC: modeling contention in parallel algorithms
Parallel algorithm designers need computational models that take first order system costs into account, but are also simple enough to use in practice. This paper introduces the LoPC model, which is inspired by the LogP model but accounts for contention ...