Reference idempotency analysis: a framework for optimizing speculative execution
Recent proposals for multithreaded architectures allow threads with unknown dependences to execute speculatively in parallel. These architectures use hardware speculative storage to buffer uncertain data, track data dependences and roll back incorrect ...
Pointer and escape analysis for multithreaded programs
This paper presents a new combined pointer and escape analysis for multithreaded programs. The algorithm uses a new abstraction called parallel interaction graphs to analyze the interactions between threads and extract precise points-to, escape, and ...
Language support for Morton-order matrices
The uniform representation of 2-dimensional arrays serially in Morton order (or Z order) supports both their iterative scan with Cartesian indices and their divide-and-conquer manipulation as quaternary trees. This data structure is important ...
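Morton order linearizes a 2-D array by interleaving the bits of the row and column indices, so each quadrant of the array occupies a contiguous index range. The sketch below (function name `morton_index` is ours, not from the paper) shows the standard bit-interleaving computation:

```python
def morton_index(i: int, j: int, bits: int = 16) -> int:
    """Interleave the bits of row i and column j into one Morton index.

    Column bits land in the even positions and row bits in the odd
    positions, so consecutive indices trace the recursive Z-shaped
    curve over the array.
    """
    z = 0
    for b in range(bits):
        z |= ((j >> b) & 1) << (2 * b)      # column bit -> even position
        z |= ((i >> b) & 1) << (2 * b + 1)  # row bit    -> odd position
    return z
```

Because the four quadrants of a 2^n x 2^n array map to four contiguous index ranges, quadtree-style divide-and-conquer code can recurse on subarrays without strided access, which is what makes the representation cache friendly.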
Efficient load balancing for wide-area divide-and-conquer applications
Divide-and-conquer programs are easily parallelized by letting the programmer annotate potential parallelism in the form of spawn and sync constructs. To achieve efficient program execution, the generated work load has to be balanced evenly among the ...
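The spawn/sync style of annotation can be imitated in plain Python, using `concurrent.futures` tasks as a stand-in for spawned subcomputations; this is only an illustrative sketch (the names `dc_sum`, `pool`, and `depth` are ours), not the system described in the paper:

```python
from concurrent.futures import ThreadPoolExecutor

def dc_sum(data, pool, depth=2):
    """Divide-and-conquer sum: at the top `depth` levels, 'spawn' the
    left half as a task and recurse on the right locally; 'sync' by
    waiting on the spawned future. Below the depth cap, run serially
    so task-creation overhead does not dominate."""
    if depth == 0 or len(data) <= 1:
        return sum(data)
    mid = len(data) // 2
    left = pool.submit(dc_sum, data[:mid], pool, depth - 1)  # spawn
    right = dc_sum(data[mid:], pool, depth - 1)              # run locally
    return left.result() + right                             # sync

with ThreadPoolExecutor(max_workers=4) as pool:
    total = dc_sum(list(range(100)), pool)
```

Capping the spawn depth bounds the number of outstanding tasks (here at most four), which avoids exhausting the fixed worker pool; a real work-stealing runtime balances these tasks across machines instead.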
Scalable queue-based spin locks with timeout
Queue-based spin locks allow programs with busy-wait synchronization to scale to very large multiprocessors, without fear of starvation or performance-destroying contention. So-called try locks, traditionally based on non-scalable test-and-set locks, ...
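A "try lock" returns failure instead of waiting forever once a caller's patience runs out. The sketch below shows that interface on a simple test-and-set lock, the traditional (non-scalable) basis the abstract mentions; the class and method names are ours, and a Python `Lock` stands in for the hardware atomic:

```python
import time
import threading

class TestAndSetTryLock:
    """A simple (non-scalable) try lock: spin on one shared flag and give
    up after a patience interval. Queue-based try locks offer the same
    interface but keep waiters in a queue, each spinning on its own
    location, so contention does not grow with the number of waiters."""

    def __init__(self):
        self._held = False
        self._guard = threading.Lock()  # stands in for atomic test-and-set

    def try_acquire(self, patience: float) -> bool:
        deadline = time.monotonic() + patience
        while True:
            with self._guard:            # atomic test-and-set
                if not self._held:
                    self._held = True
                    return True
            if time.monotonic() >= deadline:
                return False             # timed out: caller must not enter
            time.sleep(0)                # yield briefly before retrying

    def release(self):
        with self._guard:
            self._held = False
```

The hard part the paper addresses is doing this with a queue: a timed-out waiter must unlink itself from the middle of the queue without blocking its neighbors.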
Contention elimination by replication of sequential sections in distributed shared memory programs
In shared memory programs contention often occurs at the transition between a sequential and a parallel section of the code. As all threads start executing the parallel section, they often access data just modified by the thread that executed the ...
Accurate data redistribution cost estimation in software distributed shared memory systems
Distributing data is one of the key problems in implementing efficient distributed-memory parallel programs. The problem becomes more difficult in programs where data redistribution between computational phases is considered. The global data ...
Dynamic adaptation to available resources for parallel computing in an autonomous network of workstations
Networks of workstations (NOWs), which are generally composed of autonomous compute elements networked together, are an attractive parallel computing platform since they offer high performance at low cost. The autonomous nature of the environment, ...
Source-level global optimizations for fine-grain distributed shared memory systems
This paper describes and evaluates the use of aggressive static analysis in Jackal, a fine-grain Distributed Shared Memory (DSM) system for Java. Jackal uses an optimizing, source-level compiler rather than the binary rewriting techniques employed by ...
High-level adaptive program optimization with ADAPT
Compile-time optimization is often limited by a lack of target machine and input data set knowledge. Without this information, compilers may be forced to make conservative assumptions to preserve correctness and to avoid performance degradation. In ...
Blocking and array contraction across arbitrarily nested loops using affine partitioning
Applicable to arbitrary sequences and nests of loops, affine partitioning is a program transformation framework that unifies many previously proposed loop transformations, including unimodular transforms, fusion, fission, reindexing, scaling and ...
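Blocking (tiling) is one of the transformations such a framework derives automatically. A hand-written sketch of the effect on matrix multiply, with flat row-major lists and a tile size `T` (names ours, for illustration only):

```python
def blocked_matmul(A, B, n, T=2):
    """n x n matrix multiply with loop blocking: the three loops are
    split into tile loops (ii, kk, jj) and intra-tile loops (i, k, j),
    so each T x T tile of A, B, and C is reused while it is still
    resident in cache."""
    C = [0.0] * (n * n)
    for ii in range(0, n, T):
        for kk in range(0, n, T):
            for jj in range(0, n, T):
                for i in range(ii, min(ii + T, n)):
                    for k in range(kk, min(kk + T, n)):
                        a = A[i * n + k]
                        for j in range(jj, min(jj + T, n)):
                            C[i * n + j] += a * B[k * n + j]
    return C
```

Affine partitioning generalizes this: rather than pattern-matching on perfect loop nests, it searches for affine mappings of iterations to processors and time steps that legalize tilings like this one across arbitrarily nested loops.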
Efficiency vs. portability in cluster-based network servers
Efficiency and portability are conflicting objectives for cluster-based network servers that distribute the clients' requests across the cluster based on the actual content requested. Our work is based on the observation that this efficiency vs. ...
Statistical scalability analysis of communication operations in distributed applications
Current trends in high performance computing suggest that users will soon have widespread access to clusters of multiprocessors with hundreds, if not thousands, of processors. This unprecedented degree of parallelism will undoubtedly expose scalability ...
LogGPS: a parallel computational model for synchronization analysis
We present a new parallel computational model, named LogGPS, which captures synchronization.
The LogGPS model is an extension of the LogGP model, which abstracts communication on parallel platforms. Although the LogGP model captures long messages with ...
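Under LogGP, the end-to-end time of a k-byte message is built from the latency L, the per-message overhead o, and the per-byte gap G. The sketch below encodes that cost and one plausible reading of the LogGPS refinement, in which messages longer than a threshold S are sent synchronously; the handshake term is our illustrative assumption, not the paper's exact formulation:

```python
def loggp_send_cost(k, L, o, G):
    """End-to-end LogGP time for one k-byte message: sender overhead,
    per-byte gap for the payload, wire latency, receiver overhead."""
    return o + (k - 1) * G + L + o

def loggps_send_cost(k, L, o, G, S):
    """Sketch of the LogGPS refinement: messages longer than threshold S
    go synchronously, so the sender first completes a rendezvous before
    transmitting. The round-trip handshake cost here is an assumption
    made for illustration."""
    cost = loggp_send_cost(k, L, o, G)
    if k > S:
        cost += 2 * (L + 2 * o)  # assumed request/acknowledge round trip
    return cost
```

The point of the S parameter is that an MPI-style runtime switches protocols at a message-size threshold, so a model without it mispredicts where senders block waiting for receivers.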
Index Terms
- Proceedings of the eighth ACM SIGPLAN symposium on Principles and practices of parallel programming