Global-aware and multi-order context-based prefetching for high-performance processors
Data prefetching is widely used in high-end computing systems to accelerate data accesses and to bridge the increasing performance gap between processor and memory. Context-based prefetching has become a primary focus of study in recent years due to its ...
Periodic hierarchical load balancing for large supercomputers
Large parallel machines with hundreds of thousands of processors are becoming more prevalent. Ensuring good load balance is critical for scaling certain classes of parallel applications on even thousands of processors. Centralized load balancing ...
Sparse triangular solves for ILU revisited: data layout crucial to better performance
A key to good processor utilization for sparse matrix computations is storing the data in the format that is most conducive to fast access by the memory system. In particular, for sparse matrix triangular solves the traditional compressed sparse matrix ...
A general method for modeling on irregular grids
For simulation on a spherical surface, such as global numerical weather prediction, icosahedral grids are superior to their competitors in uniformity of grid mesh distance across the entire globe and lack of neighboring grid cells that share only a ...
Color and texture analysis using emerging parallel architectures
While image texture is effective for use in pattern-recognition and image-analysis algorithms, textural features are time-consuming to calculate on standard CPUs. Therefore, we present novel implementations of textural-feature algorithms on graphics ...
Trace-based performance analysis for the petascale simulation code FLASH
Performance analysis of applications on modern high-end petascale systems is increasingly challenging due to the rising complexity and quantity of the computing units. This paper presents a performance-analysis study using the Vampir performance-...
Fast iterative solution of large sparse linear systems on geographically separated clusters
Parallel asynchronous iterative algorithms exhibit features that are extremely well-suited for Grid computing, such as lack of synchronization points. Unfortunately, they also suffer from slow convergence rates. In this paper we propose using ...
Measuring TeraGrid: workload characterization for a high-performance computing federation
TeraGrid has deployed a significant monitoring and accounting infrastructure in order to understand its operational success. In this paper, we present an analysis of the jobs reported by TeraGrid for 2008. We consider the workload from several ...
Scalability studies and large grid computations for surface combatant using CFDShip-Iowa
Scalability studies and computations using the largest grids to date for free-surface flows are performed using message-passing interface (MPI)-based CFDShip-Iowa toolbox curvilinear (V4) and Cartesian (V6) grid solvers on Navy high-performance ...
Parallel solution of the obstacle problem in Grid environments
The present study deals with the solution of the obstacle problem defined in a three-dimensional domain. In order to solve a large-scale obstacle problem, the use of parallelism is necessary. In this work we present a parallel synchronous iterative ...
The Combinatorial BLAS: design, implementation, and applications
This paper presents a scalable high-performance software library to be used for graph analysis and data mining. Large combinatorial graphs appear in many applications of high-performance computing, including computational biology, informatics, analytics, ...