- research-article, June 2008
Automatic analysis of speedup of MPI applications
ICS '08: Proceedings of the 22nd annual international conference on Supercomputing, Pages 349–358, https://doi.org/10.1145/1375527.1375578
The intricacy of high performance computing applications has been growing very fast in the last years. Only skilled analysts are able to determine the factors that are undermining the performance of up-to-date applications. Analyst time is a very ...
- research-article, June 2008
Efficient computation of sum-products on GPUs through software-managed cache
ICS '08: Proceedings of the 22nd annual international conference on Supercomputing, Pages 309–318, https://doi.org/10.1145/1375527.1375572
We present a technique for designing memory-bound algorithms with high data reuse on Graphics Processing Units (GPUs) equipped with close-to-ALU software-managed memory. The approach is based on the efficient use of this memory through the ...
- research-article, June 2008
Phasers: a unified deadlock-free construct for collective and point-to-point synchronization
ICS '08: Proceedings of the 22nd annual international conference on Supercomputing, Pages 277–288, https://doi.org/10.1145/1375527.1375568
Coordination and synchronization of parallel tasks is a major source of complexity in parallel programming. These constructs take many forms in practice including mutual exclusion in accesses to shared resources, termination detection of child tasks, ...
- research-article, June 2008
Performance portable optimizations for loops containing communication operations
ICS '08: Proceedings of the 22nd annual international conference on Supercomputing, Pages 266–276, https://doi.org/10.1145/1375527.1375567
Effective use of communication networks is critical to the performance and scalability of parallel applications. Partitioned Global Address Space languages like UPC bring the promise of performance and programmer productivity. Studies of well-tuned ...
- research-article, June 2008
Three-dimensional Delaunay refinement for multi-core processors
ICS '08: Proceedings of the 22nd annual international conference on Supercomputing, Pages 214–224, https://doi.org/10.1145/1375527.1375560
We develop the first ever fully functional three-dimensional guaranteed quality parallel graded Delaunay mesh generator. First, we prove a criterion and a sufficient condition of Delaunay-independence of Steiner points in three dimensions. Based on ...
- research-article, June 2008
Fast scan algorithms on graphics processors
ICS '08: Proceedings of the 22nd annual international conference on Supercomputing, Pages 205–213, https://doi.org/10.1145/1375527.1375559
Scan and segmented scan are important data-parallel primitives for a wide range of applications. We present fast, work-efficient algorithms for these primitives on graphics processing units (GPUs). We use novel data representations that map well to the ...
- research-article, June 2008
Advanced collective communication in Aspen
ICS '08: Proceedings of the 22nd annual international conference on Supercomputing, Pages 83–93, https://doi.org/10.1145/1375527.1375543
Aspen is a programming language that relies on high-level messaging to support communication among different program tasks executing in parallel. Unlike MPI, the computational logic of Aspen tasks is specified and developed independently of the global ...
- research-article, June 2008
Preserving time in large-scale communication traces
ICS '08: Proceedings of the 22nd annual international conference on Supercomputing, Pages 46–55, https://doi.org/10.1145/1375527.1375537
Analyzing the performance of large-scale scientific applications is becoming increasingly difficult due to the sheer size of performance data gathered. Recent work on scalable communication tracing applies online interprocess compression to address this ...
- research-article, June 2008
Implementing Wilson-Dirac operator on the Cell Broadband Engine
ICS '08: Proceedings of the 22nd annual international conference on Supercomputing, Pages 4–14, https://doi.org/10.1145/1375527.1375532
Computing the actions of Wilson-Dirac operator contributes most of the CPU time for the grand challenge problem of simulating Lattice Quantum Chromodynamics (Lattice QCD). This routine exhibits many challenges in implementation on most computational ...
- keynote, June 2008
Many-core GPU computing with NVIDIA CUDA
ICS '08: Proceedings of the 22nd annual international conference on Supercomputing, Page 1, https://doi.org/10.1145/1375527.1375528
In the past, graphics processors were special-purpose hardwired application accelerators, suitable only for conventional graphics applications. Modern GPUs are fully programmable, massively parallel floating point processors. In this talk I will ...