Not all GPUs are created equal: characterizing variability in large-scale, accelerator-rich systems
SC '22: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, Article No.: 65, Pages 1–15
Scientists are increasingly exploring and utilizing the massive parallelism of general-purpose accelerators such as GPUs for scientific breakthroughs. As a result, datacenters, hyperscalers, national computing centers, and supercomputers have procured ...
- research-article, November 2020
A submatrix-based method for approximate matrix function evaluation in the quantum chemistry code CP2K
SC '20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Article No.: 80, Pages 1–14
Electronic structure calculations based on density-functional theory (DFT) represent a significant part of today's HPC workloads and pose high demands on high-performance computing resources. To perform these quantum-mechanical DFT calculations on ...
- research-article, November 2020
TOSS-2020: a commodity software stack for HPC
- Edgar A. León,
- Trent D'Hooge,
- Nathan Hanford,
- Ian Karlin,
- Ramesh Pankajakshan,
- Jim Foraker,
- Chris Chambreau,
- Matthew L. Leininger
SC '20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Article No.: 40, Pages 1–15
The simulation environment of any HPC platform is key to the performance, portability, and productivity of scientific applications. This environment has traditionally been provided by platform vendors, presenting challenges for HPC centers and users ...
- research-article, November 2020
Creating an agile hardware design flow
- Rick Bahr,
- Clark Barrett,
- Nikhil Bhagdikar,
- Alex Carsello,
- Ross Daly,
- Caleb Donovick,
- David Durst,
- Kayvon Fatahalian,
- Kathleen Feng,
- Pat Hanrahan,
- Teguh Hofstee,
- Mark Horowitz,
- Dillon Huff,
- Fredrik Kjolstad,
- Taeyoung Kong,
- Qiaoyi Liu,
- Makai Mann,
- Jackson Melchert,
- Ankita Nayak,
- Aina Niemetz,
- Gedeon Nyengele,
- Priyanka Raina,
- Stephen Richardson,
- Raj Setaluri,
- Jeff Setter,
- Kavya Sreedhar,
- Maxwell Strange,
- James Thomas,
- Christopher Torng,
- Leonard Truong,
- Nestan Tsiskaridze,
- Keyi Zhang
DAC '20: Proceedings of the 57th ACM/EDAC/IEEE Design Automation Conference, Article No.: 142, Pages 1–6
Although an agile approach is standard for software design, how to properly adapt this method to hardware is still an open question. This work addresses this question while building a system on chip (SoC) with specialized accelerators. Rather than using ...
- research-article, September 2020
SOFF: an OpenCL high-level synthesis framework for FPGAs
ISCA '20: Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture, Pages 295–308, https://doi.org/10.1109/ISCA45697.2020.00034
Recently, OpenCL has been emerging as a programming model for energy-efficient FPGA accelerators. However, the state-of-the-art OpenCL frameworks for FPGAs suffer from poor performance and usability. This paper proposes a high-level synthesis framework ...
- research-article, March 2019
Reducing communication in parallel graph search algorithms with software caches
International Journal of High Performance Computing Applications (SAGE-HPCA), Volume 33, Issue 2, Pages 384–396, https://doi.org/10.1177/1094342018762510
In many scientific and computational domains, graphs are used to represent and analyze data. Such graphs often exhibit the characteristics of small-world networks: few high-degree vertexes connect many low-degree vertexes. Despite the randomness in a ...
- research-article, June 2018
Enabling scientific computing on memristive accelerators
ISCA '18: Proceedings of the 45th Annual International Symposium on Computer Architecture, Pages 367–382, https://doi.org/10.1109/ISCA.2018.00039
Linear algebra is ubiquitous across virtually every field of science and engineering, from climate modeling to macroeconomics. This ubiquity makes linear algebra a prime candidate for hardware acceleration, which can improve both the run time and the ...
- research-article, June 2018
A configurable cloud-scale DNN processor for real-time AI
- Jeremy Fowers,
- Kalin Ovtcharov,
- Michael Papamichael,
- Todd Massengill,
- Ming Liu,
- Daniel Lo,
- Shlomi Alkalay,
- Michael Haselman,
- Logan Adams,
- Mahdi Ghandi,
- Stephen Heil,
- Prerak Patel,
- Adam Sapek,
- Gabriel Weisz,
- Lisa Woods,
- Sitaram Lanka,
- Steven K. Reinhardt,
- Adrian M. Caulfield,
- Eric S. Chung,
- Doug Burger
ISCA '18: Proceedings of the 45th Annual International Symposium on Computer Architecture, Pages 1–14, https://doi.org/10.1109/ISCA.2018.00012
Interactive AI-powered services require low-latency evaluation of deep neural network (DNN) models, aka "realtime AI". The growing demand for computationally expensive, state-of-the-art DNNs, coupled with diminishing performance gains of general-...
- research-article, October 2017
Lightweight SIMT core designs for intelligent 3D stacked DRAM
MEMSYS '17: Proceedings of the International Symposium on Memory Systems, Pages 49–59, https://doi.org/10.1145/3132402.3132426
In this work we present an analysis of the Harmonica stream multiprocessor, a light-weight, parameterized, open-source single-instruction-multiple-thread (SIMT) core designed for integration within 3D-stacked DRAM. We evaluate the range of Harmonica ...
- research-article, November 2016
ePython: an implementation of Python for the many-core Epiphany coprocessor
PyHPC '16: Proceedings of the 6th Workshop on Python for High-Performance and Scientific Computing, Pages 59–66
The Epiphany is a many-core, low power, low on-chip memory architecture and one can very cheaply gain access to a number of parallel cores which is beneficial for HPC education and prototyping. The very low power nature of these architectures also means ...
- research-article, November 2016
Runtime coordinated heterogeneous tasks in charm++
Effective utilization of the increasingly heterogeneous hardware in modern supercomputers is a significant challenge. Many applications have seen performance gains by using GPUs, but many implementations leave CPUs sitting idle.
In this paper, we ...
- research-article, November 2016
Extended task queuing: active messages for heterogeneous systems
- Michael LeBeane,
- Brandon Potter,
- Abhisek Pan,
- Alexandru Dutu,
- Vinay Agarwala,
- Wonchan Lee,
- Deepak Majeti,
- Bibek Ghimire,
- Eric Van Tassell,
- Samuel Wasmundt,
- Brad Benton,
- Mauricio Breternitz,
- Michael L. Chu,
- Mithuna Thottethodi,
- Lizy K. John,
- Steven K. Reinhardt
SC '16: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Article No.: 80, Pages 1–12
Accelerators have emerged as an important component of modern cloud, datacenter, and HPC computing environments. However, launching tasks on remote accelerators across a network remains unwieldy, forcing programmers to send data in large chunks to ...
- research-article, June 2016
Evaluation of an analog accelerator for linear algebra
ISCA '16: Proceedings of the 43rd International Symposium on Computer Architecture, Pages 570–582, https://doi.org/10.1109/ISCA.2016.56
Due to the end of supply voltage scaling and the increasing percentage of dark silicon in modern integrated circuits, researchers are looking for new scalable ways to get useful computation from existing silicon technology. In this paper we present a ...
Also Published in:
ACM SIGARCH Computer Architecture News: Volume 44 Issue 3
- Article, May 2015
Fast and Flexible Conversion of Geohash Codes to and from Latitude/Longitude Coordinates
FCCM '15: Proceedings of the 2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines, Pages 179–186, https://doi.org/10.1109/FCCM.2015.18
Insights extracted from spatial queries in geodatabase systems introduce significant opportunities for business intelligence. However, geodatabases are unable to keep up with the required performance due to the massive (and sky-rocketing) amounts of ...
- research-article, November 2014
A caching approach to reduce communication in graph search algorithms
DISCS '14: Proceedings of the 2014 International Workshop on Data Intensive Scalable Computing Systems, Pages 65–72, https://doi.org/10.1109/DISCS.2014.8
In many scientific and computational domains, graphs are used to represent and analyze data. Such graphs often exhibit the characteristics of small-world networks: few high-degree vertexes connect many low-degree vertexes. Despite the randomness in a ...
- Article, May 2013
Exploring SIMD for Molecular Dynamics, Using Intel® Xeon® Processors and Intel® Xeon Phi™ Coprocessors
IPDPS '13: Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing, Pages 1085–1097, https://doi.org/10.1109/IPDPS.2013.44
We analyse gather-scatter performance bottlenecks in molecular dynamics codes and the challenges that they pose for obtaining benefits from SIMD execution. This analysis informs a number of novel code-level and algorithmic improvements to Sandia's ...
- Article, November 2012
Developing Performance-Portable Molecular Dynamics Kernels in OpenCL
SCC '12: Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, Pages 386–395, https://doi.org/10.1109/SC.Companion.2012.58
This paper investigates the development of a molecular dynamics code that is highly portable between architectures. Using OpenCL, we develop an implementation of Sandia's miniMD benchmark that achieves good levels of performance across a wide range of ...
- research-article, June 2011
Probabilistic auto-tuning for architectures with complex constraints
EXADAPT '11: Proceedings of the 1st International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era, Pages 22–33, https://doi.org/10.1145/2000417.2000420
It is hard to optimize applications for coprocessor accelerator architectures, like FPGAs and GPUs, because application parameters must be tuned carefully to the size of the target architecture. Moreover, some combinations of parameters simply do not ...