skip to main content
research-article

An adaptive performance modeling tool for GPU architectures

Published: 09 January 2010 Publication History

Abstract

This paper presents an analytical model to predict the performance of
general-purpose applications on a GPU architecture. The model is designed to provide performance information to an auto-tuning compiler and assist it in narrowing down the search to the more promising implementations. It can also be incorporated into a tool to help programmers better assess the performance bottlenecks in their code. We analyze each GPU kernel and identify how the kernel exercises major GPU microarchitecture features. To identify the performance bottlenecks accurately, we introduce an abstract interpretation of a GPU kernel, work flow graph, based on which we estimate the execution time of a GPU kernel. We validated our performance model on the NVIDIA GPUs using CUDA (Compute Unified Device Architecture). For this purpose, we used data parallel benchmarks that stress different GPU microarchitecture events such as uncoalesced memory accesses, scratch-pad memory bank conflicts, and control flow divergence, which must be accurately modeled but represent challenges to the analytical performance models. The proposed model captures full system complexity and shows high accuracy in predicting the performance trends of different optimized kernel implementations. We also describe our approach to extracting the performance model automatically from a kernel code.

References

[1]
ATI Stream Computing. http://developer.amd.com/gpu-assets/Stream-Computing-Overview.pdf.
[2]
M. Baskaran and R. Bordawekar. Optimizing Sparse Matrix-Vector multiplication on GPUs. IBM Research Report, December 2008.
[3]
M. Clement and M. Quinn. Analytical Performance Prediction on Multicomputers. In ACM/IEEE Conference on Supercomputing, November 1993.
[4]
R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck. Efficiently Computing Static Single Assignment Form and the Control Dependence Graph. ACM Trans. Program. Lang. Syst., pages 451--490, 1991.
[5]
T. Davis. University of Florida Sparse Matrix Collection. http://www.cise.u.edu/research/sparse/matrices/.
[6]
K. Fatahalian, J. Sugerman, and P. Hanrahan. Understanding the Efficiency of GPU Algorithms for Matrix-matrix Multiplication. In Conference on Graphics Hardware, August 2004.
[7]
J. Ferrante, K. J. Ottenstein, and J. D. Warren. The Program Dependence Graph and its Use in Optimization. ACM Trans. Program. Lang. Syst., pages 319--349, 1987.
[8]
N. K. Govindaraju, S. Larsen, J. Gray, and D. Manocha. A Memory Model for Scientific Algorithms on Graphics Processors. In ACM/IEEE Conference on Supercomputing, November 2006.
[9]
N. K. Govindaraju, B. Lloyd, Y. Dotsenko, B. Smith, and J. Manferdelli. High performance discrete Fourier transforms on graphics processors. In ACM/IEEE Conference on Supercomputing, November 2008.
[10]
S. Hong and H. Kim. An model for a GPU Architecture with Memory-level and Thread-level Parallelism Awareness. In International Symposium on Computer Architecture, June 2009.
[11]
C. Jiang and M. Snir. Automatic Tuning Matrix Multiplication Performance on Graphics Hardware. In International Conference on Parallel Architectures and Compilation Techniques, September 2005.
[12]
W. Liu, W. Muller-Wittig, and B. Schmidt. Performance Predictions for General-Purpose Computation on GPUs. In International Conference on Parallel Processing, September 2007.
[13]
M. Kongstad. An Implementation of Global Value Numbering in the GNU Compiler Collection with Performance Measurements, October 2004.
[14]
G. Marin and J. Mellor-Crummey. Cross-architecture Performance Predictions for Scientific Applications Using Parameterized Models. In International conference on Measurement and modeling of computer systems, June 2004.
[15]
A. Munshi. The OpenCL Specification. http://www.khronos.org/registry/cl/specs/opencl-1.0.29.pdf.
[16]
J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable Parallel Programming with CUDA. Queue, 6(2):40--53, 2008.
[17]
NVIDIA Corporation. NVIDIA CUDA Programming Guide: Version 1.0, June 2007.
[18]
Z. Pan and R. Eigenmann. Fast and Effective Orchestration of Compiler Optimizations for Automatic Performance Tuning. In International Symposium on Code Generation and Optimization, March 2006.
[19]
S. Ryoo, C. I. Rodrigues, S. S. Stone, S. S. Baghsorkhi, S. Ueng, J. A. Stratton, and W. W. Hwu. Program Optimization Space Pruning for a Multithreaded GPU. In International Symposium on Code Generation and Optimization, April 2008.
[20]
D. Schaa and D. Kaeli. Exploring the Multi GPU Design Space. In International Symposium on Parallel and Distributed Processing, October 2009.
[21]
D. B. Whalley. Tuning High Performance Kernels through Empirical Compilation. In International Conference on Parallel Processing, June 2005.

Cited By

View all

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM SIGPLAN Notices
ACM SIGPLAN Notices  Volume 45, Issue 5
PPoPP '10
May 2010
346 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/1837853
Issue’s Table of Contents
  • cover image ACM Conferences
    PPoPP '10: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
    January 2010
    372 pages
    ISBN:9781605588773
    DOI:10.1145/1693453
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 January 2010
Published in SIGPLAN Volume 45, Issue 5

Check for updates

Author Tags

  1. analytical model
  2. gpu
  3. parallel programming
  4. performance estimation

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)149
  • Downloads (Last 6 weeks)17
Reflects downloads up to 28 Dec 2024

Other Metrics

Citations

Cited By

View all

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media