DOI: 10.1145/1188455.1188549
Article

A memory model for scientific algorithms on graphics processors

Published: 11 November 2006

Abstract

We present a memory model to analyze and improve the performance of scientific algorithms on graphics processing units (GPUs). Our memory model is based on texturing hardware, which uses a 2D block-based array representation to perform the underlying computations. We incorporate many characteristics of GPU architectures, including smaller cache sizes and 2D block representations, and use the 3C's model to analyze cache misses. Moreover, we present techniques to improve the performance of nested loops on GPUs. To demonstrate the effectiveness of our model, we highlight its performance on three memory-intensive scientific applications: sorting, fast Fourier transform, and dense matrix multiplication. In practice, our cache-efficient algorithms for these applications achieve memory throughput of 30-50 GB/s on an NVIDIA 7900 GTX GPU. We also compare our results with prior GPU-based and CPU-based implementations on high-end processors, and we achieve a 2-5x performance improvement.
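The abstract does not spell out the nested-loop techniques, so the following is only a generic illustration of the underlying idea, not the authors' GPU algorithm: a minimal CPU-side Python sketch of loop tiling (blocking) for dense matrix multiplication. The function names and the `tile` parameter are hypothetical and chosen for illustration. Tiling restricts each inner pass to small 2D blocks of the operands, which is the standard way to reduce capacity misses when the cache is small.

```python
def matmul_naive(a, b):
    """Reference triple loop: c = a * b for lists of lists."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, tile=4):
    """Same product, but iterating over tile x tile blocks.

    Each (i0, j0, k0) step touches only one block of a, b and c,
    so the working set stays small enough to be reused from cache.
    `tile` is an illustrative parameter, not a value from the paper.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, p, tile):
            for k0 in range(0, m, tile):
                # Work inside one block; min() handles ragged edges.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, p)):
                        acc = c[i][j]
                        for k in range(k0, min(k0 + tile, m)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c
```

Both functions accumulate each output element in the same order of `k`, so they return identical results; only the memory access pattern differs. On a real GPU the block shape would be chosen to match the texture cache's 2D block layout, which this sketch does not attempt to model.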


Published In

SC '06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing
November 2006
746 pages
ISBN:0769527000
DOI:10.1145/1188455

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. graphics processors
  2. memory model
  3. scientific algorithms

Qualifiers

  • Article

Conference

SC '06
Acceptance Rates

SC '06 Paper Acceptance Rate 54 of 239 submissions, 23%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
