Article

A memory model for scientific algorithms on graphics processors

Authors:

Naga K. Govindaraju,

Dinesh Manocha

SC '06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing

Pages 89 - es

https://doi.org/10.1145/1188455.1188549

Published: 11 November 2006

Abstract

We present a memory model to analyze and improve the performance of scientific algorithms on graphics processing units (GPUs). Our memory model is based on texturing hardware, which uses a 2D block-based array representation to perform the underlying computations. We incorporate many characteristics of GPU architectures, including smaller cache sizes and 2D block representations, and use the 3C's model to analyze the cache misses. Moreover, we present techniques to improve the performance of nested loops on GPUs. To demonstrate the effectiveness of our model, we highlight its performance on three memory-intensive scientific applications: sorting, fast Fourier transform, and dense matrix multiplication. In practice, our cache-efficient algorithms for these applications achieve memory throughput of 30-50 GB/s on an NVIDIA 7900 GTX GPU. We also compare our results with prior GPU-based and CPU-based implementations on high-end processors. In practice, we are able to achieve a 2-5x performance improvement.
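The abstract's central idea, a small cache whose lines are 2D blocks of an array, can be illustrated with a toy simulation. The sketch below is not the authors' model or code: the fully associative LRU policy, the 4x4 block size, the four-block capacity, and all function names are assumptions chosen only to show how the traversal order of a nested loop changes the miss count.

```python
from collections import OrderedDict


def count_block_misses(accesses, block_w, block_h, cache_blocks):
    """Count misses in a hypothetical fully associative LRU cache of 2D blocks."""
    cache = OrderedDict()  # keys are block coordinates, ordered oldest to newest
    misses = 0
    for x, y in accesses:
        block = (x // block_w, y // block_h)
        if block in cache:
            cache.move_to_end(block)  # mark the block as most recently used
        else:
            misses += 1
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)  # evict the least recently used block
    return misses


def scanline_order(n):
    """Plain nested loop: visit an n x n array one full row at a time."""
    return [(x, y) for y in range(n) for x in range(n)]


def tiled_order(n, tile):
    """Blocked nested loop: finish one tile x tile square before moving on."""
    return [(x, y)
            for ty in range(0, n, tile)
            for tx in range(0, n, tile)
            for y in range(ty, min(ty + tile, n))
            for x in range(tx, min(tx + tile, n))]


if __name__ == "__main__":
    n, block, capacity = 64, 4, 4
    for name, order in (("scanline", scanline_order(n)),
                        ("tiled", tiled_order(n, block))):
        print(name, count_block_misses(order, block, block, capacity))
```

With these assumed parameters the array spans 256 blocks, so 256 compulsory misses are unavoidable. The tiled order incurs only those, while the scanline order evicts every block before the next row returns to it. Because the simulated cache is fully associative, the extra misses are capacity misses in 3C's terms.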

Index Terms

A memory model for scientific algorithms on graphics processors
1. Computing methodologies
  1. Computer graphics
    1. Graphics systems and interfaces
      1. Graphics processors
2. Hardware
  1. Hardware validation
  2. Integrated circuits
    1. Semiconductor memory

Published In

SC '06: Proceedings of the 2006 ACM/IEEE conference on Supercomputing

November 2006

746 pages

ISBN:0769527000

DOI:10.1145/1188455

Conference Chair:
Barbara Horner-Miller
Arctic Region Supercomputing Center

Copyright © 2006 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE-CS: Computer Society

Publisher

Association for Computing Machinery

New York, NY, United States

SC '06: International Conference for High Performance Computing, Networking, Storage and Analysis

November 11 - 17, 2006

Tampa, Florida

Acceptance Rates

SC '06 Paper Acceptance Rate 54 of 239 submissions, 23%;

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
