DOI: 10.1145/3123939.3124534

Beyond the socket: NUMA-aware GPUs

Published: 14 October 2017

Abstract

GPUs achieve high throughput and power efficiency by employing many small single instruction multiple thread (SIMT) cores. To minimize scheduling logic and performance variance, they utilize a uniform memory system and leverage strong data parallelism exposed via the programming model. With Moore's law slowing, for GPUs to continue scaling performance (which largely depends on SIMT core count) they are likely to embrace multi-socket designs where transistors are more readily available. However, when moving to such designs, maintaining the illusion of a uniform memory system is increasingly difficult. In this work we investigate multi-socket non-uniform memory access (NUMA) GPU designs and show that significant changes are needed to both the GPU interconnect and cache architectures to achieve performance scalability. We show that application phase effects can be exploited, allowing GPU sockets to dynamically optimize their individual interconnect and cache policies and minimize the impact of NUMA effects. Our NUMA-aware GPU outperforms a single GPU by 1.5×, 2.3×, and 3.2× while achieving 89%, 84%, and 76% of theoretical application scalability in 2-, 4-, and 8-socket designs, respectively. Implementable today, NUMA-aware multi-socket GPUs may be a promising candidate for scaling GPU performance beyond a single socket.
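
The key mechanism described in the abstract is that each GPU socket observes its own traffic during an application phase and retunes its interconnect and cache policies on the fly, rather than relying on static, symmetric settings. As a rough illustration of that style of mechanism only, and not the paper's actual design, the Python sketch below rebalances the lanes of a bidirectional inter-socket link toward whichever direction saturated during the last epoch; the lane counts, the 2x imbalance threshold, and the counter names are all hypothetical.

    from dataclasses import dataclass

    @dataclass
    class LinkState:
        # Hypothetical per-link lane budget; not taken from the paper.
        total_lanes: int = 16
        tx_lanes: int = 8   # lanes currently assigned to outbound (egress) traffic
        rx_lanes: int = 8   # lanes currently assigned to inbound (ingress) traffic

    def rebalance(link: LinkState, tx_bytes: int, rx_bytes: int, min_lanes: int = 2) -> None:
        """Epoch-boundary policy: shift one lane toward the busier direction.

        tx_bytes / rx_bytes stand in for per-epoch traffic counters a socket
        could maintain; the 2x threshold and one-lane step are illustrative only.
        """
        if tx_bytes > 2 * rx_bytes and link.rx_lanes > min_lanes:
            link.rx_lanes -= 1
            link.tx_lanes += 1
        elif rx_bytes > 2 * tx_bytes and link.tx_lanes > min_lanes:
            link.tx_lanes -= 1
            link.rx_lanes += 1
        # Otherwise keep the current split: asymmetric phases (e.g. read-heavy
        # kernels pulling remote data) settle into an asymmetric allocation,
        # while balanced phases drift back toward an even split.

    # Example: one epoch of a read-heavy phase (far more inbound than outbound bytes).
    link = LinkState()
    rebalance(link, tx_bytes=1_000_000, rx_bytes=64_000_000)
    print(link.tx_lanes, link.rx_lanes)   # 7 9

Run once per epoch on every socket, a policy of this shape adapts to phase changes without global coordination, which is the property the abstract attributes to the NUMA-aware design.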

Information & Contributors

Information

Published In

cover image ACM Conferences
MICRO-50 '17: Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture
October 2017, 850 pages
ISBN: 9781450349529
DOI: 10.1145/3123939

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. NUMA systems
  2. graphics processing units
  3. multi-socket GPUs

Qualifiers

  • Research-article

Funding Sources

  • Ministry of Economy and Competitiveness of Spain

Conference

MICRO-50

Acceptance Rates

Overall acceptance rate: 484 of 2,242 submissions (22%)
