DOI: 10.1145/3466752.3480088
research-article

GPS: A Global Publish-Subscribe Model for Multi-GPU Memory Management

Published: 17 October 2021

Abstract

Suboptimal management of memory and bandwidth is one of the primary causes of low performance on systems comprising multiple GPUs. Existing memory management solutions like Unified Memory (UM) offer simplified programming but come at the cost of performance: applications can even exhibit slowdown with increasing GPU count due to their inability to leverage system resources effectively. To solve this challenge, we propose GPS, a HW/SW multi-GPU memory management technique that efficiently orchestrates inter-GPU communication using proactive data transfers. GPS offers the programmability advantage of multi-GPU shared memory with the performance of GPU-local memory. To enable this, GPS automatically tracks the data accesses performed by each GPU, maintains duplicate physical replicas of shared regions in each GPU’s local memory, and pushes updates to the replicas in all consumer GPUs. GPS is compatible with the existing NVIDIA GPU memory consistency model but takes full advantage of its relaxed nature to deliver high performance. We evaluate GPS in the context of a 4-GPU system with varying interconnects and show that GPS achieves an average speedup of 3.0× relative to the performance of a single GPU, outperforming the next best available multi-GPU memory management technique by 2.3× on average. In a 16-GPU system, using a future PCIe 6.0 interconnect, we demonstrate a 7.9× average strong scaling speedup over single-GPU performance, capturing 80% of the available opportunity.
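
The publish-subscribe idea in the abstract (track which GPUs access a region, keep a replica in each GPU's local memory, push updates to consumers) can be illustrated with a small software toy. This is only a sketch under stated assumptions, not the paper's HW/SW design: the class name `ToyGPS`, its methods, the per-address granularity, and the "subscribe on first load" policy are all hypothetical choices made for illustration.

```python
from collections import defaultdict


class ToyGPS:
    """Toy publish-subscribe model of replicated multi-GPU memory.

    Hypothetical illustration only; it does not model the real mechanism.
    """

    def __init__(self, num_gpus):
        # One private replica (address -> value) per GPU, standing in for local memory.
        self.replicas = [dict() for _ in range(num_gpus)]
        # address -> set of GPU ids that have read it (the "subscribers").
        self.subscribers = defaultdict(set)
        # Number of values pushed across the imaginary interconnect.
        self.pushes = 0

    def load(self, gpu, addr):
        # Assumption: the first load subscribes the GPU to this address.
        if gpu not in self.subscribers[addr]:
            self.subscribers[addr].add(gpu)
            # First touch: copy the current value from any GPU that already holds it.
            if addr not in self.replicas[gpu]:
                for other in self.replicas:
                    if addr in other:
                        self.replicas[gpu][addr] = other[addr]
                        break
        # Every later load is served from the GPU's own replica.
        return self.replicas[gpu].get(addr, 0)

    def store(self, gpu, addr, value):
        # Write locally, then publish the update to every other subscribed GPU.
        self.replicas[gpu][addr] = value
        for sub in self.subscribers[addr]:
            if sub != gpu:
                self.replicas[sub][addr] = value
                self.pushes += 1


mem = ToyGPS(num_gpus=4)
mem.store(0, 0x100, 7)   # no subscribers yet, so nothing is pushed
mem.load(1, 0x100)       # GPU 1 subscribes and fetches the value once
mem.store(0, 0x100, 9)   # the update is pushed to GPU 1 only
```

In this toy, GPUs 2 and 3 never read the address, so they receive no traffic, while GPU 1 reads the updated value from its own replica without a remote access.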

Published In

MICRO '21: MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture
October 2021
1322 pages
ISBN: 9781450385572
DOI: 10.1145/3466752
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. GPGPU
  2. GPU memory management
  3. communication
  4. heterogeneous systems
  5. multi-GPU
  6. strong scaling

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

MICRO '21
Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%
