Research article · Open access

UMH: A Hardware-Based Unified Memory Hierarchy for Systems with Multiple Discrete GPUs

Published: 02 December 2016

Abstract

In this article, we describe how to ease memory management between a Central Processing Unit (CPU) and one or more discrete Graphics Processing Units (GPUs) by architecting a novel hardware-based Unified Memory Hierarchy (UMH). With UMH, a GPU accesses CPU memory only when it does not find the required data in the directories associated with its high-bandwidth memory, or when the NMOESI coherence protocol restricts access to that data. Using UMH with NMOESI improves the performance of a CPU-multiGPU system by at least 1.92× compared to alternative software-based approaches. It also allows the CPU to access GPU-modified data at least 13× faster.
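The core decision the abstract describes, serving a GPU access from local high-bandwidth memory when the directory holds the block in a permitting coherence state, and falling back to CPU memory otherwise, can be sketched as follows. This is a conceptual illustration only, not the paper's implementation: the directory layout, the permission sets, and the function names are assumptions chosen to make the idea concrete. The state names follow NMOESI (the MOESI states plus a Non-coherent state, as used in Multi2Sim).

```python
# Conceptual sketch (not the paper's hardware design) of a UMH-style
# directory check: can a GPU request be served from its own HBM, or
# must it fall back to CPU (system) memory / the coherence protocol?
# States follow NMOESI: N (non-coherent), M, O, E, S, I.

READABLE = {"N", "M", "O", "E", "S"}   # states that permit a local read
WRITABLE = {"N", "M", "E"}             # states that permit a local write

def serve_locally(directory: dict, block_addr: int, is_write: bool) -> bool:
    """Return True if the access can be satisfied from local HBM."""
    state = directory.get(block_addr, "I")   # absent blocks are Invalid
    allowed = WRITABLE if is_write else READABLE
    return state in allowed

# A block held in Shared state can be read locally, but writing it
# requires a coherence action; a miss goes to CPU memory.
directory = {0x1000: "S", 0x2000: "M"}
assert serve_locally(directory, 0x1000, is_write=False)      # local read
assert not serve_locally(directory, 0x1000, is_write=True)   # needs upgrade
assert serve_locally(directory, 0x2000, is_write=True)       # Modified: local
assert not serve_locally(directory, 0x3000, is_write=False)  # miss: to CPU
```

The point of the sketch is the access-ordering the abstract claims: the expensive CPU-memory path is taken only on a directory miss or a state that forbids the requested access, which is what makes the hardware approach faster than software-managed copies.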



      Published In

      ACM Transactions on Architecture and Code Optimization  Volume 13, Issue 4
      December 2016
      648 pages
      ISSN:1544-3566
      EISSN:1544-3973
      DOI:10.1145/3012405

Publisher

Association for Computing Machinery, New York, NY, United States

      Publication History

      Published: 02 December 2016
      Accepted: 01 September 2016
      Revised: 01 August 2016
      Received: 01 May 2016
      Published in TACO Volume 13, Issue 4


      Author Tags

      1. Unified memory architecture
      2. graphics processing units
      3. high performance computing
      4. memory hierarchy

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Funding Sources

      • NRF
      • Spanish MINECO
