Devirtualizing Memory in Heterogeneous Systems

Published: 19 March 2018

Abstract

Accelerators are increasingly recognized as one of the major drivers of future computational growth. For accelerators, shared virtual memory (VM) promises to simplify programming and provide safe data sharing with CPUs. Unfortunately, the overheads of virtual memory, which are already high for general-purpose processors, are even higher for accelerators. Providing accelerators with direct access to physical memory (PM), in contrast, delivers high performance but is both unsafe and more difficult to program. We propose Devirtualized Memory (DVM) to combine the protection of VM with direct access to PM. By allocating memory such that physical and virtual addresses are almost always identical (VA==PA), DVM mostly replaces page-level address translation with faster region-level Devirtualized Access Validation (DAV). Optionally, on read accesses, DAV can be overlapped with the data fetch to hide VM overheads. DVM requires modest OS and IOMMU changes and is transparent to the application. Implemented in Linux 4.10, DVM reduces VM overheads in a graph-processing accelerator to just 1.6% on average. DVM also improves performance by 2.1X over an optimized conventional VM implementation, while consuming 3.9X less dynamic energy for memory management. We further discuss DVM's potential to extend beyond accelerators to CPUs, where it reduces VM overheads to 5% on average, down from 29% for conventional VM.
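The following is a minimal sketch of the idea the abstract describes, not the paper's implementation. It assumes a hypothetical list of identity-mapped regions (`Region`), a simple permission-bit encoding (`READ`, `WRITE`), and a toy dictionary-backed page walker (`make_page_walker`) standing in for conventional translation. None of these names come from the source. An access that falls inside an identity-mapped region only needs a permission check and then uses the virtual address directly as the physical address. Anything else falls back to page-level translation, matching the abstract's "almost always identical" wording.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Illustrative permission bits (an assumption, not taken from the paper).
READ, WRITE = 1, 2


@dataclass(frozen=True)
class Region:
    """A hypothetical identity-mapped region, i.e. VA == PA for every byte."""
    start: int
    length: int
    perms: int

    def contains(self, addr: int) -> bool:
        return self.start <= addr < self.start + self.length


def make_page_walker(page_table: Dict[int, Tuple[int, int]],
                     page_size: int = 4096) -> Callable[[int, int], int]:
    """Build a toy page-level translator: vpn -> (ppn, perms)."""
    def walk(va: int, needed: int) -> int:
        vpn, offset = divmod(va, page_size)
        if vpn not in page_table:
            raise PermissionError(f"unmapped address {va:#x}")
        ppn, perms = page_table[vpn]
        if perms & needed != needed:
            raise PermissionError(f"access denied at {va:#x}")
        return ppn * page_size + offset
    return walk


def access(va: int, needed: int, regions: List[Region],
           translate_page: Callable[[int, int], int]) -> int:
    """Return the physical address for `va`, or raise PermissionError."""
    for region in regions:
        if region.contains(va):
            # Region-level validation: check permissions only.
            if region.perms & needed != needed:
                raise PermissionError(f"access denied at {va:#x}")
            return va  # identity mapped, so no translation is needed
    # Uncommon case: the address is not identity mapped.
    return translate_page(va, needed)
```

In this sketch the common path performs one range lookup and one permission test, while the page walker is only consulted for addresses outside every identity-mapped region. A linear scan over a list is chosen purely for readability; a real design would use a far more compact and faster structure.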

Published In

ACM SIGPLAN Notices  Volume 53, Issue 2
ASPLOS '18
February 2018
809 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/3296957
  • ASPLOS '18: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems
    March 2018
    827 pages
    ISBN:9781450349116
    DOI:10.1145/3173162
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 March 2018
Published in SIGPLAN Volume 53, Issue 2

Author Tags

  1. accelerators
  2. virtual memory

Qualifiers

  • Research-article
