DOI: 10.1145/3694715.3695968

Tiered Memory Management: Access Latency is the Key!

Published: 15 November 2024

Abstract

The emergence of tiered memory architectures has led to renewed interest in memory management. Recent works on tiered memory management innovate on mechanisms for access tracking, page migration, and dynamic page size determination; however, they all use the same page placement algorithm---packing the hottest pages into the default tier (the tier with the lowest hardware-specified memory access latency). This makes an implicit assumption that, despite serving the hottest pages, the access latency of the default tier remains lower than that of the alternate tiers. This assumption is far from reality: it is well known in the computer architecture community that, in the realistic case of multiple in-flight requests, memory access latency can be significantly larger than the hardware-specified latency. We show that, even under moderate loads, the default tier's access latency can inflate to 2.5× the latency of the alternate tiers, and that, under this regime, the performance of state-of-the-art memory tiering systems can be 2.3× worse than optimal.
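The loaded access latency above is not a fixed hardware parameter; it grows as requests queue up in the memory subsystem. One standard way to estimate it online is to apply Little's law to a pair of per-tier hardware counters: accumulated request occupancy and the number of requests inserted over a sampling interval (the uncore performance-monitoring documents cited below, e.g. [17, 18, 20], expose counters of this kind). The following is a minimal sketch of that calculation, assuming hypothetical read_occupancy_counter()/read_inserts_counter() helpers in place of any real PMU interface:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for reading per-tier hardware counters:
 * (a) accumulated occupancy (outstanding-request slot-cycles) and
 * (b) number of requests inserted over the sampling interval.
 * A real system would program and read uncore PMU registers here;
 * these stubs just return fixed illustrative values. */
static uint64_t read_occupancy_counter(int tier) { return tier == 0 ? 8000000 : 3000000; }
static uint64_t read_inserts_counter(int tier)   { return tier == 0 ? 50000   : 40000;   }

/* Little's law: average latency (cycles) = accumulated occupancy / arrivals. */
static double loaded_latency(uint64_t occupancy, uint64_t inserts)
{
    return inserts ? (double)occupancy / (double)inserts : 0.0;
}

int main(void)
{
    for (int tier = 0; tier < 2; tier++) {            /* 0 = default, 1 = alternate */
        uint64_t occ = read_occupancy_counter(tier);  /* counter delta over one interval */
        uint64_t ins = read_inserts_counter(tier);
        printf("tier %d: ~%.0f cycles average loaded latency\n",
               tier, loaded_latency(occ, ins));
    }
    return 0;
}
```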
Colloid is a memory management mechanism that embodies the principle of balancing access latencies---page placement across tiers should be performed so as to balance their average (loaded) access latencies. To realize this principle, Colloid innovates on both per-tier memory access latency measurement mechanisms and page placement algorithms that decide the set of pages to place in each tier. We integrate Colloid with three state-of-the-art memory tiering systems---HeMem, TPP, and MEMTIS. Evaluation across a wide variety of workloads demonstrates that Colloid consistently enables the underlying system to achieve near-optimal performance.
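To make the balancing principle concrete, consider a simplified control loop that periodically compares the two tiers' measured loaded latencies and shifts a batch of pages toward whichever tier currently has the lower latency, until the two roughly equalize. The sketch below is a hypothetical illustration of that feedback loop, not Colloid's actual placement algorithm; migrate_pages(), the 5% tolerance, and the fixed batch size are assumptions for exposition only:

```c
#include <stdio.h>

/* Hypothetical placeholder: move `nr` pages from one tier to the other.
 * A real tiering system (e.g., HeMem, TPP, or MEMTIS) would pick specific
 * pages based on its hotness tracking and issue actual page migrations. */
static void migrate_pages(int from_tier, int to_tier, long nr)
{
    printf("migrate %ld pages: tier %d -> tier %d\n", nr, from_tier, to_tier);
}

/* One iteration of a latency-balancing placement step: shift load away from
 * whichever tier currently shows the higher average loaded access latency. */
static void balance_step(double lat_default, double lat_alternate, long batch)
{
    const double tol = 0.05;            /* treat latencies within 5% as balanced */

    if (lat_default > lat_alternate * (1.0 + tol))
        migrate_pages(0, 1, batch);     /* default tier overloaded: move pages out */
    else if (lat_alternate > lat_default * (1.0 + tol))
        migrate_pages(1, 0, batch);     /* alternate tier overloaded: move pages back */
    /* otherwise: latencies are approximately balanced; leave placement unchanged */
}

int main(void)
{
    /* Example: the default tier's loaded latency has inflated well beyond the
     * alternate tier's, so this step shifts a batch of pages to the alternate tier. */
    balance_step(/* lat_default */ 500.0, /* lat_alternate */ 200.0, /* batch */ 1024);
    return 0;
}
```

Run repeatedly, such a loop drives the tiers toward equal loaded latencies, which is the placement objective stated above.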

References

[1]
Neha Agarwal and Thomas F. Wenisch. 2017. Thermostat: Application-Transparent Page Management for Two-Tiered Main Memory. In ACM ASPLOS.
[2]
Saksham Agarwal, Rachit Agarwal, Behnam Montazeri, Masoud Moshref, Khaled Elmeleegy, Luigi Rizzo, Marc Asher de Kruijf, Gautam Kumar, Sylvia Ratnasamy, David Culler, and Amin Vahdat. 2022. Understanding Host Interconnect Congestion. In ACM HotNets.
[3]
Saksham Agarwal, Arvind Krishnamurthy, and Rachit Agarwal. 2023. Host Congestion Control. In ACM SIGCOMM.
[4]
AMD. 2024. Performance Monitor Counters for AMD Family 1Ah Model 00h-0Fh Processors. https://www.amd.com/content/dam/amd/en/documents/epyc-technical-docs/programmer-references/58550-0.01.pdf.
[5]
Scott Beamer. 2023. GAPBS PageRank Implementation. https://github.com/sbeamer/gapbs/blob/master/src/pr.cc.
[6]
William Bolosky, Robert Fitzgerald, and Michael Scott. 1989. Simple But Effective Techniques for NUMA Memory Management. In ACM SOSP.
[7]
Timothy Brecht. 1993. On The Importance of Parallel Application Placement in NUMA Multiprocessors. In USENIX SEDMS.
[8]
Shuang Chen, Christina Delimitrou, and José F Martínez. 2019. PARTIES: QoS-Aware Resource Partitioning for Multiple Interactive Services. In ACM ASPLOS.
[9]
Chiachen Chou, Aamer Jaleel, and Moinuddin Qureshi. 2017. BATMAN: Techniques for Maximizing System Bandwidth of Memory Systems with Stacked-DRAM. In ACM MEMSYS.
[10]
Debendra Das Sharma, Robert Blankenship, and Daniel Berger. 2024. An Introduction to the Compute Express Link (CXL) Interconnect. In ACM CSUR.
[11]
Mohammad Dashti, Alexandra Fedorova, Justin Funston, Fabien Gaud, Renaud Lachaize, Baptiste Lepers, Vivien Quema, and Mark Roth. 2013. Traffic Management: A Holistic Approach to Memory Placement on NUMA Systems. In ACM ASPLOS.
[12]
Padmapriya Duraisamy, Wei Xu, Scott Hare, Ravi Rajwar, David Culler, Zhiyi Xu, Jianing Fan, Christopher Kennelly, Bill McCloskey, Danijela Mijailovic, Brian Morris, Chiranjit Mukherjee, Jingliang Ren, Greg Thelen, Paul Turner, Carlos Villavieja, Parthasarathy Ranganathan, and Amin Vahdat. 2023. Towards an Adaptable Systems Architecture for Memory Tiering at Warehouse-Scale. In ACM ASPLOS.
[13]
Eiman Ebrahimi, Chang Joo Lee, Onur Mutlu, and Yale N Patt. 2010. Fairness via Source Throttling: A Configurable and High-Performance Fairness Substrate for Multi-Core Memory Systems. In ACM ASPLOS.
[14]
Saugata Ghose, Hyodong Lee, and Jose F Martinez. 2013. Improving Memory Scheduling via Processor-Side Load Criticality Information. In IEEE/ACM ISCA.
[15]
Nagendra Gulur, Mahesh Mehendale, R Manikantan, and R Govindarajan. 2014. Bi-Modal DRAM Cache: Improving Hit Rate, Hit Latency and Bandwidth. In IEEE/ACM MICRO.
[16]
Ying Huang. 2022. [PATCH -V4 0/3] Memory Tiering: Hot Page Selection. https://lwn.net/ml/linux-kernel/[email protected]/.
[17]
Intel. 2017. Intel Xeon Processor Scalable Memory Family Uncore Performance Monitoring. https://kib.kiev.ua/x86docs/Intel/PerfMon/336274-001.pdf.
[18]
Intel. 2021. 3rd Gen Intel Xeon Processor Scalable Family, Codename Ice Lake, Uncore Performance Monitoring. https://cdrdv2-public.intel.com/679093/639778%20ICX%20UPG%20v1.pdf.
[19]
Intel. 2024. Intel Xeon CPU Max Series. https://www.intel.com/content/www/us/en/products/details/processors/xeon/max-series.html.
[20]
Intel. 2024. Sapphire Rapids (SPR) Uncore Events. https://github.com/intel/perfmon/blob/main/SPR/events/sapphirerapids_uncore_experimental.json.
[21]
Ravi Iyer, Li Zhao, Fei Guo, Ramesh Illikkal, Srihari Makineni, Don Newell, Yan Solihin, Lisa Hsu, and Steve Reinhardt. 2007. QoS Policies and Architecture for Cache/Memory in CMP Platforms. In ACM SIGMETRICS.
[22]
Djordje Jevdjic, Gabriel H Loh, Cansu Kaynak, and Babak Falsafi. 2014. Unison Cache: A Scalable and Effective Die-Stacked DRAM Cache. In IEEE/ACM ISCA.
[23]
Sudarsun Kannan, Ada Gavrilovska, Vishal Gupta, and Karsten Schwan. 2017. HeteroOS: OS Design for Heterogeneous Memory Management in Datacenter. In IEEE/ACM ISCA.
[24]
Jonghyeon Kim, Wonkyo Choe, and Jeongseob Ahn. 2021. Exploring the Design Space of Page Management for Multi-Tiered Memory Systems. In USENIX ATC.
[25]
Yoongu Kim, Dongsu Han, Onur Mutlu, and Mor Harchol-Balter. 2010. ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers. In IEEE HPCA.
[26]
Yoongu Kim, Michael Papamichael, Onur Mutlu, and Mor Harchol-Balter. 2010. Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. In IEEE/ACM ISCA.
[27]
Aneesh KV Kumar. 2022. [PATCH v7 00/12] mm/demotion: Memory Tiers and Demotion. https://lwn.net/ml/linux-kernel/[email protected]/.
[28]
Chang Joo Lee, Veynu Narasiman, Eiman Ebrahimi, Onur Mutlu, and Yale N. Patt. 2010. DRAM-Aware Last-Level Cache Writeback: Reducing Write-Caused Interference in Memory Systems. https://utw10235.utweb.utexas.edu/people/cjlee/TR-HPS-2010-002.pdf.
[29]
Taehyung Lee, Sumit Kumar Monga, Changwoo Min, and Young Ik Eom. 2023. MEMTIS: Efficient Memory Tiering with Dynamic Page Classification and Page Size Determination. In ACM SOSP.
[30]
Baptiste Lepers, Vivien Quema, and Alexandra Fedorova. 2015. Thread and Memory Placement on NUMA systems: Asymmetry Matters. In USENIX ATC.
[31]
Baptiste Lepers and Willy Zwaenepoel. 2023. Johnny Cache: the End of DRAM Cache Conflicts (in Tiered Main Memory Systems). In USENIX OSDI.
[32]
Huaicheng Li, Daniel S Berger, Lisa Hsu, Daniel Ernst, Pantea Zardoshti, Stanko Novakovic, Monish Shah, Samir Rajadnya, Scott Lee, Ishwar Agarwal, Mark D Hill, Marcus Fontoura, and Ricardo Bianchini. 2023. Pond: CXL-Based Memory Pooling Systems for Cloud Platforms. In ACM ASPLOS.
[33]
Gabriel H Loh and Mark D Hill. 2011. Efficiently Enabling Conventional Block Sizes for Very Large Die-Stacked DRAM Caches. In IEEE/ACM MICRO.
[34]
Zoltan Majo and Thomas R Gross. 2011. Memory System Performance in a NUMA Multicore Multiprocessor. In ACM SYSTOR.
[35]
Hasan Al Maruf, Hao Wang, Abhishek Dhanotia, Johannes Weiner, Niket Agarwal, Pallab Bhattacharya, Chris Petersen, Mosharaf Chowdhury, Shobhit Kanaujia, and Prakash Chauhan. 2023. TPP: Transparent Page Placement for CXL-Enabled Tiered-Memory. In ACM ASPLOS.
[36]
John McCalpin. 2023. The Evolution of Single-Core Bandwidth in Multicore Systems. https://sites.utexas.edu/jdm4372/2023/12/19/the-evolution-of-single-core-bandwidth-in-multicore-systems-update/.
[37]
John D McCalpin. 2023. Bandwidth Limits in the Intel Xeon Max (Sapphire Rapids with HBM) Processors. In ISC High Performance.
[38]
Timothy Prickett Morgan. 2020. CXL And Gen-Z Iron Out A Coherent Interconnect Strategy. https://www.nextplatform.com/2020/04/03/cxl-and-gen-z-iron-out-a-coherent-interconnect-strategy/.
[39]
Thomas Moscibroda and Onur Mutlu. 2008. Distributed Order Scheduling and its Application to Multi-Core DRAM Controllers. In ACM PODC.
[40]
Sai Prashanth Muralidhara, Lavanya Subramanian, Onur Mutlu, Mahmut Kandemir, and Thomas Moscibroda. 2011. Reducing Memory Interference in Multicore Systems via Application-Aware Memory Channel Partitioning. In IEEE/ACM ISCA.
[41]
Onur Mutlu. 2013. Memory Scaling: A Systems Architecture Perspective. In IEEE IMW.
[42]
Onur Mutlu and Thomas Moscibroda. 2007. Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems. In USENIX Security.
[43]
Onur Mutlu and Thomas Moscibroda. 2007. Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors. In IEEE/ACM MICRO.
[44]
Onur Mutlu and Thomas Moscibroda. 2008. Parallelism-Aware Batch Scheduling: Enhancing Both Performance and Fairness of Shared DRAM Systems. In IEEE/ACM ISCA.
[45]
Kyle J Nesbit, Nidhi Aggarwal, James Laudon, and James E Smith. 2006. Fair Queuing Memory Systems. In IEEE/ACM MICRO.
[46]
Dylan Patel, Jeff Koch, Tanj Bennett, and Wega Chu. 2024. The Memory Wall: Past, Present, and Future of DRAM. https://www.semianalysis.com/p/the-memory-wall.
[47]
Moin Qureshi and Gabriel H Loh. 2012. Fundamental Latency Trade-Offs in Architecting DRAM Caches: Outperforming Impractical SRAM-Tags With a Simple and Practical Design. In IEEE/ACM MICRO.
[48]
Amanda Raybuck, Tim Stamler, Wei Zhang, Mattan Erez, and Simon Peter. 2021. HeMem: Scalable Tiered Memory Management for Big Data Applications and Real NVM. In ACM SOSP.
[49]
Shigeru Shiratake. 2020. Scaling and Performance Challenges of Future DRAM. In IEEE IMW.
[50]
Jaewoong Sim, Gabriel H Loh, Hyesoon Kim, Mike O'Connor, and Mithuna Thottethodi. 2012. A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch. In IEEE/ACM MICRO.
[51]
Lavanya Subramanian, Donghyuk Lee, Vivek Seshadri, Harsha Rastogi, and Onur Mutlu. 2014. The Blacklisting Memory Scheduler: Achieving High Performance and Fairness at Low Cost. In IEEE ICCD.
[52]
Lavanya Subramanian, Vivek Seshadri, Arnab Ghosh, Samira Khan, and Onur Mutlu. 2015. The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory. In IEEE/ACM MICRO.
[53]
Lavanya Subramanian, Vivek Seshadri, Yoongu Kim, Ben Jaiyen, and Onur Mutlu. 2013. MISE: Providing Performance Predictability and Improving Fairness in Shared Main Memory Systems. In IEEE HPCA.
[54]
Yan Sun, Yifan Yuan, Zeduo Yu, Reese Kuper, Chihun Song, Jinghan Huang, Houxiang Ji, Siddharth Agarwal, Jiaqi Lou, Ipoom Jeong, Ren Wang, Jung Ho Ahn, Tianyin Xu, and Nam Sung Kim. 2023. Demystifying CXL Memory with Genuine CXL-Ready Systems and Devices. In IEEE/ACM MICRO.
[55]
James Tuck, Luis Ceze, and Josep Torrellas. 2006. Scalable Cache Miss Handling for High Memory-Level Parallelism. In IEEE/ACM MICRO.
[56]
Ben Verghese, Scott Devine, Anoop Gupta, and Mendel Rosenblum. 1996. Operating System Support for Improving Data Locality on CC-NUMA Compute Servers. In ACM ASPLOS.
[57]
Midhul Vuppalapati and Rachit Agarwal. 2024. Tiered Memory Management: Access Latency is the Key! (Extended Version). https://github.com/host-architecture/colloid.
[58]
Midhul Vuppalapati, Saksham Agarwal, Henry N Schuh, Baris Kasikci, Arvind Krishnamurthy, and Rachit Agarwal. 2024. Understanding the Host Network. In ACM SIGCOMM.
[59]
Hao Wang, Chang-Jae Park, Gyung-su Byun, Jung Ho Ahn, and Nam Sung Kim. 2015. Alloy: Parallel-Serial Memory Channel Architecture for Single-Chip Heterogeneous Processor Systems. In IEEE HPCA.
[60]
Lingfeng Xiang, Zhen Lin, Weishu Deng, Hui Lu, Jia Rao, Yifan Yuan, and Ren Wang. 2024. Nomad: Non-Exclusive Memory Tiering via Transactional Page Migration. In USENIX OSDI.
[61]
Zi Yan, Daniel Lustig, David Nellans, and Abhishek Bhattacharjee. 2019. Nimble Page Management for Tiered Memory Systems. In ACM ASPLOS.
[62]
Yuhong Zhong, Daniel S Berger, Carl Waldspurger, Ishwar Agarwal, Rajat Agarwal, Frank Hady, Karthik Kumar, Mark D Hill, Mosharaf Chowdhury, and Asaf Cidon. 2024. Managing Memory Tiers with CXL in Virtualized Environments. In USENIX OSDI.

Published In

SOSP '24: Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles
November 2024
765 pages
ISBN: 9798400712517
DOI: 10.1145/3694715

In-Cooperation

  • USENIX

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 November 2024

Author Tags

  1. operating systems
  2. tiered memory management

Qualifiers

  • Research-article

Conference

SOSP '24

Acceptance Rates

SOSP '24 paper acceptance rate: 43 of 245 submissions (18%)
Overall acceptance rate: 174 of 961 submissions (18%)
