PMAlloc: A Holistic Approach to Improving Persistent Memory Allocation

Published: 20 September 2024

Abstract

Persistent memory allocation is a fundamental building block for developing high-performance, in-memory applications. Existing persistent memory allocators suffer from several performance issues. First, their poor heap metadata management may cause repeated cache line flushes and small random accesses in persistent memory. Second, they use static slab segregation, which dramatically increases memory consumption when allocation request sizes change. Third, they are unaware of NUMA effects, leading to remote persistent memory accesses during allocation and deallocation. In this article, we design a novel allocator, named PMAlloc, to solve these issues simultaneously. (1) PMAlloc eliminates cache line reflushes by mapping contiguous data blocks in slabs to interleaved metadata entries stored in different cache lines. (2) It writes small metadata units to a persistent bookkeeping log in a sequential pattern to remove random heap metadata accesses in persistent memory. (3) Instead of using static slab segregation, it supports slab morphing, which allows slabs to be transformed between size classes to significantly improve slab usage. (4) It uses a local-first allocation policy to avoid allocating remote memory blocks. It also supports a two-phase deallocation mechanism, consisting of recording and synchronization, to minimize the number of remote memory accesses during deallocation. PMAlloc is complementary to existing consistency models. Results on six benchmarks demonstrate that PMAlloc improves on the performance of state-of-the-art persistent memory allocators by up to 6.4× for small allocations and 57× for large allocations. PMAlloc with NUMA optimizations brings a 2.9× speedup in multi-socket evaluation and is up to 36× faster than other persistent memory allocators. Using PMAlloc reduces memory usage by up to 57.8%. In addition, we integrate PMAlloc into a persistent FPTree. Compared to the state-of-the-art allocators, PMAlloc improves the performance of this application by up to 3.1×.
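
The interleaved mapping in point (1) can be illustrated with a small sketch. It is not PMAlloc's actual layout: the function names, the round-robin index formula, and the sizes (4 metadata cache lines holding 8 entries each) are all assumptions chosen for illustration. The sketch compares a sequential layout, where neighbouring blocks share a metadata cache line, with an interleaved one, where neighbouring blocks map to different lines, and counts how often two consecutive metadata updates hit the same line.

```python
# Illustrative sketch only: all names and sizes are hypothetical, not from PMAlloc.

def sequential_line(block_idx: int, entries_per_line: int) -> int:
    """Sequential layout: consecutive blocks fill one metadata cache line first."""
    return block_idx // entries_per_line


def interleaved_entry(block_idx: int, num_lines: int, entries_per_line: int) -> tuple[int, int]:
    """Interleaved layout (assumed round-robin): block i goes to line i mod num_lines."""
    capacity = num_lines * entries_per_line
    if not 0 <= block_idx < capacity:
        raise IndexError(f"block index {block_idx} outside 0..{capacity - 1}")
    line = block_idx % num_lines    # consecutive blocks land in different lines
    slot = block_idx // num_lines   # position of the entry inside that line
    return line, slot


def same_line_repeats(lines: list[int]) -> int:
    """Count consecutive metadata updates that touch the same cache line."""
    return sum(1 for prev, cur in zip(lines, lines[1:]) if prev == cur)


NUM_LINES = 4          # assumed number of metadata cache lines per slab
ENTRIES_PER_LINE = 8   # assumed metadata entries per cache line
blocks = range(NUM_LINES * ENTRIES_PER_LINE)

# Allocate blocks 0, 1, 2, ... in order and record which line each update touches.
seq_lines = [sequential_line(b, ENTRIES_PER_LINE) for b in blocks]
int_lines = [interleaved_entry(b, NUM_LINES, ENTRIES_PER_LINE)[0] for b in blocks]

print("sequential repeats: ", same_line_repeats(seq_lines))
print("interleaved repeats:", same_line_repeats(int_lines))
```

Under these assumptions, allocating the 32 blocks in address order produces 28 back-to-back updates to an already-flushed line in the sequential layout and none in the interleaved one, which is the effect the abstract attributes to interleaving. A real allocator would also need to account for deallocation order and for the entry size, which this sketch ignores.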

Published In

ACM Transactions on Computer Systems, Volume 42, Issue 3-4
November 2024
162 pages
EISSN: 1557-7333
DOI: 10.1145/3696660

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 20 September 2024
Online AM: 03 February 2024
Accepted: 21 January 2024
Revised: 08 September 2023
Received: 26 November 2022
Published in TOCS Volume 42, Issue 3-4

Author Tags

  1. Dynamic memory allocation
  2. persistent memory
  3. memory fragmentation
  4. non-uniform memory access

Qualifiers

  • Research-article

Funding Sources

  • National Key Research and Development Program of China
  • National Science Foundation of China
  • Major Projects of Zhejiang Province
  • Program of Zhejiang Province Science and Technology
  • US National Science Foundation
