DOI: 10.1145/3673038.3673138
research-article
Open access

Yggdrasil: Reducing Network I/O Tax with (CXL-Based) Distributed Shared Memory

Published: 12 August 2024

Abstract

Communication-intensive applications running on hosts with high-speed network hardware place a significant burden on the OS's native socket system. Researchers have devoted considerable effort to optimizing the kernel networking stack and to moving the TCP/IP stack to user space. In this paper, we describe Yggdrasil, a novel socket replacement: a CXL-based, high-performance, user-space socket system. Yggdrasil is fully compatible with Linux sockets, making it a drop-in replacement for existing applications that requires no code modifications. To optimize performance, Yggdrasil uses CXL-based distributed shared memory (DSM) for inter-host communication whenever it is available; when DSM is not accessible, it transparently falls back to Linux sockets. To achieve isolation, Yggdrasil relies on a trusted user-space monitoring daemon that manages control-plane operations such as connection setup and access control. In the data plane, processes communicate with each other in a peer-to-peer model. To bridge the semantic gap between sockets and DSM, we employ several techniques to ensure compatibility and performance, including (1) transparent dynamic fast/slow data path navigation, (2) decentralized CXL memory management, (3) QoS-aware dynamic data polling based on lock-free queues, and (4) semantics-aware memory page migration. Evaluating Yggdrasil on both emulated and real CXL hardware, we show that it outperforms Linux sockets in Memcached throughput by 8.2× and reduces latency by 24× to 320× in a microbenchmark across different message sizes.
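
The fast/slow path selection described in the abstract can be illustrated with a minimal sketch. This is not Yggdrasil's implementation: the `Channel` class, its `dsm_available` flag, and the `slow_send` callback are hypothetical names introduced here, an in-process `deque` stands in for a queue in CXL shared memory, and a plain callback stands in for the kernel socket send path.

```python
from collections import deque
from typing import Callable, Deque


class Channel:
    """Hypothetical per-connection channel; not Yggdrasil's actual code."""

    def __init__(self, dsm_available: bool, slow_send: Callable[[bytes], None]):
        self.dsm_available = dsm_available  # assumed to be decided at connection setup
        self.slow_send = slow_send          # stands in for the kernel socket send path
        self.ring: Deque[bytes] = deque()   # stands in for a queue in shared memory

    def send(self, payload: bytes) -> str:
        """Send payload and report which path carried it."""
        if self.dsm_available:
            self.ring.append(payload)       # fast path: enqueue into shared memory
            return "fast"
        self.slow_send(payload)             # slow path: fall back to the socket
        return "slow"


# Usage: the caller issues the same send() call on either path.
sent_over_socket = []
fast = Channel(dsm_available=True, slow_send=sent_over_socket.append)
slow = Channel(dsm_available=False, slow_send=sent_over_socket.append)
print(fast.send(b"hello"), slow.send(b"hello"))
```

The point of the sketch is that the caller's code is identical on both paths, which is what a drop-in socket replacement would need; how the real system detects DSM availability is not stated in the abstract and is simply assumed here.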

Supplemental Material

PDF File - Appendix: Artifact Description/Artifact Evaluation
This artifact includes a qcow2 image file of a virtual machine (VM) designed to run in an x86_64 environment, together with startup instructions. The image contains the source code, the compiled libnav.so, the executable binaries needed to run the experiments, and shell scripts for testing. These resources are provided to enable straightforward replication of the experiments.

Published In

ICPP '24: Proceedings of the 53rd International Conference on Parallel Processing
August 2024
1279 pages
ISBN:9798400717932
DOI:10.1145/3673038
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. CXL
  2. Disaggregation
  3. Memory Management
  4. Rack-Scale Communication
  5. Socket

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP '24

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%

Article Metrics

  • Total Citations: 0
  • Total Downloads: 266
  • Downloads (Last 12 months): 266
  • Downloads (Last 6 weeks): 132

Reflects downloads up to 16 Oct 2024
