DOI: 10.1145/3688351.3689155

RecTS: A Temporal-Aware Memory System Optimization for Training Deep Learning Recommendation Models

Published: 16 September 2024

Abstract

Deep learning recommendation models (DLRMs) underpin many online services, making DLRM training a critical data-center workload. To meet the memory demands of large-scale models, SSDs have become a necessary tier of the memory system. However, the irregular access patterns of embedding tables in DLRM training cause poor spatial locality and low cache hit rates, and the mismatched access granularity between DRAM and SSD leads to write amplification that shortens SSD lifespan. To address these challenges, we propose RecTS, a temporal-aware memory system optimization for training deep learning recommendation models. RecTS combines a novel replacement policy with vector grouping to achieve a high cache hit rate and improve spatial locality, which in turn reduces write amplification. Both mechanisms exploit a key property of DLRM training: the access pattern of embedding tables is spatially irregular but known in advance. Leveraging this a priori knowledge, the system dynamically gathers evicted embedding vectors whose next accesses are close in time into the same I/O chunk, enhancing spatial locality. We evaluated RecTS on both CPU-only and CPU-GPU architectures, which differ in whether the neural network (MLP) computation runs on the CPU or the GPU. Across all experimental settings, the replacement policy and vector grouping raise the cache hit rate from 40-85% to over 99%, speeding up training by 1.30-5.55×. Moreover, SSD write volume is reduced by 3.13-7.30×, significantly extending SSD lifespan.
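The core idea described above — using the known future access schedule both to choose eviction victims and to pack evicted vectors into I/O chunks — can be sketched as follows. This is a minimal illustration under assumed names and data structures, not the paper's implementation; `pick_victim`, `group_evictions_by_next_access`, and the chunk size are hypothetical.

```python
# Sketch: temporal-aware eviction and grouping for embedding vectors,
# assuming each vector's next access time is known a priori from the
# training trace (as it is in DLRM training).

def pick_victim(cached_ids, next_access):
    """Belady-style choice: evict the cached vector whose next access
    lies farthest in the future (never-accessed-again evicts first)."""
    return max(cached_ids, key=lambda vid: next_access.get(vid, float("inf")))

def group_evictions_by_next_access(evicted_ids, next_access, chunk_size):
    """Pack evicted vectors into I/O chunks so that vectors whose next
    accesses are close in time share a chunk, improving spatial locality
    when the chunk is later read back from SSD."""
    ordered = sorted(evicted_ids, key=lambda vid: next_access[vid])
    return [ordered[i:i + chunk_size]
            for i in range(0, len(ordered), chunk_size)]

# Example: next-access iteration numbers precomputed from the input batches.
next_access = {"v1": 50, "v2": 7, "v3": 52, "v4": 9, "v5": 51, "v6": 8}
victim = pick_victim(["v1", "v2"], next_access)   # "v1" (accessed latest)
chunks = group_evictions_by_next_access(next_access, next_access, 3)
# chunks -> [["v2", "v6", "v4"], ["v1", "v5", "v3"]]
```

Because each chunk holds vectors that will be needed at roughly the same time, a single SSD read later serves several upcoming accesses, which is what recovers spatial locality and cuts write amplification.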


Published In

SYSTOR '24: Proceedings of the 17th ACM International Systems and Storage Conference
September 2024
212 pages
ISBN:9798400711817
DOI:10.1145/3688351

In-Cooperation

  • Technion: Israel Institute of Technology

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Hierarchical Memory System
  2. Recommendation System
  3. SSD Cache Management

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SYSTOR '24
Acceptance Rates

SYSTOR '24 Paper Acceptance Rate 14 of 38 submissions, 37%;
Overall Acceptance Rate 108 of 323 submissions, 33%

SYSTOR '24
The 17th ACM International Systems and Storage Conference
September 23 - 24, 2024
Virtual, Israel
