RLPTO: A Reinforcement Learning-Based Performance-Time Optimized Task and Resource Scheduling Mechanism for Distributed Machine Learning

Published: 20 September 2023 in IEEE Transactions on Parallel and Distributed Systems, Volume 34, Issue 12 (IEEE Press)

Abstract

With the wide application of deep learning, the amount of data required to train deep learning models keeps growing, increasing both training time and the demand for computing resources. Improving the throughput of a distributed learning system therefore requires both task scheduling and resource scheduling. This article proposes combining ARIMA and GRU models to predict future task volume. For task scheduling, multi-priority task queues divide tasks into separate queues according to their priorities, ensuring that high-priority tasks are completed first. For resource scheduling, a reinforcement learning method manages the limited computing resources; its reward function is constructed from the resources occupied by a task, the training time, and the accuracy of the model. When a distributed learning model approaches convergence, its computing resources are gradually reduced so that they can be reallocated to other learning tasks. Experimental results demonstrate that RLPTO tends to use more computing nodes when facing tasks with large data scale and has good scalability. The distributed learning system reward experiment shows that RLPTO enables the computing cluster to obtain the largest reward.
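To make the two scheduling ideas in the abstract concrete, here is a minimal Python sketch, not the authors' RLPTO implementation: a multi-priority task queue that dispatches high-priority learning tasks first, and a reward in the spirit the abstract describes, trading off occupied resources, training time, and model accuracy. All class names, weights, and function signatures are illustrative assumptions.

```python
# Illustrative sketch only; names, weights, and thresholds are assumptions,
# not the paper's actual RLPTO design.
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Task:
    priority: int                             # lower value = higher priority
    name: str = field(compare=False)
    data_size_gb: float = field(compare=False)  # size of the training data

class MultiPriorityQueue:
    """Dispatch high-priority learning tasks before low-priority ones."""
    def __init__(self):
        self._heap = []

    def submit(self, task: Task) -> None:
        heapq.heappush(self._heap, task)

    def next_task(self) -> Task:
        return heapq.heappop(self._heap)

def reward(nodes_used: int, step_time: float, accuracy: float,
           w_res: float = 0.3, w_time: float = 0.3, w_acc: float = 0.4) -> float:
    """Hypothetical reward: penalize occupied nodes and training time,
    reward accuracy, mirroring the trade-off named in the abstract."""
    return w_acc * accuracy - w_res * nodes_used - w_time * step_time

def adjust_allocation(nodes: int, acc_delta: float, eps: float = 1e-3) -> int:
    """When a model nears convergence (accuracy gains flatten below eps),
    release one node so it can be reallocated to queued tasks."""
    return max(1, nodes - 1) if acc_delta < eps else nodes
```

In the paper's design, the trade-off between these terms is learned by the reinforcement learning policy rather than fixed by hand-tuned weights as in this sketch; the sketch only mirrors the structure of the reward and the convergence-driven release of computing resources.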
