RLPTO: A Reinforcement Learning-Based Performance-Time Optimized Task and Resource Scheduling Mechanism for Distributed Machine Learning

Published: 20 September 2023 in IEEE Transactions on Parallel and Distributed Systems, Volume 34, Issue 12 (IEEE Press)

Abstract

With the wide application of deep learning, the amount of data required to train deep learning models keeps growing, increasing both training time and the demand for computing resources. Improving the throughput of a distributed learning system therefore requires both task scheduling and resource scheduling. This article proposes combining ARIMA and GRU models to predict future task volume. For task scheduling, multi-priority task queues divide tasks into separate queues according to their priorities, ensuring that high-priority tasks are completed first. For resource scheduling, a reinforcement learning method manages the limited computing resources; its reward function is constructed from the resources occupied by a task, the training time, and the accuracy of the model. When a distributed learning model approaches convergence, its computing resources are gradually reduced so that they can be reallocated to other learning tasks. Experimental results demonstrate that RLPTO tends to use more computing nodes when facing tasks with large data scale and has good scalability. The distributed learning system reward experiment shows that RLPTO enables the computing cluster to obtain the largest reward.
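To make the two scheduling ideas in the abstract concrete, here is a minimal Python sketch, not the authors' RLPTO implementation: a multi-priority task queue that dispatches high-priority learning tasks first, and a reward in the spirit the abstract describes, trading off occupied resources, training time, and model accuracy. All class names, weights, and function signatures are illustrative assumptions.

```python
# Illustrative sketch only; names, weights, and thresholds are assumptions,
# not the paper's actual RLPTO design.
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Task:
    priority: int                             # lower value = higher priority
    name: str = field(compare=False)
    data_size_gb: float = field(compare=False)  # size of the training data

class MultiPriorityQueue:
    """Dispatch high-priority learning tasks before low-priority ones."""
    def __init__(self):
        self._heap = []

    def submit(self, task: Task) -> None:
        heapq.heappush(self._heap, task)

    def next_task(self) -> Task:
        return heapq.heappop(self._heap)

def reward(nodes_used: int, step_time: float, accuracy: float,
           w_res: float = 0.3, w_time: float = 0.3, w_acc: float = 0.4) -> float:
    """Hypothetical reward: penalize occupied nodes and training time,
    reward accuracy, mirroring the trade-off named in the abstract."""
    return w_acc * accuracy - w_res * nodes_used - w_time * step_time

def adjust_allocation(nodes: int, acc_delta: float, eps: float = 1e-3) -> int:
    """When a model nears convergence (accuracy gains flatten below eps),
    release one node so it can be reallocated to queued tasks."""
    return max(1, nodes - 1) if acc_delta < eps else nodes
```

In the paper's design, the trade-off between these terms is learned by the reinforcement learning policy rather than fixed by hand-tuned weights as in this sketch; the sketch only mirrors the structure of the reward and the convergence-driven release of computing resources.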
