DOI: 10.1109/SC41406.2024.00032
Research Article

PAL: A Variability-Aware Policy for Scheduling ML Workloads in GPU Clusters

Published: 17 November 2024

Abstract

Large-scale computing systems are increasingly using accelerators such as GPUs to enable peta- and exa-scale levels of compute to meet the needs of Machine Learning (ML) and scientific computing applications. Given the widespread and growing use of ML, including in some scientific applications, optimizing these clusters for ML workloads is particularly important. However, recent work has demonstrated that accelerators in these clusters can suffer from performance variability, which leads to resource under-utilization and load imbalance. In this work, we focus on how cluster schedulers, which are used to share accelerator-rich clusters across many concurrent ML jobs, can embrace performance variability to mitigate its effects. Our key insight is to characterize which applications are more likely to suffer from performance variability and to take that into account when placing jobs on the cluster. We design a novel cluster scheduler, PAL, which uses performance variability measurements and application-specific profiles to improve job performance and resource utilization. PAL also balances performance variability with locality to ensure jobs are spread across as few nodes as possible. Overall, PAL significantly improves GPU-rich cluster scheduling: across traces for six ML applications spanning image, language, and vision models with a variety of variability profiles, PAL improves geomean job completion time by 42%, cluster utilization by 28%, and makespan by 47% over existing state-of-the-art schedulers.
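To make the placement idea concrete, the sketch below shows one way a variability-aware, locality-aware placement score could be computed. This is an illustrative Python toy under assumed inputs, not PAL's actual algorithm: it assumes each GPU has a measured relative slowdown (1.0 = nominal), each application class has a profiled sensitivity to slow GPUs, and a synchronous job runs at the pace of its slowest GPU. The names place_job, gpu_slowdown, sensitivity, and the 0.1 locality weight are hypothetical.

    from itertools import combinations

    def place_job(job, free_gpus, gpu_slowdown, sensitivity, locality_weight=0.1):
        """Score candidate GPU sets by variability cost plus a locality penalty."""
        best, best_cost = None, float("inf")
        for cand in combinations(free_gpus, job["num_gpus"]):
            # A synchronous job is gated by its slowest GPU; scale by app sensitivity.
            worst = max(gpu_slowdown[g] for g in cand)
            variability_cost = 1.0 + sensitivity * (worst - 1.0)
            # Spanning more nodes hurts locality (communication overhead).
            nodes_spanned = len({node for node, _ in cand})
            cost = variability_cost + locality_weight * (nodes_spanned - 1)
            if cost < best_cost:
                best, best_cost = cand, cost
        return best

    # Toy usage: two nodes with four GPUs each, one slow GPU on node 0.
    free = [(node, gpu) for node in range(2) for gpu in range(4)]
    slowdown = {g: 1.0 for g in free}
    slowdown[(0, 0)] = 1.15
    job = {"num_gpus": 4}
    print(place_job(job, free, slowdown, sensitivity=0.8))  # picks node 1's four GPUs

The exhaustive search over GPU combinations is only for readability; a production scheduler would need a scalable placement heuristic and would refresh its variability measurements periodically.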

Published In

SC '24: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis
November 2024, 1758 pages
ISBN: 9798350352917

Publisher

IEEE Press

Publication History

Published: 17 November 2024

Author Tags

  1. Cluster Scheduling
  2. GPGPU
  3. Machine Learning
  4. Performance Variability
  5. Power Management

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SC '24

Acceptance Rates

Overall acceptance rate: 1,516 of 6,373 submissions (24%)

Article Metrics

  • Total Citations: 0
  • Total Downloads: 156
  • Downloads (last 12 months): 156
  • Downloads (last 6 weeks): 142

Reflects downloads up to 01 Jan 2025
