DOI: 10.5555/3692070.3692137

Practical performance guarantees for pipelined DNN inference

Published: 03 January 2025

Abstract

We optimize pipeline parallelism for deep neural network (DNN) inference by partitioning model graphs into k stages and minimizing the running time of the bottleneck stage, including communication. We give practical and effective algorithms for this NP-hard problem, but our emphasis is on tackling the practitioner's dilemma of deciding when a solution is good enough. To this end, we design novel mixed-integer programming (MIP) relaxations for proving lower bounds. Applying these methods to a diverse testbed of 369 production models, for k ∈ {2, 4, 8, 16, 32, 64}, we empirically show that these lower bounds are strong enough to be useful in practice. Our lower bounds are substantially stronger than standard combinatorial bounds. For example, evaluated via geometric means across a production testbed with k = 16 pipeline stages, our MIP formulations raise the lower bound from 0.4598 to 0.9452, expressed as a fraction of the best partition found. In other words, our improved lower bounds close the optimality gap by a factor of 9.855x.
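
For concreteness, the short Python sketch below reproduces the gap-closure arithmetic from the numbers quoted above. It is an illustration only: the variable names and the toy bottleneck-cost helper are assumptions for this sketch, not the paper's code or cost model.

```python
# Illustrative sketch only; names and the toy cost model are assumptions,
# not the paper's implementation.

def bottleneck_cost(stage_times, comm_times):
    """Toy stand-in for the objective: the pipeline is limited by its
    slowest stage, counting both compute and communication time."""
    return max(t + c for t, c in zip(stage_times, comm_times))

# Example: three stages with per-stage compute and communication times.
example = bottleneck_cost([3.0, 4.0, 2.5], [0.2, 0.1, 0.3])  # -> 4.1

# Lower bounds for k = 16, expressed as fractions of the best partition found.
combinatorial_lb = 0.4598   # standard combinatorial bound
mip_lb = 0.9452             # bound from the MIP relaxations

gap_before = 1.0 - combinatorial_lb   # 0.5402
gap_after = 1.0 - mip_lb              # 0.0548

# ~9.86x with these rounded figures; the paper reports 9.855x.
print(f"optimality gap closed by a factor of {gap_before / gap_after:.2f}x")
```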



Published In

ICML'24: Proceedings of the 41st International Conference on Machine Learning
July 2024
63010 pages

Publisher

JMLR.org


