DOI: 10.5555/3692070.3692137

Practical performance guarantees for pipelined DNN inference

Published: 03 January 2025

Abstract

We optimize pipeline parallelism for deep neural network (DNN) inference by partitioning model graphs into k stages and minimizing the running time of the bottleneck stage, including communication. We give practical and effective algorithms for this NP-hard problem, but our emphasis is on tackling the practitioner's dilemma of deciding when a solution is good enough. To this end, we design novel mixed-integer programming (MIP) relaxations for proving lower bounds. Applying these methods to a diverse testbed of 369 production models, for k ∈ {2, 4, 8, 16, 32, 64}, we empirically show that these lower bounds are strong enough to be useful in practice. Our lower bounds are substantially stronger than standard combinatorial bounds. For example, evaluated via geometric means across a production testbed with k = 16 pipeline stages, our MIP formulations raise the lower bound from 0.4598 to 0.9452, expressed as a fraction of the best partition found. In other words, our improved lower bounds close the optimality gap by a factor of 9.855x.
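
For concreteness, the short Python sketch below reproduces the gap-closure arithmetic from the numbers quoted above. It is an illustration only: the variable names and the toy bottleneck-cost helper are assumptions for this sketch, not the paper's code or cost model.

```python
# Illustrative sketch only; names and the toy cost model are assumptions,
# not the paper's implementation.

def bottleneck_cost(stage_times, comm_times):
    """Toy stand-in for the objective: the pipeline is limited by its
    slowest stage, counting both compute and communication time."""
    return max(t + c for t, c in zip(stage_times, comm_times))

# Example: three stages with per-stage compute and communication times.
example = bottleneck_cost([3.0, 4.0, 2.5], [0.2, 0.1, 0.3])  # -> 4.1

# Lower bounds for k = 16, expressed as fractions of the best partition found.
combinatorial_lb = 0.4598   # standard combinatorial bound
mip_lb = 0.9452             # bound from the MIP relaxations

gap_before = 1.0 - combinatorial_lb   # 0.5402
gap_after = 1.0 - mip_lb              # 0.0548

# ~9.86x with these rounded figures; the paper reports 9.855x.
print(f"optimality gap closed by a factor of {gap_before / gap_after:.2f}x")
```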



Published In

ICML'24: Proceedings of the 41st International Conference on Machine Learning
July 2024
63010 pages

Publisher

JMLR.org


