
Differentiable Slimming for Memory-Efficient Transformers

Published: 01 December 2023

Abstract

Transformer models consistently achieve state-of-the-art performance on a wide range of benchmarks. To meet ever more demanding performance targets, the number of model parameters is continuously increased. As a result, state-of-the-art Transformers require substantial computational resources, prohibiting their deployment on consumer-grade hardware. In the literature, overparameterized Transformers have been successfully reduced in size with the help of pruning strategies. However, existing works cannot optimize the full architecture in a fully differentiable manner without incurring significant overhead. Our work proposes a single-stage approach for training a Transformer for memory-efficient inference and various resource-constrained scenarios. Transformer blocks are extended with trainable gate parameters that attribute importance and control information flow. Integrating these gates into a differentiable pruning-aware training scheme allows extremely sparse subnetworks to be extracted at runtime with minimal performance degradation. Pruning results at the attention-head and layer levels illustrate the memory efficiency of our trained subnetworks under various memory budgets.
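
The abstract does not include implementation details, but the head-gating idea can be sketched roughly as follows. The PyTorch module below is a minimal sketch, not the authors' implementation: the class name GatedMultiHeadAttention, the sigmoid parameterization of the gates, and the pruning threshold are illustrative assumptions. It scales each attention head by a trainable gate and exposes a mask of the heads that would survive pruning.

import torch
import torch.nn as nn


class GatedMultiHeadAttention(nn.Module):
    """Multi-head self-attention with one trainable gate per head (illustrative sketch)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # One trainable gate logit per attention head; sigmoid keeps gates in (0, 1).
        self.gate_logits = nn.Parameter(torch.zeros(n_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, tokens, d_head).
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        heads = attn @ v                          # per-head outputs
        gates = torch.sigmoid(self.gate_logits)   # differentiable head importances
        heads = heads * gates.view(1, -1, 1, 1)   # scale each head by its gate
        return self.out(heads.transpose(1, 2).reshape(b, t, -1))

    @torch.no_grad()
    def head_mask(self, threshold: float = 0.05) -> torch.Tensor:
        # Heads whose gate falls below the threshold can be removed entirely,
        # shrinking the projection matrices and the memory footprint.
        return torch.sigmoid(self.gate_logits) >= threshold


# Usage example:
layer = GatedMultiHeadAttention(d_model=256, n_heads=8)
y = layer(torch.randn(2, 16, 256))   # output shape (2, 16, 256)
print(layer.head_mask())             # heads a pruned subnetwork would keep

During training, such gates are typically driven toward zero with a sparsity penalty (for example, an L1 term on the gate values) so that low-importance heads, or entire layers gated the same way, can be dropped to meet a given memory budget; the exact training objective used in the letter is not reproduced here.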

Published In

IEEE Embedded Systems Letters, Volume 15, Issue 4, Dec. 2023, 73 pages

        Publisher

        IEEE Press

        Qualifiers

        • Research-article
