PSLT: A Light-weight Vision Transformer with Ladder Self-Attention and Progressive Shift

Wu, Gaojie; Zheng, Wei-Shi; Lu, Yutong; Tian, Qi

doi:10.1109/TPAMI.2023.3265499

arXiv:2304.03481 (cs)

[Submitted on 7 Apr 2023]

Abstract: Vision Transformer (ViT) has shown great potential for various visual tasks due to its ability to model long-range dependencies. However, ViT requires a large amount of computing resources to compute global self-attention. In this work, we propose a ladder self-attention block with multiple branches and a progressive shift mechanism to develop a light-weight transformer backbone that requires fewer computing resources (e.g., a relatively small number of parameters and FLOPs), termed Progressive Shift Ladder Transformer (PSLT). First, the ladder self-attention block reduces the computational cost by modelling local self-attention in each branch. Meanwhile, the progressive shift mechanism enlarges the receptive field of the ladder self-attention block by modelling diverse local self-attention in each branch and letting the branches interact. Second, the input feature of the ladder self-attention block is split equally along the channel dimension among the branches, which considerably reduces the computational cost of the block (to nearly 1/3 of the parameters and FLOPs), and the outputs of the branches are then combined by a pixel-adaptive fusion. As a result, the ladder self-attention block is capable of modelling long-range interactions with a relatively small number of parameters and FLOPs. Built on the ladder self-attention block, PSLT performs well on several vision tasks, including image classification, object detection and person re-identification. On the ImageNet-1k dataset, PSLT achieves a top-1 accuracy of 79.9% with 9.2M parameters and 1.9G FLOPs, which is comparable to several existing models with more than 20M parameters and 4G FLOPs. Code is available at this https URL.
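
The "nearly 1/3" figure can be illustrated with a back-of-the-envelope count. The sketch below is not the authors' code: it assumes three branches (chosen only because that is consistent with the 1/3 figure), counts only the query, key and value projection weights, and ignores biases, the attention product itself, and the pixel-adaptive fusion. The function names and the sizes 192 and 196 are hypothetical.

```python
def projection_params(channels: int, branches: int = 1) -> int:
    """Weights in the Q, K and V linear projections when `channels` is
    split equally across `branches` independent branches (biases ignored)."""
    if channels % branches != 0:
        raise ValueError("channels must be divisible by branches")
    per_branch = channels // branches
    # Each branch owns three (per_branch x per_branch) matrices: Q, K and V.
    return branches * 3 * per_branch * per_branch


def projection_flops(tokens: int, channels: int, branches: int = 1) -> int:
    """Multiply-accumulate operations needed to apply those projections
    to `tokens` tokens."""
    return tokens * projection_params(channels, branches)


if __name__ == "__main__":
    # Hypothetical sizes for illustration; not taken from the paper.
    channels, tokens, branches = 192, 196, 3
    full = projection_params(channels)
    split = projection_params(channels, branches)
    print(f"unsplit projection weights: {full}")
    print(f"split across {branches} branches: {split}")
    print(f"ratio: {split / full:.3f}")
```

Under these assumptions, splitting C channels across B branches shrinks the projection cost from 3C^2 to 3C^2/B, so three branches give exactly one third. Any component whose cost does not shrink by the same factor would push the overall ratio away from exactly 1/3, which may be why the abstract says "nearly".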

Comments:	Accepted to IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023 (Submission date: 08-Jul-202)
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2304.03481 [cs.CV]
	(or arXiv:2304.03481v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2304.03481
Journal reference:	IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023
Related DOI:	https://doi.org/10.1109/TPAMI.2023.3265499

Submission history

From: Gaojie Wu
[v1] Fri, 7 Apr 2023 05:21:37 UTC (703 KB)
