Improving Local Features with Relevant Spatial Information by Vision Transformer for Crowd Counting


Nguyen Hoang Tran (VinBrain)*, Ta Duc Huy (VinBrain), Soan T. M. Duong (Le Quy Don Technical University), Phan Nguyen (VinBrain), Dao Huu Hung (VinBrain), Chanh D Tr Nguyen (VinBrain), Trung Bui (Individual), Quoc Hung Truong (VinBrain)
The 33rd British Machine Vision Conference

Abstract

Vision Transformer (ViT) variants have demonstrated state-of-the-art performance on many computer vision benchmarks, including crowd counting. Although Transformer-based models have shown breakthroughs in crowd counting, existing methods have limitations: global embeddings extracted from ViTs do not encapsulate fine-grained local features and are therefore prone to errors in crowded scenes with diverse human scales and densities. In this paper, we propose LoViTCrowd, arguing that LOcal features enriched with spatial information from relevant regions via the attention mechanism of ViT can effectively reduce the crowd counting error. To this end, we divide each image into a cell grid and consider patches of 3 x 3 cells, in which the main parts of the human body are encapsulated and the surrounding cells provide meaningful cues for crowd estimation. ViT is adapted on each patch to apply the attention mechanism across the 3 x 3 cells and count the number of people in the central cell. The number of people in the image is obtained by summing the counts of its non-overlapping cells. Extensive experiments on four public datasets of sparse and dense scenes, i.e., Mall, ShanghaiTech Part A, ShanghaiTech Part B, and UCF-QNRF, demonstrate our method's state-of-the-art performance. Compared to TransCrowd, LoViTCrowd reduces the root mean square error (RMSE) and the mean absolute error (MAE) by an average of 14.2% and 9.7%, respectively. The source code is available at https://github.com/nguyen1312/LoViTCrowd.
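
The cell-wise counting scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the zero-padding at image borders, and the `dot_map_center_count` stand-in (which replaces the ViT regressor with a simple sum over an annotation dot map) are all assumptions made for illustration. The sketch only shows how per-cell predictions, each made from a 3 x 3 cell patch, add up to an image-level count, and how MAE and RMSE are computed over a set of images.

```python
import math
from typing import Callable, List, Sequence

Image = List[List[float]]  # 2D map as nested lists (illustrative assumption)


def pad_image(image: Image, pad: int) -> Image:
    """Zero-pad the image by `pad` pixels on every side (border handling is an assumption)."""
    width = len(image[0])
    blank = [0.0] * (width + 2 * pad)
    rows = [[0.0] * pad + list(row) + [0.0] * pad for row in image]
    return [blank[:] for _ in range(pad)] + rows + [blank[:] for _ in range(pad)]


def count_crowd(image: Image, cell: int,
                count_center_cell: Callable[[Image], float]) -> float:
    """Sum per-cell counts; each cell is predicted from the 3 x 3 cell patch around it."""
    height, width = len(image), len(image[0])
    if height % cell or width % cell:
        raise ValueError("image size must be a multiple of the cell size")
    padded = pad_image(image, cell)
    total = 0.0
    for top in range(0, height, cell):
        for left in range(0, width, cell):
            # In padded coordinates the patch starts exactly one cell before
            # the central cell, which is at (top, left).
            patch = [row[left:left + 3 * cell] for row in padded[top:top + 3 * cell]]
            total += count_center_cell(patch)
    return total


def dot_map_center_count(patch: Image) -> float:
    """Hypothetical stand-in for the ViT regressor: sums dots in the central cell."""
    cell = len(patch) // 3
    return float(sum(sum(row[cell:2 * cell]) for row in patch[cell:2 * cell]))


def mae(pred: Sequence[float], true: Sequence[float]) -> float:
    """Mean absolute error over per-image counts."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def rmse(pred: Sequence[float], true: Sequence[float]) -> float:
    """Root mean square error over per-image counts."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))


if __name__ == "__main__":
    # Toy dot map: a 1 marks an annotated head position.
    dots = [
        [0, 1, 0, 0, 0, 0],
        [0, 0, 0, 1, 0, 0],
        [1, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 0],
    ]
    print(count_crowd(dots, cell=2, count_center_cell=dot_map_center_count))
```

Because the cells do not overlap, each person is attributed to exactly one central cell, so the per-cell counts can be summed without double counting. In the method itself, the surrounding eight cells would contribute context through attention rather than being ignored as in this stand-in.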

Citation

@inproceedings{Tran_2022_BMVC,
author    = {Nguyen Hoang Tran and Ta Duc Huy and Soan T. M. Duong and Phan Nguyen and Dao Huu Hung and Chanh D Tr Nguyen and Trung Bui and Quoc Hung Truong},
title     = {Improving Local Features with Relevant Spatial Information by Vision Transformer for Crowd Counting},
booktitle = {33rd British Machine Vision Conference 2022, {BMVC} 2022, London, UK, November 21-24, 2022},
publisher = {{BMVA} Press},
year      = {2022},
url       = {https://bmvc2022.mpi-inf.mpg.de/0729.pdf}
}


Copyright © 2022 The British Machine Vision Association and Society for Pattern Recognition
The British Machine Vision Conference is organised by The British Machine Vision Association and Society for Pattern Recognition. The Association is a Company limited by guarantee, No.2543446, and a non-profit-making body, registered in England and Wales as Charity No.1002307 (Registered Office: Dept. of Computer Science, Durham University, South Road, Durham, DH1 3LE, UK).
