research-article

Multi-Zone Transformer Based on Self-Distillation for Facial Attribute Recognition

Authors:

Yun Wu

2023 IEEE 17th International Conference on Automatic Face and Gesture Recognition (FG)

Pages 1 - 7

https://doi.org/10.1109/FG57933.2023.10042513

Published: 05 January 2023

Abstract

Recently, transformers have shown promising performance in various computer vision tasks. However, current transformer-based methods ignore the information exchange between transformer blocks, and they have not been applied to the facial attribute recognition (FAR) task. In this paper, we propose a multi-zone transformer based on self-distillation, termed MZTS, to predict facial attributes. A multi-zone transformer encoder is first presented to enable interaction among the different transformer encoder blocks, which prevents effective information from being forgotten between the transformer encoder block groups during the iteration process. Furthermore, we introduce a new self-distillation mechanism based on class tokens, which distills the class tokens obtained from the last transformer encoder block group into the shallower groups by exchanging significant information between the different transformer blocks through attention. Extensive experiments on the challenging CelebA and LFWA datasets demonstrate the excellent performance of the proposed method for FAR.
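
The class-token self-distillation idea can be illustrated with a small sketch. The abstract does not state the loss function, the number of block groups, or how the attention-based interaction is computed, so everything below is an assumption made for illustration: the function names are hypothetical, the distance is a plain mean squared error, and the class token of the last group is simply treated as a fixed target for the shallower groups. It is not the MZTS implementation.

```python
from typing import List, Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("class tokens must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def class_token_distillation_loss(group_tokens: List[Sequence[float]]) -> float:
    """Average distance from each shallow group's class token to the deepest one.

    group_tokens[i] is the class token produced by the i-th encoder block group.
    The last entry plays the teacher role (assumed to be a constant target);
    all earlier entries are students. MSE is an assumed choice of distance.
    """
    if len(group_tokens) < 2:
        raise ValueError("need at least one shallow group and one deep group")
    teacher = group_tokens[-1]
    shallow = group_tokens[:-1]
    return sum(mse(student, teacher) for student in shallow) / len(shallow)


if __name__ == "__main__":
    # Three hypothetical block groups with 2-dimensional class tokens.
    tokens = [[0.0, 0.0], [1.0, 1.0], [1.0, 3.0]]
    print(class_token_distillation_loss(tokens))
```

In a real training setup a term like this would typically be added to the attribute classification loss, with gradients blocked on the teacher token; whether the paper does so is not stated in the abstract.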


Published In

2023 IEEE 17th International Conference on Automatic Face and Gesture Recognition (FG)

Jan 2023

540 pages

Copyright © 2023.

Publisher

IEEE Press
