skip to main content
10.1145/3581783.3611934acmconferencesArticle/Chapter ViewAbstractPublication PagesmmConference Proceedingsconference-collections
research-article

Object Part Parsing with Hierarchical Dual Transformer

Published: 27 October 2023 Publication History

Abstract

Object part parsing involves segmenting objects into semantic parts, which has drawn great attention recently. The current methods ignore the specific hierarchical structure of the object, which can be used as strong prior knowledge. To address this, we propose the Hierarchical Dual Transformer (HDTR) to explore the contribution of the typical structural priors of the object parts. HDTR first generates the pyramid multi-granularity pixel representations under the supervision of the object part parsing maps at different semantic levels and then assigns each region an initial part embedding. Moreover, HDTR generates an edge pixel representation to extend the capability of the network to capture detailed information. Afterward, we design a Hierarchical Part Transformer to upgrade the part embeddings to their hierarchical counterparts with the assistance of the multi-granularity pixel representations. Next, we propose a Hierarchical Pixel Transformer to infer the hierarchical information from the part embeddings to enrich the pixel representations. Note that both transformer decoders rely on the structural relations between object parts, i.e., dependency, composition, and decomposition relations. The experiments on five large-scale datasets, i.e., LaPa, CelebAMask-HQ, CIHP, LIP and Pascal Animal, demonstrate that our method sets a new state-of-the-art performance for object part parsing.

References

[1]
Hiroaki Aizawa, Yukihiro Domae, and Kunihito Kato. 2021. Hierarchical Pyramid Representations for Semantic Segmentation. arXiv preprint arXiv:2104.01792 (2021).
[2]
Shubhankar Borse, Ying Wang, Yizhe Zhang, and Fatih Porikli. 2021. Inverseform: A loss function for structured boundary-aware segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5901--5911.
[3]
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In European Conference on Computer Vision. Springer, 213--229.
[4]
Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. 2018. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In European Conference on Computer Vision.
[5]
Xianjie Chen, Roozbeh Mottaghi, Xiaobai Liu, Sanja Fidler, Raquel Urtasun, and Alan Yuille. 2014. Detect what you can: Detecting and representing objects using holistic models and body parts. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1971--1978.
[6]
Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. 2022. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1290--1299.
[7]
Bowen Cheng, Alex Schwing, and Alexander Kirillov. 2021. Per-pixel classification is not all you need for semantic segmentation. Advances in Neural Information Processing Systems, Vol. 34 (2021), 17864--17875.
[8]
MMSegmentation Contributors. 2020. MMSegmentation: OpenMMLab Semantic Segmentation Toolbox and Benchmark. https://github.com/open-mmlab/mmsegmentation.
[9]
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations.
[10]
Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick Van Der Smagt, Daniel Cremers, and Thomas Brox. 2015. Flownet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision. 2758--2766.
[11]
Arthur Douillard, Yifu Chen, Arnaud Dapogny, and Matthieu Cord. 2021. Plop: Learning without forgetting for continual semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4040--4050.
[12]
Ke Gong, Yiming Gao, Xiaodan Liang, Xiaohui Shen, Meng Wang, and Liang Lin. 2019. Graphonomy: Universal human parsing via graph transfer learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7450--7459.
[13]
Ke Gong, Xiaodan Liang, Yicheng Li, Yimin Chen, Ming Yang, and Liang Lin. 2018. Instance-level human parsing via part grouping network. In Proceedings of the European conference on computer vision (ECCV). 770--785.
[14]
Haoyu He, Jing Zhang, Qiming Zhang, and Dacheng Tao. 2020. Grapy-ML: Graph pyramid mutual learning for cross-dataset human parsing. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 10949--10956.
[15]
Lei He, Jiwen Lu, Guanghui Wang, Shiyu Song, and Jie Zhou. 2021. SOSD-Net: Joint semantic object segmentation and depth estimation from monocular images. Neurocomputing, Vol. 440 (2021), 251--263.
[16]
Cheng-Han Lee, Ziwei Liu, Lingyun Wu, and Ping Luo. 2020. Maskgan: Towards diverse and interactive facial image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5549--5558.
[17]
Jihyun Lee, Binod Bhattarai, and Tae-Kyun Kim. 2021. Face Parsing From RGB and Depth Using Cross-Domain Mutual Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1501--1510.
[18]
Liulei Li, Tianfei Zhou, Wenguan Wang, Jianwu Li, and Yi Yang. 2022. Deep Hierarchical Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1246--1257.
[19]
Xiangtai Li, Xia Li, Li Zhang, Guangliang Cheng, Jianping Shi, Zhouchen Lin, Shaohua Tan, and Yunhai Tong. 2020. Improving semantic segmentation via decoupled body and edge supervision. In European Conference on Computer Vision. Springer, 435--452.
[20]
Jinpeng Lin, Hao Yang, Dong Chen, Ming Zeng, Fang Wen, and Lu Yuan. 2019. Face parsing with roi tanh-warping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5654--5663.
[21]
Yiming Lin, Jie Shen, Yujiang Wang, and Maja Pantic. 2021. RoI Tanh-polar transformer network for face parsing in the wild. Image and Vision Computing, Vol. 112 (2021), 104190.
[22]
Yinglu Liu, Hailin Shi, Hao Shen, Yue Si, Xiaobo Wang, and Tao Mei. 2020. A new dataset and boundary-attention semantic segmentation for face parsing. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 11637--11644.
[23]
Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 10012--10022.
[24]
Umberto Michieli, Edoardo Borsato, Luca Rossi, and Pietro Zanuttigh. 2020. Gmnet: Graph matching network for large scale part semantic segmentation in the wild. In European conference on computer vision. Springer, 397--414.
[25]
Umberto Michieli and Pietro Zanuttigh. 2021. Continual semantic segmentation via repulsion-attraction of sparse and disentangled latent representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1114--1124.
[26]
Jiteng Mu, Weichao Qiu, Gregory D Hager, and Alan L Yuille. 2020. Learning from synthetic animals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 12386--12395.
[27]
Bo Pang, Yizhuo Li, Jiefeng Li, Muchen Li, Hanwen Cao, and Cewu Lu. 2021. Tdaf: Top-down attention framework for vision tasks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 2384--2392.
[28]
Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. 2019. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5693--5703.
[29]
Gusi Te, Wei Hu, Yinglu Liu, Hailin Shi, and Tao Mei. 2021. Adaptive Graph Representation Learning and Reasoning for Face Parsing. arXiv e-prints (2021), arXiv-2101.
[30]
Gusi Te, Yinglu Liu, Wei Hu, Hailin Shi, and Tao Mei. 2020. Edge-aware graph representation learning and reasoning for face parsing. In European Conference on Computer Vision. Springer, 258--274.
[31]
Wenguan Wang, Hailong Zhu, Jifeng Dai, Yanwei Pang, Jianbing Shen, and Ling Shao. 2020. Hierarchical human parsing with typed part-relation reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8929--8939.
[32]
Yingming Wang, Xiangyu Zhang, Tong Yang, and Jian Sun. 2021. Anchor DETR: Query Design for Transformer-Based Object Detection. arXiv preprint arXiv:2109.07107 (2021).
[33]
Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. 2021. SegFormer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems, Vol. 34 (2021), 12077--12090.
[34]
Linwei Ye, Zhi Liu, and Yang Wang. 2017. Depth-aware object instance segmentation. In 2017 IEEE International Conference on Image Processing (ICIP). IEEE, 325--329.
[35]
Zi Yin, Valentin Yiu, Xiaolin Hu, and Liang Tang. 2021. End-to-end face parsing via interlinked convolutional neural networks. Cognitive Neurodynamics, Vol. 15, 1 (2021), 169--179.
[36]
Yuhui Yuan, Xilin Chen, and Jingdong Wang. 2020. Object-contextual representations for semantic segmentation. In European Conference on Computer Vision. Springer, 173--190.
[37]
Yuhui Yuan, Rao Fu, Lang Huang, Weihong Lin, Chao Zhang, Xilin Chen, and Jingdong Wang. 2021. Hrformer: High-resolution vision transformer for dense predict. Advances in Neural Information Processing Systems, Vol. 34 (2021), 7281--7293.
[38]
Yuhui Yuan, Lang Huang, Jianyuan Guo, Chao Zhang, Xilin Chen, and Jingdong Wang. 2018. Ocnet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916 (2018).
[39]
Xiaomei Zhang, Yingying Chen, Bingke Zhu, Jinqiao Wang, and Ming Tang. 2020. Part-aware context network for human parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 8971--8980.
[40]
Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. 2017. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2881--2890.
[41]
Mingmin Zhen, Jinglu Wang, Lei Zhou, Shiwei Li, Tianwei Shen, Jiaxiang Shang, Tian Fang, and Long Quan. 2020. Joint semantic segmentation and boundary detection using iterative pyramid contexts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13666--13675.
[42]
Qingping Zheng, Jiankang Deng, Zheng Zhu, Ying Li, and Stefanos Zafeiriou. 2022. Decoupled Multi-task Learning with Cyclical Self-Regulation for Face Parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4156--4165.
[43]
Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip HS Torr, et al. 2021. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 6881--6890.
[44]
Yisu Zhou, Xiaolin Hu, and Bo Zhang. 2015. Interlinked convolutional neural networks for face parsing. In International Symposium on Neural Networks. Springer, 222--231.

Cited By

View all

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN:9798400701085
DOI:10.1145/3581783
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 October 2023

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. computer vision
  2. face parsing
  3. human parsing
  4. semantic segmentation

Qualifiers

  • Research-article

Funding Sources

  • the Key Research and Development Program of Shaanxi

Conference

MM '23
Sponsor:
MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)79
  • Downloads (Last 6 weeks)6
Reflects downloads up to 09 Jan 2025

Other Metrics

Citations

Cited By

View all

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media