research-article

Object Part Parsing with Hierarchical Dual Transformer

Authors:

Chen Qian

MM '23: Proceedings of the 31st ACM International Conference on Multimedia

Pages 2016 - 2024

https://doi.org/10.1145/3581783.3611934

Published: 27 October 2023

Abstract

Object part parsing, which segments objects into semantic parts, has drawn considerable attention recently. Current methods ignore the hierarchical structure of the object, which can serve as strong prior knowledge. To address this, we propose the Hierarchical Dual Transformer (HDTR) to exploit the structural priors of object parts. HDTR first generates pyramid multi-granularity pixel representations under the supervision of object part parsing maps at different semantic levels, and then assigns each region an initial part embedding. In addition, HDTR generates an edge pixel representation to improve the network's ability to capture fine detail. We then design a Hierarchical Part Transformer that upgrades the part embeddings to their hierarchical counterparts with the assistance of the multi-granularity pixel representations. Next, we propose a Hierarchical Pixel Transformer that infers hierarchical information from the part embeddings to enrich the pixel representations. Both transformer decoders rely on the structural relations between object parts, i.e., dependency, composition, and decomposition relations. Experiments on five large-scale datasets, i.e., LaPa, CelebAMask-HQ, CIHP, LIP, and Pascal Animal, demonstrate that our method achieves new state-of-the-art performance for object part parsing.
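
The two-way exchange between part embeddings and pixel representations described in the abstract can be illustrated with a generic cross-attention sketch. This is not the HDTR implementation: it assumes plain single-head scaled dot-product attention with no learned projections, it omits the structural relations (dependency, composition, decomposition) and the edge branch entirely, and all names and sizes (`cross_attention`, `num_pixels`, `num_parts`, `dim`) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross-attention with a residual add.

    Assumption: keys and values are the same tensor and there are no
    learned projection matrices; a real transformer decoder has both.
    """
    dim = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(dim)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return queries + weights @ keys_values, weights


rng = np.random.default_rng(0)
num_pixels, num_parts, dim = 64, 5, 16  # hypothetical sizes
pixels = rng.normal(size=(num_pixels, dim))  # stand-in pixel representations
parts = rng.normal(size=(num_parts, dim))    # stand-in initial part embeddings

# Step 1 (part side): part embeddings query the pixel representations.
parts_updated, part_to_pixel = cross_attention(parts, pixels)

# Step 2 (pixel side): pixels query the updated part embeddings.
pixels_updated, pixel_to_part = cross_attention(pixels, parts_updated)

# A crude per-pixel part assignment from the pixel-to-part attention weights.
labels = pixel_to_part.argmax(axis=-1)
```

In this sketch the only coupling between the two steps is that the pixel side attends to the already updated part embeddings, which mirrors the ordering in the abstract (part transformer first, pixel transformer second).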

Index Terms

Object Part Parsing with Hierarchical Dual Transformer
1. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        Image segmentation

Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia

October 2023

9913 pages

ISBN:9798400701085

DOI:10.1145/3581783

General Chairs:
Abdulmotaleb El Saddik (University of Ottawa, Canada & MBZUAI, UAE)
Tao Mei (HiDream.ai, China)
Rita Cucchiara (University of Modena and Reggio Emilia, Italy)

Program Chairs:
Marco Bertini (University of Florence, Italy)
Diana Patricia Tobon Vallejo (Universidad de Medellin, Colombia)
Pradeep K. Atrey (University at Albany, State University of New York, USA)
M. Shamim Hossain (King Saud University, KSA)

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGMM: ACM Special Interest Group on Multimedia

Publisher

Association for Computing Machinery

New York, NY, United States

Funding Sources

the Key Research and Development Program of Shaanxi

Conference

MM '23

Sponsor:

SIGMM

MM '23: The 31st ACM International Conference on Multimedia

October 29 - November 3, 2023

Ottawa ON, Canada

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
