Advances in Image Recognition and Processing Technologies

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: 30 November 2024 | Viewed by 15382

Special Issue Editors


Guest Editor
College of Information Science and Technology, Beijing University of Chemical Technology (BUCT) and Engineering Research Center of Intelligent PSE, Ministry of Education of China, Beijing 100029, China
Interests: pattern recognition; detection and tracking; visual intelligence

Guest Editor
State Key Laboratory of Virtual Reality Technology and Systems, School of Computer Science and Engineering, Beihang University, Beijing 100191, China
Interests: computer vision; pattern recognition; image processing; edge intelligence

Special Issue Information

Dear Colleagues,

Image recognition and processing technologies have driven significant advances in many fields in recent years. However, the inherent complexity of computer vision still leaves many challenges unaddressed, limiting performance in a range of applications. This Special Issue is therefore being assembled to share in-depth research results on image recognition and processing methods, including, but not limited to, object detection, object tracking, image super-resolution, depth estimation, and semantic segmentation. We hope that these advanced methods will accelerate the adoption of such technologies in the real world.

It is our pleasure to invite you to join this Special Issue, entitled “Advances in Image Recognition and Processing Technologies”, and to contribute a manuscript presenting your research progress. Thank you very much.

Dr. Yang Zhang
Dr. Shuai Wang
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • artificial intelligence
  • image recognition
  • image processing
  • image super-resolution
  • depth estimation
  • semantic segmentation
  • object detection
  • object tracking

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (12 papers)

Research

18 pages, 4253 KiB  
Article
RSTSRN: Recursive Swin Transformer Super-Resolution Network for Mars Images
by Fanlu Wu, Xiaonan Jiang, Tianjiao Fu, Yao Fu, Dongdong Xu and Chunlei Zhao
Appl. Sci. 2024, 14(20), 9286; https://doi.org/10.3390/app14209286 - 12 Oct 2024
Viewed by 613
Abstract
High-resolution optical images provide planetary geology researchers with finer, more detailed image data. To maximize scientific output, it is necessary to further increase the resolution of acquired images, making image super-resolution (SR) reconstruction techniques the natural choice. To address the large parameter counts and high computational complexity of current deep learning-based image SR reconstruction methods, we propose a novel Recursive Swin Transformer Super-Resolution Network (RSTSRN), which builds on LapSRN as its backbone architecture. A Residual Swin Transformer Block (RSTB), consisting of stacked Swin Transformer Blocks (STBs) with a residual connection, is used for more efficient residual learning. Moreover, parameter sharing is introduced to reduce the number of parameters, and a multi-scale training strategy is designed to accelerate convergence. Experimental results show that the proposed RSTSRN outperforms state-of-the-art methods with similar parameter counts on 2×, 4× and 8× SR tasks, with a particularly large margin on high-magnification SR. Compared to the LapSRN network, on 2×, 4× and 8× Mars image SR tasks the RSTSRN increases PSNR by 0.35 dB, 0.88 dB and 1.22 dB, and SSIM by 0.0048, 0.0114 and 0.0311, respectively. Full article
(This article belongs to the Special Issue Advances in Image Recognition and Processing Technologies)
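
As a rough illustration of the residual-learning idea above, the following minimal PyTorch sketch stacks transformer blocks inside an outer residual connection and reuses one shared block across pyramid stages. It is only a sketch under stated assumptions: plain multi-head self-attention stands in for windowed/shifted Swin attention, and the dimensions, depth, and number of stages are illustrative rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class SimpleSTB(nn.Module):
    """Stand-in for a Swin Transformer Block: pre-norm attention + MLP.
    (Real Swin blocks use windowed/shifted attention; plain MHSA keeps the sketch short.)"""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                      # x: (B, N, dim) token sequence
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]          # attention with inner residual
        x = x + self.mlp(self.norm2(x))        # MLP with inner residual
        return x

class RSTB(nn.Module):
    """Residual Swin Transformer Block: stacked STBs plus an outer residual path."""
    def __init__(self, dim, depth=4):
        super().__init__()
        self.blocks = nn.Sequential(*[SimpleSTB(dim) for _ in range(depth)])

    def forward(self, x):
        return x + self.blocks(x)              # outer residual connection

# Parameter sharing in a LapSRN-style pyramid: reuse one RSTB at every scale stage.
shared_rstb = RSTB(dim=64)
feats = torch.randn(1, 256, 64)                # (batch, tokens, channels), assumed sizes
for _ in range(3):                             # 2x -> 4x -> 8x stages reuse the same module
    feats = shared_rstb(feats)
```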

20 pages, 5228 KiB  
Article
Remote Sensing Image Change Detection Based on Deep Learning: Multi-Level Feature Cross-Fusion with 3D-Convolutional Neural Networks
by Sibo Yu, Chen Tao, Guang Zhang, Yubo Xuan and Xiaodong Wang
Appl. Sci. 2024, 14(14), 6269; https://doi.org/10.3390/app14146269 - 18 Jul 2024
Viewed by 968
Abstract
Change detection (CD) in high-resolution remote sensing imagery remains challenging due to the complex nature of objects and varying spectral characteristics across different times and locations. Convolutional neural networks (CNNs) have shown promising performance in CD tasks by extracting meaningful semantic features. However, traditional 2D-CNNs may struggle to accurately integrate deep features from multi-temporal images, limiting their ability to improve CD accuracy. This study proposes a Multi-level Feature Cross-Fusion (MFCF) network with 3D-CNNs for remote sensing image change detection. The network aims to effectively extract and fuse deep features from multi-temporal images to identify surface changes. To bridge the semantic gap between high-level and low-level features, a MFCF module is introduced. A channel attention mechanism (CAM) is also integrated to enhance model performance, interpretability, and generalization capabilities. The proposed methodology is validated on the LEVIR construction dataset (LEVIR-CD). The experimental results demonstrate superior performance compared to the current state-of-the-art in evaluation metrics including recall, F1 score, and IOU. The MFCF network, which combines 3D-CNNs and a CAM, effectively utilizes multi-temporal information and deep feature fusion, resulting in precise and reliable change detection in remote sensing imagery. This study significantly contributes to the advancement of change detection methods, facilitating more efficient management and decision making across various domains such as urban planning, natural resource management, and environmental monitoring. Full article
(This article belongs to the Special Issue Advances in Image Recognition and Processing Technologies)
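
The channel attention mechanism (CAM) mentioned above can be sketched in the common squeeze-and-excitation form, adapted to 3D feature volumes so that bi-temporal inputs can be stacked along a temporal axis. The reduction ratio, tensor shapes, and the way the two dates are concatenated are assumptions for illustration, not the paper's exact module.

```python
import torch
import torch.nn as nn

class ChannelAttention3D(nn.Module):
    """Squeeze-and-excitation style channel attention for 3D-CNN feature volumes."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)            # squeeze T, H, W to a channel vector
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                              # x: (B, C, T, H, W)
        b, c = x.shape[:2]
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1, 1)
        return x * w                                   # re-weight channels

# Bi-temporal change-detection features: stack the two acquisition dates on the T axis.
feat_t1 = torch.randn(2, 32, 1, 64, 64)
feat_t2 = torch.randn(2, 32, 1, 64, 64)
fused = ChannelAttention3D(32)(torch.cat([feat_t1, feat_t2], dim=2))  # (2, 32, 2, 64, 64)
```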

18 pages, 3407 KiB  
Article
GenTrajRec: A Graph-Enhanced Trajectory Recovery Model Based on Signaling Data
by Hongyao Huang, Haozhi Xie, Zihang Xu, Mingzhe Liu, Yi Xu and Tongyu Zhu
Appl. Sci. 2024, 14(13), 5934; https://doi.org/10.3390/app14135934 - 8 Jul 2024
Viewed by 648
Abstract
Signaling data are records of the interactions of users’ mobile phones with their nearest cellular stations. They can provide long-term, continuous-time location data for large numbers of citizens and therefore have great potential in intelligent transportation, smart cities, and urban sensing. However, raw signaling data suffer from two problems: (1) low positioning accuracy, since the signaling data only describe the interaction between the user and the mobile base station and thus can only recover a user’s approximate geographical location; and (2) poor data quality, since the limitations of mobile signals cause user signaling to be missing or to drift. To address these issues, we propose a graph-enhanced trajectory recovery network, GenTrajRec, to recover precise trajectories from signaling data. GenTrajRec encodes signaling data through spatiotemporal encoders and enhances the traveling semantics by constructing a signaling transition graph. By fusing the spatiotemporal information with the deep traveling semantics, GenTrajRec can tackle the challenge of poor data quality and recover precise trajectories from raw signaling data. Extensive experiments have been conducted on two real-world datasets from Mobile Signaling and Geolife, and the results confirm the effectiveness of our approach: positioning accuracy improves from 315 m per point to 82 m per point for signaling data using our network. Full article
(This article belongs to the Special Issue Advances in Image Recognition and Processing Technologies)
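
One plausible reading of the signaling transition graph is a weighted graph whose edges count how often users hop between consecutive cell towers. The plain-Python sketch below builds such a graph from hypothetical (timestamp, cell_id) records; the field layout is an assumption, not taken from the paper.

```python
from collections import defaultdict

def build_transition_graph(signaling_records):
    """signaling_records: dict user_id -> list of (timestamp, cell_id) tuples.
    Returns edge weights counting how often users move from one cell to the next."""
    edges = defaultdict(int)
    for records in signaling_records.values():
        cells = [cell for _, cell in sorted(records)]       # order by timestamp
        for src, dst in zip(cells, cells[1:]):
            if src != dst:                                   # ignore repeated pings to one tower
                edges[(src, dst)] += 1
    return edges

# Toy example: two users moving through three towers.
logs = {
    "u1": [(0, "A"), (60, "A"), (120, "B"), (180, "C")],
    "u2": [(0, "A"), (90, "B"), (150, "B"), (210, "C")],
}
print(build_transition_graph(logs))   # {('A', 'B'): 2, ('B', 'C'): 2}
```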

15 pages, 4267 KiB  
Article
Steel Surface Defect Detection Algorithm Based on Improved YOLOv8n
by Tian Zhang, Pengfei Pan, Jie Zhang and Xiaochen Zhang
Appl. Sci. 2024, 14(12), 5325; https://doi.org/10.3390/app14125325 - 20 Jun 2024
Cited by 2 | Viewed by 1516
Abstract
Traditional detection methods for steel surface defects suffer from weak feature extraction, sluggish detection speed, and subpar detection performance. In this paper, a YOLOv8-based DDI-YOLO model is proposed for effective steel surface defect detection. First, in the backbone network, the extended residual module (DWR) is fused with the C2f module to obtain C2f_DWR, and a two-step approach is used to extract multiscale contextual information effectively; the feature maps formed from the multiscale receptive fields are then fused to enhance feature extraction capacity. Building on this, an extended heavy-parameter module (DRB) is added to the C2f_DWR structure to compensate for C2f’s limited ability to capture small-scale defect patterns during training and to improve training fluency. Finally, the Inner-IoU loss function is employed to enhance the regression accuracy and training speed of the model. Experimental results on the NEU-DET dataset show that, compared with the original YOLOv8n, DDI-YOLO improves mAP by 2.4%, accuracy by 3.3%, and FPS by 59 frames/s. The proposed model therefore offers superior mAP, accuracy, and FPS for identifying surface defects in steel. Full article
(This article belongs to the Special Issue Advances in Image Recognition and Processing Technologies)
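
As commonly described in the literature, Inner-IoU computes IoU on auxiliary boxes that share the original centres but are rescaled by a ratio. The NumPy sketch below follows that reading; the box format and ratio value are assumptions, and this is not the exact loss used in DDI-YOLO.

```python
import numpy as np

def inner_iou(box_a, box_b, ratio=0.75):
    """Boxes in (cx, cy, w, h). Auxiliary 'inner' boxes keep the original centres
    but are scaled by `ratio`; IoU is then computed on the auxiliary boxes."""
    def corners(box):
        cx, cy, w, h = box
        w, h = w * ratio, h * ratio
        return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

    ax1, ay1, ax2, ay2 = corners(box_a)
    bx1, by1, bx2, by2 = corners(box_b)
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

pred, gt = np.array([50, 50, 20, 30]), np.array([52, 49, 22, 28])
loss = 1.0 - inner_iou(pred, gt)   # IoU-style box regression loss term
```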

20 pages, 10215 KiB  
Article
RANDnet: Vehicle Re-Identification with Relation Attention and Nuance–Disparity Masks
by Yang Huang, Hao Sheng and Wei Ke
Appl. Sci. 2024, 14(11), 4929; https://doi.org/10.3390/app14114929 - 6 Jun 2024
Viewed by 666
Abstract
Vehicle re-identification (vehicle ReID) is designed to recognize all instances of a specific vehicle across various camera viewpoints, facing significant challenges such as high similarity among different vehicles from the same viewpoint and substantial variance for the same vehicle across different viewpoints. In this paper, we introduce the RAND network, which is equipped with relation attention mechanisms, nuance, and disparity masks to tackle these issues effectively. The disparity mask specifically targets the automatic suppression of irrelevant foreground and background noise, while the nuance mask reveals less obvious, sub-discriminative regions to enhance the overall feature robustness. Additionally, our relation attention module, which incorporates an advanced transformer architecture, significantly reduces intra-class distances, thereby improving the accuracy of vehicle identification across diverse viewpoints. The performance of our approach has been thoroughly evaluated on widely recognized datasets such as VeRi-776 and VehicleID, where it demonstrates superior effectiveness and competes robustly with other leading methods. Full article
(This article belongs to the Special Issue Advances in Image Recognition and Processing Technologies)
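
The general mechanism behind masks of this kind, a learned per-pixel weighting that re-scales backbone features to suppress or highlight regions, can be sketched as below. This is a generic PyTorch illustration with assumed channel sizes, not the RAND network's actual nuance or disparity mask design.

```python
import torch
import torch.nn as nn

class SpatialMaskBranch(nn.Module):
    """Learns a per-pixel mask in [0, 1] and re-weights the backbone features,
    suppressing some regions (disparity-style) or emphasizing sub-discriminative
    ones (nuance-style), depending on how the branch is supervised."""
    def __init__(self, channels):
        super().__init__()
        self.mask_head = nn.Sequential(
            nn.Conv2d(channels, channels // 4, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, 1, 1),
            nn.Sigmoid(),
        )

    def forward(self, feats):                  # feats: (B, C, H, W)
        mask = self.mask_head(feats)           # (B, 1, H, W) spatial weighting
        return feats * mask, mask

feats = torch.randn(4, 256, 16, 16)            # assumed backbone feature shape
masked, mask = SpatialMaskBranch(256)(feats)
```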

18 pages, 5405 KiB  
Article
Expressway Vehicle Trajectory Prediction Based on Fusion Data of Trajectories and Maps from Vehicle Perspective
by Yuning Duan, Jingdong Jia, Yuhui Jin, Haitian Zhang and Jian Huang
Appl. Sci. 2024, 14(10), 4181; https://doi.org/10.3390/app14104181 - 15 May 2024
Viewed by 838
Abstract
Research on vehicle trajectory prediction based on road monitoring video data often utilizes a global map as an input, disregarding the fact that drivers rely on the road structures observable from their own positions for path planning. This oversight reduces the accuracy of prediction. To address this, we propose the CVAE-VGAE model, a novel trajectory prediction approach. Initially, our method transforms global perspective map data into vehicle-centric map data, representing it through a graph structure. Subsequently, Variational Graph Auto-Encoders (VGAEs), an unsupervised learning framework tailored for graph-structured data, are employed to extract road environment features specific to each vehicle’s location from the map data. Finally, a prediction network based on the Conditional Variational Autoencoder (CVAE) structure is designed, which first predicts the driving endpoint and then fits the complete future trajectory. The proposed CVAE-VGAE model integrates a self-attention mechanism into its encoding and decoding modules to infer endpoint intent and incorporate road environment features for precise trajectory prediction. Through a series of ablation experiments, we demonstrate the efficacy of our method in enhancing vehicle trajectory prediction metrics. Furthermore, we compare our model with traditional and frontier approaches, highlighting significant improvements in prediction accuracy. Full article
(This article belongs to the Special Issue Advances in Image Recognition and Processing Technologies)
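
The two-stage decoding described above, predicting the driving endpoint first and then fitting the full trajectory conditioned on it, can be sketched roughly as follows. The MLP heads, feature dimensions, and prediction horizon are assumptions, and the VGAE map encoder and CVAE latent sampling are omitted for brevity.

```python
import torch
import torch.nn as nn

class EndpointThenTrajectory(nn.Module):
    """Stage 1 predicts the destination point; stage 2 decodes the full future
    trajectory conditioned on the history encoding and that endpoint."""
    def __init__(self, hist_dim=64, horizon=12):
        super().__init__()
        self.horizon = horizon
        self.endpoint_head = nn.Sequential(nn.Linear(hist_dim, 64), nn.ReLU(), nn.Linear(64, 2))
        self.traj_head = nn.Sequential(nn.Linear(hist_dim + 2, 128), nn.ReLU(),
                                       nn.Linear(128, horizon * 2))

    def forward(self, hist_enc):                       # hist_enc: (B, hist_dim)
        endpoint = self.endpoint_head(hist_enc)        # (B, 2) predicted goal position
        traj = self.traj_head(torch.cat([hist_enc, endpoint], dim=-1))
        return endpoint, traj.view(-1, self.horizon, 2)

endpoint, future = EndpointThenTrajectory()(torch.randn(8, 64))   # toy history encodings
```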

23 pages, 2717 KiB  
Article
Why Not Both? An Attention-Guided Transformer with Pixel-Related Deconvolution Network for Face Super-Resolution
by Zhe Zhang and Chun Qi
Appl. Sci. 2024, 14(9), 3793; https://doi.org/10.3390/app14093793 - 29 Apr 2024
Viewed by 718
Abstract
Transformer-based encoder-decoder networks for face super-resolution (FSR) have achieved promising success in delivering stunningly clear and detailed facial images by capturing local and global dependencies. However, these methods have certain limitations. Specifically, the deconvolution in upsampling layers neglects the relationship between adjacent pixels, which is crucial in facial structure reconstruction. Additionally, raw feature maps are fed to the transformer blocks directly without mining their potential feature information, resulting in suboptimal face images. To circumvent these problems, we propose an attention-guided transformer with pixel-related deconvolution network for FSR. Firstly, we devise a novel Attention-Guided Transformer Module (AGTM), which is composed of an Attention-Guiding Block (AGB) and a Channel-wise Multi-head Transformer Block (CMTB). AGTM at the top of the encoder-decoder network (AGTM-T) promotes both local facial details and global facial structures, while AGTM at the bottleneck side (AGTM-B) optimizes the encoded features. Secondly, a Pixel-Related Deconvolution (PRD) layer is specially designed to establish direct relationships among adjacent pixels in the upsampling process. Lastly, we develop a Multi-scale Feature Fusion Module (MFFM) to fuse multi-scale features for better network flexibility and reconstruction results. Quantitative and qualitative experimental results on various datasets demonstrate that the proposed method outperforms other state-of-the-art FSR methods. Full article
(This article belongs to the Special Issue Advances in Image Recognition and Processing Technologies)
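
A simple way to make each upsampled pixel depend explicitly on a neighbourhood of low-resolution pixels, which is the general idea attributed to the Pixel-Related Deconvolution layer, is a convolution over a local window followed by pixel shuffle. The sketch below shows that approximation with an assumed kernel size and scale factor; it is not the paper's PRD layer.

```python
import torch
import torch.nn as nn

class NeighbourhoodUpsample(nn.Module):
    """Each upsampled pixel is computed from a 3x3 neighbourhood of low-res pixels,
    unlike plain transposed convolution where adjacent outputs can be decoupled."""
    def __init__(self, channels, scale=2):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels * scale * scale, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, x):                      # x: (B, C, H, W)
        return self.shuffle(self.proj(x))      # (B, C, scale*H, scale*W)

up = NeighbourhoodUpsample(64)
print(up(torch.randn(1, 64, 24, 24)).shape)    # torch.Size([1, 64, 48, 48])
```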

16 pages, 8740 KiB  
Article
Dynamic Downsampling Algorithm for 3D Point Cloud Map Based on Voxel Filtering
by Wenqi Lyu, Wei Ke, Hao Sheng, Xiao Ma and Huayun Zhang
Appl. Sci. 2024, 14(8), 3160; https://doi.org/10.3390/app14083160 - 9 Apr 2024
Cited by 4 | Viewed by 2729
Abstract
In response to the challenge of handling large-scale 3D point cloud data, downsampling is a common approach, yet it often leads to the problem of feature loss. We present a dynamic downsampling algorithm for 3D point cloud maps based on an improved voxel filtering approach. The algorithm consists of two modules, namely, dynamic downsampling and point cloud edge extraction. The former adapts voxel downsampling according to the features of the point cloud, while the latter preserves edge information within the 3D point cloud map. Comparative experiments with voxel downsampling, grid downsampling, clustering-based downsampling, random downsampling, uniform downsampling, and farthest-point downsampling were conducted. The proposed algorithm exhibited favorable downsampling simplification results, with a processing time of 0.01289 s and a simplification rate of 91.89%. Additionally, it demonstrated faster downsampling speed and showcased improved overall performance. This enhancement not only benefits productivity but also highlights the system’s efficiency and effectiveness. Full article
(This article belongs to the Special Issue Advances in Image Recognition and Processing Technologies)
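
For reference, the plain voxel-grid downsampling that the dynamic algorithm adapts replaces all points falling in a voxel with their centroid. A minimal NumPy sketch is shown below; the voxel size is an assumed parameter, and the paper's dynamic adaptation and edge-extraction steps are not reproduced.

```python
import numpy as np

def voxel_downsample(points, voxel_size=0.1):
    """points: (N, 3) array. Returns one centroid per occupied voxel."""
    keys = np.floor(points / voxel_size).astype(np.int64)        # voxel index per point
    _, inverse = np.unique(keys, axis=0, return_inverse=True)    # group points by voxel
    counts = np.bincount(inverse).astype(float)
    centroids = np.zeros((inverse.max() + 1, 3))
    for dim in range(3):
        centroids[:, dim] = np.bincount(inverse, weights=points[:, dim]) / counts
    return centroids

cloud = np.random.rand(100_000, 3) * 10.0
down = voxel_downsample(cloud, voxel_size=0.5)
print(cloud.shape, "->", down.shape)
```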

12 pages, 3423 KiB  
Article
AI Somatotype System Using 3D Body Images: Based on Deep-Learning and Transfer Learning
by Jiwun Yoon, Sang-Yong Lee and Ji-Yong Lee
Appl. Sci. 2024, 14(6), 2608; https://doi.org/10.3390/app14062608 - 20 Mar 2024
Cited by 1 | Viewed by 1248
Abstract
Humans share a similar body structure, but each individual possesses unique characteristics, which we define as one’s body type. Various classification methods have been devised to understand and assess these body types. Recent research has applied artificial intelligence technology utilizing noninvasive measurement tools, such as 3D body scanners, which minimize physical contact. The purpose of this study was to develop an artificial intelligence somatotype system capable of predicting the three body types proposed by Heath-Carter’s somatotype theory using 3D body images collected with a 3D body scanner. To classify body types, measurements were taken to determine the three somatotype components (endomorphy, mesomorphy, and ectomorphy). MobileNetV2 was utilized as the transfer learning model. The results of this study are as follows: First, the AI somatotype model showed good performance, with a training accuracy of around 91% and a validation accuracy of around 72%; the respective loss values were 0.26 for the training set and 0.69 for the validation set. Second, validation of the model’s performance using test data resulted in accurate predictions for 18 out of 21 new data points, with prediction errors in three cases, corresponding to approximately 85% classification accuracy. This study provides foundational data for subsequent research aiming to predict 13 detailed body types across the three body types. Furthermore, it is hoped that the outcomes of this research can be applied in practical settings, enabling anyone with a smartphone camera to identify various body types from captured images and predict obesity and diseases. Full article
(This article belongs to the Special Issue Advances in Image Recognition and Processing Technologies)
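
Transfer learning with MobileNetV2 for a three-class somatotype head typically looks like the PyTorch/torchvision sketch below: load ImageNet weights, freeze the backbone, and train only a replaced classifier layer. The frozen-backbone setup, hyperparameters, and input preparation are common practice assumed here, not necessarily the authors' exact training procedure.

```python
import torch
import torch.nn as nn
from torchvision import models

# Pretrained MobileNetV2 backbone, frozen; only the new 3-class head is trained.
model = models.mobilenet_v2(weights=models.MobileNet_V2_Weights.IMAGENET1K_V1)
for param in model.parameters():
    param.requires_grad = False
model.classifier[1] = nn.Linear(model.last_channel, 3)   # endomorph / mesomorph / ectomorph

optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

images = torch.randn(4, 3, 224, 224)        # stand-in for rendered 3D body-scan images
labels = torch.tensor([0, 1, 2, 1])
logits = model(images)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
```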

15 pages, 563 KiB  
Article
Camouflaged Object Detection Based on Deep Learning with Attention-Guided Edge Detection and Multi-Scale Context Fusion
by Yalin Wen, Wei Ke and Hao Sheng
Appl. Sci. 2024, 14(6), 2494; https://doi.org/10.3390/app14062494 - 15 Mar 2024
Viewed by 1816
Abstract
In nature, camouflaged objects have colors and textures that closely resemble their background, creating visual illusions that help them hide from predators. This similarity also makes camouflaged object detection (COD) very challenging. COD methods based on deep neural networks are gaining increasing attention; they typically improve model performance and computational efficiency by extracting edge information and using multi-layer feature fusion. Our improvement focuses on making the encode–decode process more efficient. We develop a variant model that combines the Swin Transformer (Swin-T) and EfficientNet-B7, integrating the strengths of both and employing an attention-guided tracking module to efficiently extract edge information and identify objects in camouflaged environments. We also incorporate dense skip links to enhance the aggregation of deep-level feature information. A boundary-aware attention module is added to the final layer of the initial shallow information recognition phase; it uses the Fourier transform to quickly relay specific edge information from the initially obtained shallow semantics to subsequent stages, thereby achieving feature recognition and edge extraction more effectively. In the later stage of deep semantic extraction, a dense skip joint attention module improves the decoder’s performance and efficiency in capturing precise deep-level information, feature recognition, and edge extraction, identifying the details and edge information of undetected camouflaged objects across channels and spaces. Differing from previous methods, we introduce an adaptive pixel strength loss function for handling key captured information. The proposed method shows strong competitive performance on three current benchmark datasets (CHAMELEON, CAMO, COD10K); compared with 26 previously proposed methods using 4 measurement metrics, our approach exhibits favorable competitiveness. Full article
(This article belongs to the Special Issue Advances in Image Recognition and Processing Technologies)
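
Separating high-frequency, edge-like content from a feature map with the Fourier transform, as the boundary-aware attention module is described as doing, can be sketched by masking out low frequencies and transforming back. The cutoff radius and hard circular mask below are illustrative assumptions, not the module's actual filtering scheme.

```python
import torch

def high_frequency_component(x, cutoff=0.1):
    """x: (B, C, H, W). Zero out low frequencies near the spectrum centre and
    transform back, leaving mostly edges and fine texture."""
    freq = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    B, C, H, W = x.shape
    yy, xx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    dist = torch.sqrt((yy - H / 2) ** 2 + (xx - W / 2) ** 2)
    mask = (dist > cutoff * min(H, W)).to(x.dtype)        # keep only high frequencies
    filtered = freq * mask
    return torch.fft.ifft2(torch.fft.ifftshift(filtered, dim=(-2, -1))).real

edges = high_frequency_component(torch.randn(1, 32, 64, 64))   # edge-like residual features
```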

20 pages, 6853 KiB  
Article
GLBRF: Group-Based Lightweight Human Behavior Recognition Framework in Video Camera
by Young-Chan Lee, So-Yeon Lee, Byeongchang Kim and Dae-Young Kim
Appl. Sci. 2024, 14(6), 2424; https://doi.org/10.3390/app14062424 - 13 Mar 2024
Viewed by 892
Abstract
Behavioral recognition is an important technique for recognizing actions by analyzing human behavior. It is used in various fields, such as anomaly detection and health estimation. For this purpose, deep learning models are used to recognize and classify the features and patterns of each behavior. However, video-based behavior recognition models require a lot of computational power as they are trained using large datasets. Therefore, there is a need for a lightweight learning framework that can efficiently recognize various behaviors. In this paper, we propose a group-based lightweight human behavior recognition framework (GLBRF) that achieves both low computational burden and high accuracy in video-based behavior recognition. The GLBRF system utilizes a relatively small dataset to reduce computational cost using a 2D CNN model and improves behavior recognition accuracy by applying location-based grouping to recognize interaction behaviors between people. This enables efficient recognition of multiple behaviors in various services. With grouping, the accuracy was as high as 98%, while without grouping, the accuracy was relatively low at 68%. Full article
(This article belongs to the Special Issue Advances in Image Recognition and Processing Technologies)
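
Location-based grouping, clustering detected people by spatial proximity before recognizing interaction behaviors, could be implemented with a simple distance-threshold clustering as sketched below; the threshold value and the use of single-linkage clustering are assumptions, not the exact GLBRF grouping rule.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def group_people(centroids, max_dist=80.0):
    """centroids: (N, 2) pixel coordinates of detected people.
    Returns a group label per person; people closer than max_dist end up together."""
    if len(centroids) < 2:
        return np.ones(len(centroids), dtype=int)
    return fcluster(linkage(centroids, method="single"), t=max_dist, criterion="distance")

people = np.array([[100, 200], [130, 210], [400, 420], [415, 430], [800, 100]])
print(group_people(people))    # e.g. [1 1 2 2 3] -> three interaction groups
```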

20 pages, 7751 KiB  
Article
SCGFormer: Semantic Chebyshev Graph Convolution Transformer for 3D Human Pose Estimation
by Jiayao Liang and Mengxiao Yin
Appl. Sci. 2024, 14(4), 1646; https://doi.org/10.3390/app14041646 - 18 Feb 2024
Viewed by 1345
Abstract
With the rapid advancement of deep learning, 3D human pose estimation has largely freed itself from reliance on manual annotation, and the effective utilization of joint features has become significant: leveraging 2D human joint information well can markedly improve the accuracy of 3D human skeleton prediction. In this paper, we propose the SCGFormer model to reduce the error in predicting human skeletal poses in three-dimensional space. The network architecture of SCGFormer encompasses a Transformer and two distinct types of graph convolution, organized into two interconnected modules: SGraAttention and AcChebGconv. SGraAttention extracts global feature information from each 2D human joint, thereby augmenting local feature learning by integrating prior knowledge of human joint relationships. Simultaneously, AcChebGconv broadens the receptive field for graph structure information and constructs implicit joint relationships to aggregate more valuable adjacent features. SCGFormer is tested on widely recognized benchmark datasets such as Human3.6M and MPI-INF-3DHP and achieves excellent results. In particular, on Human3.6M, our method achieves the best results in 9 of 15 actions, with an overall average error reduction of about 1.5 points compared to state-of-the-art methods, demonstrating the excellent performance of SCGFormer. Full article
(This article belongs to the Special Issue Advances in Image Recognition and Processing Technologies)
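
A Chebyshev graph convolution of the kind underlying AcChebGconv evaluates polynomials of the scaled graph Laplacian through the recurrence T_k = 2 L_scaled T_(k-1) - T_(k-2). The PyTorch sketch below implements that standard form with assumed joint counts and feature sizes; the paper's adaptive variant and its coupling with attention are not reproduced.

```python
import torch
import torch.nn as nn

class ChebGraphConv(nn.Module):
    """Chebyshev graph convolution: y = sum_k T_k(L_scaled) X W_k."""
    def __init__(self, in_dim, out_dim, K=3):
        super().__init__()
        self.weights = nn.Parameter(torch.randn(K, in_dim, out_dim) * 0.01)

    def forward(self, x, lap_scaled):          # x: (N, in_dim), lap_scaled: (N, N)
        Tx = [x, lap_scaled @ x]               # T_0 = X, T_1 = L_scaled X
        for _ in range(2, self.weights.shape[0]):
            Tx.append(2 * lap_scaled @ Tx[-1] - Tx[-2])   # Chebyshev recurrence
        return sum(t @ w for t, w in zip(Tx, self.weights))

# Example: 17 skeleton joints, 2D inputs -> 64-dim features.
# L_scaled = 2 L / lambda_max - I would normally be precomputed from the skeleton graph;
# an identity matrix stands in here as a placeholder.
joints = torch.randn(17, 2)
lap_scaled = torch.eye(17)
out = ChebGraphConv(2, 64)(joints, lap_scaled)   # (17, 64) per-joint features
```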