Topic Editors

Department of Engineering (DING), University of Sannio, Benevento, Italy
Department of Computer Science, Royal Holloway, University of London, Surrey TW20 0EX, UK

Computer Vision and Image Processing, 2nd Edition

Abstract submission deadline
closed (30 September 2024)
Manuscript submission deadline
closed (31 December 2024)
Viewed by
70462

Topic Information

Dear Colleagues,

The field of computer vision and image processing has advanced significantly in recent years, with new techniques and applications emerging constantly. Building on the success of our first edition, we are pleased to announce a second edition on this exciting topic. We invite researchers, academics, and practitioners to submit original research articles, reviews, or case studies that address the latest developments in computer vision and image processing. Topics of interest include but are not limited to:

  • Deep learning for image classification and recognition
  • Object detection and tracking
  • Image segmentation and analysis
  • 3D reconstruction and modeling
  • Image and video compression
  • Image enhancement and restoration
  • Medical image processing and analysis
  • Augmented and virtual reality

Submissions should be original and should not have been published or submitted elsewhere. All papers will be peer-reviewed by at least two experts in the field, and accepted papers will be published together on the topic website. To submit your paper, please visit the journal's website and follow the submission guidelines. For any queries, please contact the guest editors of the topic.

We look forward to receiving your submissions and sharing the latest advancements in computer vision and image processing with our readers.

Prof. Silvia Liberata Ullo
Prof. Dr. Li Zhang
Topic Editors

Keywords

  • 3D acquisition, processing, and visualization
  • scene understanding
  • multimodal sensor processing and fusion
  • multispectral, color, and greyscale image processing
  • industrial quality inspection
  • computer vision for robotics
  • computer vision for surveillance
  • airborne and satellite on-board image acquisition platforms
  • computational models of vision
  • imaging psychophysics

Participating Journals

Journal Name        Impact Factor  CiteScore  Launched Year  First Decision (median)  APC
Applied Sciences    2.5            5.3        2011           18.4 Days                CHF 2400
Electronics         2.6            5.3        2012           16.4 Days                CHF 2400
Journal of Imaging  2.7            5.9        2015           18.3 Days                CHF 1800
Mathematics         2.3            4.0        2013           18.3 Days                CHF 2600
Remote Sensing      4.2            8.3        2009           23.9 Days                CHF 2700

Preprints.org is a multidisciplinary platform providing a preprint service dedicated to sharing your research from the start and empowering your research journey.

MDPI Topics is cooperating with Preprints.org and has built a direct connection between MDPI journals and Preprints.org. Authors are encouraged to enjoy the benefits by posting a preprint at Preprints.org prior to publication:

  1. Immediately share your ideas ahead of publication and establish your research priority;
  2. Protect your idea from being stolen with this time-stamped preprint article;
  3. Enhance the exposure and impact of your research;
  4. Receive feedback from your peers in advance;
  5. Have it indexed in Web of Science (Preprint Citation Index), Google Scholar, Crossref, SHARE, PrePubMed, Scilit and Europe PMC.

Published Papers (62 papers)

17 pages, 1000 KiB  
Article
Zero-Shot Day–Night Domain Adaptation for Face Detection Based on DAl-CLIP-Dino
by Huadong Sun, Yinghui Liu, Ziyang Chen and Pengyi Zhang
Viewed by 252
Abstract
Two challenges in computer vision (CV) related to face detection are the difficulty of acquiring data in the target domain and the degradation of image quality. Especially in low-light situations, poorly visible images are difficult to label, so detectors trained under well-lit conditions exhibit reduced performance in low-light environments. Conventional image enhancement and object detection techniques are unable to resolve the inherent difficulties in collecting and labeling low-light images. The Dark-Illuminated Network with Contrastive Language–Image Pretraining (CLIP) and Self-Supervised Vision Transformer (Dino), abbreviated as DAl-CLIP-Dino, is proposed to address the degradation of object detection performance in low-light environments and achieve zero-shot day–night domain adaptation. Specifically, an advanced reflectance representation learning module (which leverages Retinex decomposition to extract reflectance and illumination features from both low-light and well-lit images) and an interchange–redecomposition coherence process (which performs a second decomposition on reconstructed images after the exchange to generate a second round of reflectance and illumination predictions while validating their consistency using a redecomposition consistency loss) are employed to achieve illumination invariance and enhance model performance. CLIP (the ViT-based image encoder part) and Dino have been integrated for feature extraction, improving performance under extreme lighting conditions and enhancing the model's generalization capability. Our model achieves a mean average precision (mAP) of 29.6% for face detection on the DARK FACE dataset, outperforming other models in zero-shot domain adaptation for face detection.
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

17 pages, 783 KiB  
Article
IOF-Tracker: A Two-Stage Multiple Targets Tracking Method Using Spatial-Temporal Fusion Algorithm
by Hongbin Liu, Yongze Zhao, Peng Dong, Xiuyi Guo and Yilin Wang
Appl. Sci. 2025, 15(1), 107; https://doi.org/10.3390/app15010107 - 26 Dec 2024
Viewed by 426
Abstract
Multi-object tracking aims to track multiple objects across consecutive frames in a video, assigning a unique identifier to each object. However, issues such as occlusions, directional changes, or shape alterations can cause appearance variations, leading to detection and matching problems that in turn result in frequent ID switches. To solve these issues, this paper proposes a two-stage multi-object tracking framework based on a spatial and temporal fusion algorithm. First, the video frames are processed by a detector to identify objects and form rectangular detection areas. Meanwhile, an estimator predicts the target rectangular areas in the next frame. Then, we extract the optical flow of the target pixels within the detection and prediction areas and establish a temporal information model by calculating the average of the target pixels’ optical flow. Afterward, we present a spatial information model using the R-IoU (Reverse of Intersection over Union) between the detection and prediction areas. This spatial and temporal information is combined through weighted matrix fusion, which accomplishes the feature matching and association task. Finally, we implement a two-stage association multi-object tracking model using this fusion algorithm. Experiments on the MOTChallenge dataset using the official detector show that our two-stage multi-object tracking method based on the spatial and temporal fusion algorithm is robust in handling occlusions and ID switch issues. As of the submission of this paper, the proposed method has achieved the top ranking in the MOT17 benchmark when evaluated with the official detector.
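
The fusion of a spatial term and a temporal term into one matching cost can be pictured with the minimal sketch below. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not define R-IoU, so the sketch assumes it means 1 - IoU, and the function names, the Euclidean flow-difference term, and the weight `alpha` are all hypothetical.

```python
from math import hypot


def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def fused_cost(det_box, pred_box, det_flow, pred_flow, alpha=0.5):
    """Weighted sum of a spatial cost and a temporal cost (lower is better).

    Assumption: the spatial cost is 1 - IoU, and the temporal cost is the
    distance between the mean optical-flow vectors of the two areas.
    """
    spatial = 1.0 - iou(det_box, pred_box)
    temporal = hypot(det_flow[0] - pred_flow[0], det_flow[1] - pred_flow[1])
    return alpha * spatial + (1.0 - alpha) * temporal


def cost_matrix(dets, preds, alpha=0.5):
    """Rows are detections, columns are predictions."""
    return [
        [fused_cost(d["box"], p["box"], d["flow"], p["flow"], alpha) for p in preds]
        for d in dets
    ]


# Made-up example: one detection, two candidate predictions.
dets = [{"box": (0, 0, 2, 2), "flow": (1.0, 0.0)}]
preds = [
    {"box": (0, 0, 2, 2), "flow": (1.0, 0.0)},    # same place, same motion
    {"box": (5, 5, 7, 7), "flow": (-1.0, 0.0)},   # elsewhere, opposite motion
]
costs = cost_matrix(dets, preds)
```

In a full tracker the resulting matrix would be handed to an assignment step; the sketch stops at the cost matrix because the abstract does not say which assignment method is used.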

27 pages, 9095 KiB  
Article
BMFusion: Bridging the Gap Between Dark and Bright in Infrared-Visible Imaging Fusion
by Chengwen Liu, Bin Liao and Zhuoyue Chang
Electronics 2024, 13(24), 5005; https://doi.org/10.3390/electronics13245005 - 19 Dec 2024
Viewed by 478
Abstract
The fusion of infrared and visible light images is a crucial technology for enhancing visual perception in complex environments. It plays a pivotal role in improving visual perception and subsequent performance in advanced visual tasks. However, due to the significant degradation of visible light image quality in low-light or nighttime scenes, most existing fusion methods struggle to obtain sufficient texture details and salient features when processing such scenes. This can lead to a decrease in fusion quality. To address this issue, this article proposes a new image fusion method called BMFusion. Its aim is to significantly improve the quality of fused images in low-light or nighttime scenes and generate high-quality fused images around the clock. This article first designs a brightness attention module composed of brightness attention units. It extracts multimodal features by combining the SimAm attention mechanism with a Transformer architecture. Brightness and features are effectively enhanced, with gradual brightness attention performed during feature extraction. Second, a complementary fusion module is designed. This module deeply fuses infrared and visible light features to ensure the complementarity and enhancement of each modal feature during the fusion process, minimizing information loss. In addition, a feature reconstruction network combining CLIP-guided semantic vectors and neighborhood attention enhancement is proposed for the feature reconstruction stage. It uses the KAN module to perform channel-adaptive optimization of the reconstruction process, ensuring semantic consistency and detail integrity of the fused image. Experimental results on a large number of public datasets demonstrate that the BMFusion method can generate fusion images with higher visual quality and richer details in night and low-light environments compared with various existing state-of-the-art (SOTA) algorithms. At the same time, the fusion image can significantly improve the performance of advanced visual tasks. This shows the great potential and application prospect of this method in the field of multimodal image fusion.

23 pages, 6881 KiB  
Article
Research on Lightweight Open-Pit Mine Driving Obstacle Detection Algorithm Based on Improved YOLOv8s
by Bo Xu, Wubin Xu, Bing Li, Hanwen Zhang, Yuanbin Xiao and Weixin Zhou
Appl. Sci. 2024, 14(24), 11741; https://doi.org/10.3390/app142411741 - 16 Dec 2024
Viewed by 385
Abstract
The road environment of open-pit mines is complex and unstructured, and unmanned construction machinery driving faces huge challenges. Improving the accuracy and speed of obstacle detection during driving is of great significance to ensuring the safety of mine automation construction and improving the efficiency of overall unmanned operations. Current obstacle detection algorithms struggle to strike a balance between high precision and real-time performance, and they suffer from problems such as difficult model deployment or unsuitability for practical applications. To address this, a lightweight open-pit mine driving obstacle detection algorithm based on improved YOLOv8s is proposed, with the aim of improving the driving safety of unmanned engineering machinery in open-pit mines. To enhance the ability of the backbone to capture features, the idea of the contextual-information guidance module (CGBlock) is introduced to construct a new CGC2f module. The efficient squeeze excitation (ESE) attention mechanism is embedded in the feature fusion layer to make the model pay more attention to the channels containing important feature information. To enhance the model’s learning ability for obstacles of different sizes in the open-pit mine, a more suitable dynamic head network (DyHead) is used at the output end. To further improve real-time performance, the layer-based adaptive amplitude pruning (LAMP) score algorithm is used to prune redundant weight parameters. To verify the effectiveness of the algorithm, experiments are carried out on the constructed open-pit mine driving obstacle dataset. The results show that the mAP50 of this algorithm reaches 95.3% and that, compared with YOLOv8s, the detection speed is increased by 40.2%, the model parameters are reduced by 71.2%, and the calculation amount is reduced by 73.7%. The algorithm meets the requirements of real-time and high-precision obstacle detection in open-pit mine driving and provides technical support for smart mine driving.
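
For readers unfamiliar with LAMP scoring, the sketch below shows the score as it is usually defined for a single layer: each weight's squared magnitude divided by the sum of squared magnitudes of all weights in that layer that are at least as large. This is an assumption about the general technique; the abstract does not describe how the paper applies it, and the function name and the example weights are hypothetical.

```python
def lamp_scores(weights):
    """Return a LAMP-style score for each weight of one layer (flat list).

    Assumed definition: score(u) = w_u^2 / sum of w_v^2 over all weights v
    whose magnitude is >= |w_u|. The largest weight always scores 1.0, so
    small scores mark the candidates for pruning.
    """
    order = sorted(range(len(weights)), key=lambda k: abs(weights[k]))
    scores = [0.0] * len(weights)
    tail = 0.0  # running sum of squares, accumulated from largest to smallest
    for k in reversed(order):
        sq = weights[k] ** 2
        tail += sq
        scores[k] = sq / tail if tail > 0 else 0.0
    return scores


# Made-up layer with three weights.
example = lamp_scores([3.0, -1.0, 2.0])
```

Because the score is normalized within each layer, scores from different layers can be compared against a single global threshold, which is the usual motivation for this kind of layer-adaptive pruning.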

17 pages, 15701 KiB  
Article
Unsupervised Semantic Scene Reconstruction via Transformer-Based Quantized Vector Reconstruction and Autoregressive Completion
by Yubin Miao, Shuxin Xie, Tianrui Quan, Junkang Wan and Mengxiang Hao
Electronics 2024, 13(24), 4922; https://doi.org/10.3390/electronics13244922 - 13 Dec 2024
Viewed by 388
Abstract
Semantic scene reconstruction from sparse and incomplete point clouds is a vital task in understanding point scenes. This task involves assigning semantic labels to objects and reconstructing their complete shapes as meshes. In recent years, researchers have adopted a “reconstruction from recognition” approach, which first segments foreground objects from the point cloud and then completes and reconstructs them as mesh representations. This method has successfully facilitated both the semantic and geometric understanding of point scenes. However, existing approaches based on deep learning often depend on supervised training, requiring extensive annotations and incurring high training costs. To address this limitation, we introduce unsupervised algorithms for completing and reconstructing partial observations. While Transformer-based autoregressive shape completion shows great potential, there has been limited research on applying it to complete instances segmented from real-world scenes. To bridge this gap, we propose VRC (unsupervised semantic scene reconstruction via Transformer-based quantized Vector Reconstruction and autoregressive Completion), a novel framework that integrates unsupervised algorithms with Transformer-based autoregressive completion. Our approach enables the unsupervised reconstruction of real-world scenes. Comparisons with state-of-the-art methods on authoritative public datasets demonstrate that VRC achieves superior reconstruction performance with significantly reduced data costs.

16 pages, 1289 KiB  
Article
DAT: Deep Learning-Based Acceleration-Aware Trajectory Forecasting
by Ali Asghar Sharifi, Ali Zoljodi and Masoud Daneshtalab
J. Imaging 2024, 10(12), 321; https://doi.org/10.3390/jimaging10120321 - 13 Dec 2024
Viewed by 482
Abstract
As the demand for autonomous driving (AD) systems has increased, the enhancement of their safety has become critically important. A fundamental capability of AD systems is object detection and trajectory forecasting of vehicles and pedestrians around the ego-vehicle, which is essential for preventing potential collisions. This study introduces the Deep learning-based Acceleration-aware Trajectory forecasting (DAT) model, a deep learning-based approach for object detection and trajectory forecasting, utilizing raw sensor measurements. DAT is an end-to-end model that processes sequential sensor data to detect objects and forecasts their future trajectories at each time step. The core innovation of DAT lies in its novel forecasting module, which leverages acceleration data to enhance trajectory forecasting, leading to the consideration of a variety of agent motion models. We propose a robust and innovative method for estimating ground-truth acceleration for objects, along with an object detector that predicts acceleration attributes for each detected object and a novel method for trajectory forecasting. DAT is trained and evaluated on the NuScenes dataset, demonstrating its empirical effectiveness through extensive experiments. The results indicate that DAT significantly surpasses state-of-the-art methods, particularly in enhancing forecasting accuracy for objects exhibiting both linear and nonlinear motion patterns, achieving up to a 2× improvement. This advancement highlights the critical role of incorporating acceleration data into predictive models, representing a substantial step forward in the development of safer autonomous driving systems.

14 pages, 9209 KiB  
Communication
Implementation of an FPGA-Based System to Process Images and Match Keypoints on High-Resolution Pictures
by Sina Bundschuh, Jan Kunze and Klaus-Dieter Kuhnert
Electronics 2024, 13(23), 4774; https://doi.org/10.3390/electronics13234774 - 3 Dec 2024
Viewed by 557
Abstract
Processing scenery and finding points of interest is crucial for applications in robotics and aerospace missions. Those areas require efficient and reliable visual input processing. Here, field programmable gate arrays (FPGAs) offer essential advantages, such as low power consumption compared to CPUs, the ability to perform a large number of calculations simultaneously, and compact hardware. This paper presents an FPGA system that processes incoming camera data, finds points of interest, and matches them across different high-resolution images (2048 × 1088). It is a novel approach to implement the complete image processing pipeline on high-resolution images within the FPGA fabric without additional hardware. For keypoint detection and matching, our work uses a modified SIFT algorithm optimized for FPGA implementation and a nearest neighbor-based matching method. It was implemented on a Xilinx Kintex-7 FPGA and partially on a NanoXplore NG-Ultra to evaluate a radiation-hardened FPGA for space applications. On the Kintex-7, the keypoint detection achieves a speed of 33 ms per image, and its features are matched on up to 5 images per second. Judging by the resource utilization of one image processing module on the NG-Ultra, porting the entire system to a radiation-hardened FPGA appears feasible.

18 pages, 27038 KiB  
Article
A Knowledge Base Driven Task-Oriented Image Semantic Communication Scheme
by Chang Guo, Junhua Xi, Zhanhao He, Jiaqi Liu and Jungang Yang
Remote Sens. 2024, 16(21), 4044; https://doi.org/10.3390/rs16214044 - 30 Oct 2024
Viewed by 642
Abstract
With the development of artificial intelligence and computer hardware, semantic communication has been attracting great interest. As an emerging communication paradigm, semantic communication can reduce the requirement for channel bandwidth by extracting semantic information. This makes it an effective method for image acquisition by unmanned aerial vehicles, which must transmit high-data-volume images within the constraints of limited available bandwidth. However, existing semantic communication schemes fail to adequately incorporate the guidance of task requirements into the semantic communication process and are difficult to adapt to dynamic changes in tasks. A task-oriented image semantic communication scheme driven by a knowledge base is proposed, aiming to achieve a high compression ratio and high-quality image reconstruction and to effectively address the bandwidth limitation. This scheme segments the input image into several semantic information units under the guidance of task requirements using Yolo-World and the Segment Anything Model. Bandwidth is assigned to each unit according to its task relevance score, which enables high-quality transmission of task-related information with lower communication overhead. An improved metric, weighted learned perceptual image patch similarity (LPIPS), is proposed to evaluate the transmission accuracy of the novel scheme. Experimental results show that our scheme achieves a notable performance improvement on weighted LPIPS at the same compression ratio compared with traditional image compression schemes. Our scheme also has a higher target capture ratio than traditional image compression schemes under the task of target detection.

16 pages, 3941 KiB  
Article
DecoupleCLIP: A Novel Cross-Modality Decouple Model for Painting Captioning
by Mingliang Zhang, Xia Hou, Yujing Yan and Meng Sun
Electronics 2024, 13(21), 4207; https://doi.org/10.3390/electronics13214207 - 27 Oct 2024
Viewed by 590
Abstract
Image captioning aims to describe the content in an image, which plays a critical role in image understanding. Existing methods tend to generate the text for more distinct natural images. These models do not work well for paintings, which carry more abstract meaning, because objective parsing is limited without related knowledge. To alleviate this, we propose a novel cross-modality decouple model that generates the objective and subjective parsing separately. Concretely, we propose to encode both the subjective semantics and the implied knowledge contained in the paintings. The key point of our framework is decoupled CLIP-based branches (DecoupleCLIP). For the objective caption branch, we utilize the CLIP model as the global feature extractor and construct a feature fusion module for global clues. Based on the objective caption branch structure, we add a multimodal fusion module called the artistic conception branch. In this way, the objective captions can constrain artistic conception content. We conduct extensive experiments on our new dataset to demonstrate the superior ability of DecoupleCLIP. Our model achieves nearly 2% improvement over other comparison models on CIDEr.

20 pages, 16843 KiB  
Technical Note
STCA: High-Altitude Tracking via Single-Drone Tracking and Cross-Drone Association
by Yu Qiao, Huijie Fan, Qiang Wang, Tinghui Zhao and Yandong Tang
Remote Sens. 2024, 16(20), 3861; https://doi.org/10.3390/rs16203861 - 17 Oct 2024
Viewed by 692
Abstract
In this paper, we introduce a high-altitude multi-drone multi-target (HAMDMT) tracking method called STCA, which aims to collaboratively track similar targets that are easily confused. We approach this challenge by dividing HAMDMT tracking into two principal tasks: Single-Drone Tracking and Cross-Drone Association. Single-Drone Tracking employs positional and appearance data vectors to overcome the challenges arising from similar target appearances within the field of view of a single drone. Cross-Drone Association employs image-matching technology (LightGlue) to ascertain the topological relationships between images captured by different drones, thereby accurately determining the associations between targets across multiple drones. For Cross-Drone Association, we enhance LightGlue into a more effective method, designated T-LightGlue, for cross-drone target tracking. This approach markedly accelerates the tracking process while reducing indicator dropout. To narrow down the range of targets involved in the cross-drone association, we develop a Common View Area Model based on the four vertices of the image. To mitigate the occlusion encountered by high-altitude drones, we design a Local-Matching Model that assigns the same ID to the mutually nearest pair of targets from different drones after mapping the centroids of the targets across drones. The MDMT dataset is the only one captured by a high-altitude drone and contains a substantial number of similar vehicles. On the MDMT dataset, STCA achieves the highest MOTA and the second-highest IDF1 in Single-Drone Tracking, and the highest MDA in Cross-Drone Association.
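
The mutual-nearest-pair rule mentioned for the Local-Matching Model can be illustrated with the short sketch below. It is a generic mutual nearest neighbor match, not the authors' code: it assumes the target centroids from both drones have already been mapped into one common coordinate frame, and the function name and the optional `max_dist` gate are hypothetical.

```python
from math import dist


def mutual_nearest_pairs(points_a, points_b, max_dist=float("inf")):
    """Return index pairs (i, j) where points_a[i] and points_b[j] are each
    other's nearest neighbor and lie within max_dist of one another."""
    if not points_a or not points_b:
        return []
    # Nearest point in B for every point in A, and vice versa.
    nearest_b = [
        min(range(len(points_b)), key=lambda j: dist(a, points_b[j]))
        for a in points_a
    ]
    nearest_a = [
        min(range(len(points_a)), key=lambda i: dist(points_a[i], b))
        for b in points_b
    ]
    pairs = []
    for i, j in enumerate(nearest_b):
        if nearest_a[j] == i and dist(points_a[i], points_b[j]) <= max_dist:
            pairs.append((i, j))
    return pairs


# Made-up centroids, already expressed in a shared frame.
drone_a = [(0.0, 0.0), (10.0, 10.0)]
drone_b = [(9.0, 9.0), (1.0, 0.0), (50.0, 50.0)]
matches = mutual_nearest_pairs(drone_a, drone_b)
```

Each matched pair would then share one ID; the unmatched target at (50, 50) keeps its own, which is the behavior a mutual check buys over a one-directional nearest neighbor search.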

22 pages, 29294 KiB  
Article
Ghost Removal from Forward-Scan Sonar Views near the Sea Surface for Image Enhancement and 3-D Object Modeling
by Yuhan Liu and Shahriar Negahdaripour
Remote Sens. 2024, 16(20), 3814; https://doi.org/10.3390/rs16203814 - 14 Oct 2024
Viewed by 911
Abstract
Underwater sonar is the primary remote sensing and imaging modality within turbid environments with poor visibility. The two-dimensional (2-D) images of a target near the air–sea interface (or resting on a hard seabed), acquired by forward-scan sonar (FSS), are generally corrupted by the ghost and sometimes mirror components, formed by the multipath propagation of transmitted acoustic beams. In the processing of the 2-D FSS views to generate an accurate three-dimensional (3-D) object model, the corrupted regions have to be discarded. The sonar tilt angle and distance from the sea surface are two important parameters for the accurate localization of the ghost and mirror components. We propose a unified optimization technique for improving both the measurements of these two parameters from inexpensive sensors and the accuracy of a 3-D object model using 2-D FSS images at known poses. The solution is obtained by the recursive updating of sonar parameters and 3-D object model. Utilizing the 3-D object model, we can enhance the original images and generate synthetic views for arbitrary sonar poses. We demonstrate the performance of our method in experiments with the synthetic and real images of three targets: two dominantly convex coral rocks and a highly concave toy wood table.

18 pages, 3999 KiB  
Article
SS-YOLOv8: A Lightweight Algorithm for Surface Litter Detection
by Zhipeng Fan, Zheng Qin, Wei Liu, Ming Chen and Zeguo Qiu
Appl. Sci. 2024, 14(20), 9283; https://doi.org/10.3390/app14209283 - 12 Oct 2024
Viewed by 1031
Abstract
With the advancement of science and technology, pollution in rivers and water surfaces has increased, impacting both ecology and public health. Timely identification of surface waste is crucial for effective cleanup. Traditional edge detection devices struggle with limited memory and resources, making the YOLOv8 algorithm inefficient. This paper introduces a lightweight network model for detecting water surface litter. We enhance the CSP Bottleneck with a two-convolutions (C2f) module to improve image recognition tasks. By implementing the powerful intersection over union 2 (PIoU2), we enhance model accuracy over the original CIoU. Our novel Shared Convolutional Detection Head (SCDH) minimizes parameters, while the scale layer optimizes feature scaling. Using a slimming pruning method, we further reduce the model’s size and computational needs. Our model achieves a mean average precision (mAP) of 79.9% on the surface litter dataset, with a compact size of 2.3 MB and a processing rate of 128 frames per second, meeting real-time detection requirements. This work significantly contributes to efficient environmental monitoring and offers a scalable solution for deploying advanced detection models on resource-constrained devices.

18 pages, 9898 KiB  
Article
Land Cover Mapping in East China for Enhancing High-Resolution Weather Simulation Models
by Bingxin Ma, Yang Shao, Hequn Yang, Yiwen Lu, Yanqing Gao, Xinyao Wang, Ying Xie and Xiaofeng Wang
Remote Sens. 2024, 16(20), 3759; https://doi.org/10.3390/rs16203759 - 10 Oct 2024
Viewed by 969
Abstract
This study was designed to develop a 30 m resolution land cover dataset to improve the performance of regional weather forecasting models in East China. A 10-class land cover mapping scheme was established, reflecting East China’s diverse landscape characteristics and incorporating a new category for plastic greenhouses. Plastic greenhouses are key to understanding surface heterogeneity in agricultural regions, as they can significantly impact local climate conditions, such as heat flux and evapotranspiration, yet they are often not represented in conventional land cover classifications. This is mainly due to the lack of high-resolution datasets capable of detecting these small yet impactful features. For the six-province study area, we selected and processed Landsat 8 imagery from 2015–2018, filtering for cloud cover. Complementary datasets, such as digital elevation models (DEM) and nighttime lighting data, were integrated to enrich the inputs for the Random Forest classification. A comprehensive training dataset was compiled to support Random Forest training and classification accuracy. We developed an automated workflow to manage the data processing, including satellite image selection, preprocessing, classification, and image mosaicking, thereby ensuring the system’s practicality and facilitating future updates. We included three Weather Research and Forecasting (WRF) model experiments in this study to highlight the impact of our land cover maps on daytime and nighttime temperature predictions. The resulting regional land cover dataset achieved an overall accuracy of 83.2% and a Kappa coefficient of 0.81. These accuracy statistics are higher than existing national and global datasets. The model results suggest that the newly developed land cover, combined with a mosaic option in the Unified Noah scheme in WRF, provided the best overall performance for both daytime and nighttime temperature predictions. 
In addition to supporting the WRF model, our land cover map products, with a planned 3–5-year update schedule, could serve as a valuable data source for ecological assessments in the East China region, informing environmental policy and promoting sustainability. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

20 pages, 6728 KiB  
Article
Diffusion Model for Camouflaged Object Segmentation with Frequency Domain
by Wei Cai, Weijie Gao, Yao Ding, Xinhao Jiang, Xin Wang and Xingyu Di
Electronics 2024, 13(19), 3922; https://doi.org/10.3390/electronics13193922 - 3 Oct 2024
Viewed by 1223
Abstract
Camouflaged object segmentation (COS) is a challenging task that entails identifying objects that closely blend in with their surrounding background. Furthermore, the camouflaged object’s obscure form and its subtle differentiation from the background present significant challenges during the feature extraction phase of the network. To extract more comprehensive information and thereby improve the accuracy of COS, we propose FreDiff, a diffusion-model-based COS network that uses frequency domain information as auxiliary input. First, we propose a frequency auxiliary module (FAM) to extract frequency domain features. Then, we design a Global Fusion Module (GFM) to make FreDiff attend to global features. Finally, we propose an Upsample Enhancement Module (UEM) to enhance the detailed information of the features and perform upsampling before inputting them into the diffusion model. Additionally, taking into account the specific characteristics of COS, we develop a specialized training strategy for FreDiff. We compared FreDiff with 17 COS models on four challenging COS datasets. Experimental results showed that FreDiff outperforms or is on par with other state-of-the-art methods under five evaluation metrics. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

18 pages, 3869 KiB  
Article
Spatio-Temporal Dynamic Attention Graph Convolutional Network Based on Skeleton Gesture Recognition
by Xiaowei Han, Ying Cui, Xingyu Chen, Yunjing Lu and Wen Hu
Electronics 2024, 13(18), 3733; https://doi.org/10.3390/electronics13183733 - 20 Sep 2024
Cited by 1 | Viewed by 790
Abstract
Dynamic gesture recognition based on skeletal data has garnered significant attention with the rise of graph convolutional networks (GCNs). Existing methods typically calculate dependencies between joints and utilize spatio-temporal attention features. However, they often rely on joint topological features of limited spatial extent and short-time features, making it challenging to extract intra-frame spatial features and long-term inter-frame temporal features. To address this, we propose a new GCN architecture for dynamic hand gesture recognition, called a spatio-temporal dynamic attention graph convolutional network (STDA-GCN). This model employs dynamic attention spatial graph convolution, enhancing spatial feature extraction capabilities while reducing computational complexity through improved cross-channel information interaction. Additionally, a salient location channel attention mechanism is integrated between spatio-temporal convolutions to extract useful spatial features and avoid redundancy. Finally, dynamic multi-scale temporal convolution is used to extract richer inter-frame gesture features, effectively capturing information across various time scales. Evaluations on the SHREC’17 Track and DHG-14/28 benchmark datasets show that our model achieves 97.14% and 95.84% accuracy, respectively. These results demonstrate the superior performance of STDA-GCN in dynamic gesture recognition tasks. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

22 pages, 11714 KiB  
Article
A Light-Weight Self-Supervised Infrared Image Perception Enhancement Method
by Yifan Xiao, Zhilong Zhang and Zhouli Li
Electronics 2024, 13(18), 3695; https://doi.org/10.3390/electronics13183695 - 18 Sep 2024
Viewed by 935
Abstract
Convolutional Neural Networks (CNNs) have achieved remarkable results in the field of infrared image enhancement. However, research on the visual perception mechanism and on objective evaluation indicators for enhanced infrared images remains limited. To make the subjective and objective evaluation more consistent, this paper uses a perceptual metric to evaluate the enhancement effect of infrared images. The perceptual metric mimics the early conversion process of the human visual system and uses the normalized Laplacian pyramid distance (NLPD) between the enhanced image and the original scene radiance to evaluate the image enhancement effect. Based on this, this paper designs an infrared image-enhancement algorithm that is more conducive to human visual perception. The algorithm uses a lightweight Fully Convolutional Network (FCN), with NLPD as the similarity measure, and trains the network in a self-supervised manner by minimizing the NLPD between the enhanced image and the original scene radiance to achieve infrared image enhancement. The experimental results show that the proposed infrared image enhancement method outperforms existing methods in terms of visual perception quality and, owing to its lightweight network, is also the fastest of the compared enhancement methods. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

21 pages, 9454 KiB  
Article
Denoising Diffusion Implicit Model for Camouflaged Object Detection
by Wei Cai, Weijie Gao, Xinhao Jiang, Xin Wang and Xingyu Di
Electronics 2024, 13(18), 3690; https://doi.org/10.3390/electronics13183690 - 17 Sep 2024
Viewed by 919
Abstract
Camouflaged object detection (COD) is a challenging task that involves identifying objects that closely resemble their background. In order to detect camouflaged objects more accurately, we propose a diffusion model for the COD network called DMNet. DMNet formulates COD as a denoising diffusion process from noisy boxes to prediction boxes. During the training stage, random boxes diffuse from ground-truth boxes, and DMNet learns to reverse this process. In the sampling stage, DMNet progressively refines random boxes to prediction boxes. In addition, due to the camouflaged object’s blurred appearance and the low contrast between it and the background, the feature extraction stage of the network is challenging. Firstly, we proposed a parallel fusion module (PFM) to enhance the information extracted from the backbone. Then, we designed a progressive feature pyramid network (PFPN) for feature fusion, in which the upsample adaptive spatial fusion module (UAF) balances the different feature information by assigning weights to different layers. Finally, a location refinement module (LRM) is constructed to make DMNet pay attention to the boundary details. We compared DMNet with other classical object-detection models on the COD10K dataset. Experimental results indicated that DMNet outperformed others, achieving optimal effects across six evaluation metrics and significantly enhancing detection accuracy. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

20 pages, 7583 KiB  
Article
Object/Scene Recognition Based on a Directional Pixel Voting Descriptor
by Abiel Aguilar-González, Alejandro Medina Santiago and J. A. de Jesús Osuna-Coutiño
Appl. Sci. 2024, 14(18), 8187; https://doi.org/10.3390/app14188187 - 11 Sep 2024
Viewed by 670
Abstract
Detecting objects in images is crucial for several applications, including surveillance, autonomous navigation, and augmented reality. Although AI-based approaches such as Convolutional Neural Networks (CNNs) have proven highly effective in object detection, it is difficult to generalize an AI model to scenarios where the objects to be recognized are unknown. By contrast, feature-based approaches like SIFT, SURF, and ORB can search for any object but have limitations under complex visual variations. In this work, we introduce a novel edge-based object/scene recognition method. We propose that utilizing feature edges, instead of feature points, offers high performance under complex visual variations. Our primary contribution is a directional pixel voting descriptor based on image segments. Experimental results are promising; compared to previous approaches, ours demonstrates superior performance under complex visual variations and high processing speed. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

17 pages, 5025 KiB  
Article
Adaptive Channel-Enhanced Graph Convolution for Skeleton-Based Human Action Recognition
by Xiao-Wei Han, Xing-Yu Chen, Ying Cui, Qiu-Yang Guo and Wen Hu
Appl. Sci. 2024, 14(18), 8185; https://doi.org/10.3390/app14188185 - 11 Sep 2024
Viewed by 705
Abstract
Obtaining discriminative joint features is crucial for skeleton-based human action recognition. Current models mainly focus on the research of skeleton topology encoding. However, their predefined topology is the same and fixed for all action samples, making it challenging to obtain discriminative joint features. Although some studies have considered the complex non-natural connection relationships between joints, the existing methods cannot fully capture this complexity by using high-order adjacency matrices or adding trainable parameters and instead increase the computation parameters. Therefore, this study constructs a novel adaptive channel-enhanced graph convolution (ACE-GCN) model for human action recognition. The model generates similar and affinity attention maps by encoding channel attention in the input features. These maps are complementarily applied to the input feature map and graph topology, which can realize the refinement of joint features and construct an adaptive and non-shared channel-based adjacency matrix. This method of constructing the adjacency matrix improves the model’s capacity to capture intricate non-natural connections between joints, prevents the accumulation of unnecessary information, and minimizes the number of computational parameters. In addition, integrating the Edgeconv module into a multi-branch aggregation improves the model’s ability to aggregate different scale and temporal features. Ultimately, comprehensive experiments were carried out on NTU-RGB+D 60 and NTU-RGB+D 120, which are two substantial datasets. On the NTU RGB+D 60 dataset, the accuracy of human action recognition was 92% (X-Sub) and 96.3% (X-View). The model achieved an accuracy of 96.6% on the NW-UCLA dataset. The experimental results confirm that the ACE-GCN exhibits superior recognition accuracy and lower computing complexity compared to current methodologies. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

20 pages, 22277 KiB  
Article
Attention-Based Spatiotemporal-Aware Network for Fine-Grained Visual Recognition
by Yili Ren, Ruidong Lu, Guan Yuan, Dashuai Hao and Hongjue Li
Appl. Sci. 2024, 14(17), 7755; https://doi.org/10.3390/app14177755 - 2 Sep 2024
Viewed by 951
Abstract
On public benchmarks, current macro facial expression recognition technologies have achieved significant success. However, in real-life scenarios, individuals may attempt to conceal their true emotions. Conventional expression recognition often overlooks subtle facial changes, necessitating more fine-grained micro-expression recognition techniques. Unlike ordinary facial expressions, micro-expressions have weak intensity and short duration, which are the two main obstacles to perceiving and interpreting them correctly. Meanwhile, correlations between pixels of visual data in spatial and channel dimensions are ignored in most existing methods. In this paper, we propose a novel network structure, the Attention-based Spatiotemporal-aware network (ASTNet), for micro-expression recognition. In ASTNet, we combine ResNet and ConvLSTM as a holistic framework (ResNet-ConvLSTM) to extract the spatial and temporal features simultaneously. Moreover, we integrate two levels of attention mechanisms, channel-level attention and spatial-level attention, into the ResNet-ConvLSTM. Channel-level attention is used to discriminate the importance of different channels because their contributions to the overall representation of a micro-expression vary. Spatial-level attention is leveraged to dynamically estimate weights for different regions because regions differ in how strongly they reflect a micro-expression. Extensive experiments conducted on two benchmark datasets demonstrate that ASTNet achieves performance improvements of 4.25–16.02% and 0.79–12.93% over several state-of-the-art methods. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

13 pages, 5007 KiB  
Article
Infrared Image Enhancement Method of Substation Equipment Based on Self-Attention Cycle Generative Adversarial Network (SA-CycleGAN)
by Yuanbin Wang and Bingchao Wu
Electronics 2024, 13(17), 3376; https://doi.org/10.3390/electronics13173376 - 26 Aug 2024
Viewed by 910
Abstract
During the acquisition of infrared images in substations, low-quality images with poor contrast, blurred details, and missing texture information frequently appear, which adversely affects subsequent advanced visual tasks. To address this issue, this paper proposes an infrared image enhancement algorithm for substation equipment based on a self-attention cycle generative adversarial network (SA-CycleGAN). The proposed algorithm incorporates a self-attention mechanism into the CycleGAN model’s transcoding network to improve the mapping ability of infrared image information, enhance image contrast, and reduce the number of model parameters. The addition of an efficient local attention mechanism (EAL) and a feature pyramid structure within the encoding network enhances the generator’s ability to extract features and texture information from small targets in infrared substation equipment images, effectively improving image details. In the discriminator part, the model’s performance is further enhanced by constructing a two-channel feature network. To accelerate the model’s convergence, the loss function of the original CycleGAN is optimized. Compared to several mainstream image enhancement algorithms, the proposed algorithm improves the quality of low-quality infrared images by an average of 10.91% in color degree, 18.89% in saturation, and 29.82% in feature similarity indices. Additionally, the number of parameters in the proposed algorithm is reduced by 37.89% compared to the original model. Finally, the effectiveness of the proposed method in improving recognition accuracy is validated by the Centernet target recognition algorithm. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

26 pages, 9607 KiB  
Article
A Global Spatial-Spectral Feature Fused Autoencoder for Nonlinear Hyperspectral Unmixing
by Mingle Zhang, Mingyu Yang, Hongyu Xie, Pinliang Yue, Wei Zhang, Qingbin Jiao, Liang Xu and Xin Tan
Remote Sens. 2024, 16(17), 3149; https://doi.org/10.3390/rs16173149 - 26 Aug 2024
Viewed by 882
Abstract
Hyperspectral unmixing (HU) aims to decompose mixed pixels into a set of endmembers and corresponding abundances. Deep learning-based HU methods are currently a hot research topic, but most existing unmixing methods still rely on per-pixel training or employ convolutional neural networks (CNNs), which overlook the non-local correlations of materials and spectral characteristics. Furthermore, current research mainly focuses on linear mixing models, which limits the feature extraction capability of deep encoders and further improvement in unmixing accuracy. In this paper, we propose a nonlinear unmixing network capable of extracting global spatial-spectral features. The network is designed based on an autoencoder architecture, where a dual-stream CNN is employed in the encoder to separately extract spectral and local spatial information. The extracted features are then fused together to form a more complete representation of the input data. Subsequently, a linear projection-based multi-head self-attention mechanism is applied to capture global contextual information, allowing for comprehensive spatial information extraction while maintaining lightweight computation. To achieve better reconstruction performance, a model-free nonlinear mixing approach is adopted to enhance the model’s universality, with the mixing model learned entirely from the data. Additionally, an initialization method based on endmember bundles is utilized to reduce interference from outliers and noise. Comparative results on real datasets against several state-of-the-art unmixing methods demonstrate the superiority of the proposed approach. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

21 pages, 8951 KiB  
Article
Radiation Anomaly Detection of Sub-Band Optical Remote Sensing Images Based on Multiscale Deep Dynamic Fusion and Adaptive Optimization
by Jinlong Ci, Hai Tan, Haoran Zhai and Xinming Tang
Remote Sens. 2024, 16(16), 2953; https://doi.org/10.3390/rs16162953 - 12 Aug 2024
Viewed by 1012
Abstract
Radiation anomalies in optical remote sensing images frequently occur due to electronic issues within the image sensor or data transmission errors. These radiation anomalies can be categorized into several types, including CCD, StripeNoise, RandomCode1, RandomCode2, ImageMissing, and Tap. To retain as much image data with minimal radiation issues as possible, this paper adopts a self-made radiation dataset and proposes a FlexVisionNet-YOLO network to detect radiation anomalies more accurately. Firstly, RepViT is used as the backbone network with a vision transformer architecture to better capture global and local features. Its multiscale feature fusion mechanism efficiently handles targets of different sizes and shapes, enhancing the detection ability for radiation anomalies. Secondly, a feature depth fusion network is proposed in the Feature Fusion part, which significantly improves the flexibility and accuracy of feature fusion and thus enhances the detection and classification performance on complex remote sensing images. Finally, Inner-CIoU is used in the Head part for edge regression, which significantly improves the localization accuracy by finely adjusting the target edges; Slide-Loss is used for classification loss, which enhances the classification robustness by dynamically adjusting the category probabilities and markedly improves the classification accuracy, especially on sample-imbalanced datasets. Experimental results show that, compared to YOLOv8, the proposed FlexVisionNet-YOLO method improves precision, recall, mAP0.5, and mAP0.5:0.9 by 3.5%, 7.1%, 4.4%, and 13.6%, respectively. Its effectiveness in detecting radiation anomalies surpasses that of other models. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

25 pages, 11593 KiB  
Article
An Effective and Lightweight Full-Scale Target Detection Network for UAV Images Based on Deformable Convolutions and Multi-Scale Contextual Feature Optimization
by Wanwan Yu, Junping Zhang, Dongyang Liu, Yunqiao Xi and Yinhu Wu
Remote Sens. 2024, 16(16), 2944; https://doi.org/10.3390/rs16162944 - 11 Aug 2024
Viewed by 2177
Abstract
Currently, target detection on unmanned aerial vehicle (UAV) images is a research hotspot. Due to the significant scale variability of targets and the interference of complex backgrounds, current target detection models face challenges when applied to UAV images. To address these issues, we designed an effective and lightweight full-scale target detection network, FSTD-Net. The design of FSTD-Net is based on three principal aspects. Firstly, to optimize the extracted target features at different scales while minimizing background noise and sparse feature representations, a multi-scale contextual information extraction module (MSCIEM) is developed. The multi-scale information extraction module (MSIEM) in MSCIEM can better capture multi-scale features, and the contextual information extraction module (CIEM) in MSCIEM is designed to capture long-range contextual information. Secondly, to better adapt to various target shapes at different scales in UAV images, we propose the feature extraction module fitting different shapes (FEMFDS), based on deformable convolutions. Finally, considering that low-level features contain rich details, a low-level feature enhancement branch (LLFEB) is designed. The experiments demonstrate that, compared to the second-best model, the proposed FSTD-Net achieves improvements of 3.8%, 2.4%, and 2.0% in AP50, AP, and AP75, respectively, on the VisDrone2019 dataset. Additionally, FSTD-Net achieves enhancements of 3.4%, 1.7%, and 1% on the UAVDT dataset. Our proposed FSTD-Net has better detection performance compared to state-of-the-art detection models. The experimental results indicate the effectiveness of the FSTD-Net for target detection in UAV images. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

19 pages, 5414 KiB  
Article
Implicit Sharpness-Aware Minimization for Domain Generalization
by Mingrong Dong, Yixuan Yang, Kai Zeng, Qingwang Wang and Tao Shen
Remote Sens. 2024, 16(16), 2877; https://doi.org/10.3390/rs16162877 - 6 Aug 2024
Viewed by 1286
Abstract
Domain generalization (DG) aims to learn knowledge from multiple related domains to achieve a robust generalization performance in unseen target domains, which is an effective approach to mitigate domain shift in remote sensing image classification. Although the sharpness-aware minimization (SAM) method enhances DG capability and improves remote sensing image classification performance by promoting the convergence of the loss minimum to a flatter loss surface, the perturbation loss (maximum loss within the neighborhood of a local minimum) of SAM fails to accurately measure the true sharpness of the loss landscape. Furthermore, its variants often overlook gradient conflicts, thereby limiting further improvement in DG performance. In this paper, we introduce implicit sharpness-aware minimization (ISAM), a novel method that addresses the deficiencies of SAM and mitigates gradient conflicts. Specifically, we demonstrate that the discrepancy in training loss during gradient ascent or descent serves as an equivalent measure of the dominant eigenvalue of the Hessian matrix. This discrepancy provides a reliable measure for sharpness. ISAM effectively reduces sharpness and mitigates potential conflicts between gradients by implicitly minimizing the discrepancy between training losses while ensuring a sufficiently low minimum through minimizing perturbation loss. Extensive experiments and analyses demonstrate that ISAM significantly enhances the model’s generalization ability on remote sensing and DG datasets, outperforming existing state-of-the-art methods. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

17 pages, 7392 KiB  
Article
Lightweight Water Surface Object Detection Network for Unmanned Surface Vehicles
by Chenlong Li, Lan Wang, Yitong Liu and Shuaike Zhang
Electronics 2024, 13(15), 3089; https://doi.org/10.3390/electronics13153089 - 4 Aug 2024
Viewed by 1357
Abstract
The detection algorithms for water surface objects considerably assist unmanned surface vehicles in rapidly perceiving their surrounding environment, providing essential environmental information and evaluating object attributes. This study proposes a lightweight water surface target detection algorithm called YOLO-WSD (water surface detection), based on YOLOv8n, to address the need for real-time, high-precision, and lightweight target detection algorithms that can adapt to the rapid changes in the surrounding environment during specific tasks. Initially, we designed the C2F-E module, enriched in gradient flow compared to the conventional C2F module, enabling the backbone network to extract richer multi-level features while maintaining lightness. Additionally, this study redesigns the feature fusion network structure by introducing low-level features and achieving multi-level fusion to enhance the network’s capability of integrating multiple levels. Meanwhile, it investigates the impact of channel number differences in the Concat module fusion on model performance, thereby optimizing the neural network structure. Lastly, it introduces the WIOU localization loss function to bolster model robustness. Experiments demonstrated that YOLO-WSD achieves a 4.6% and 3.4% improvement in mAP0.5 on the water surface object detection dataset and Seaship public dataset, respectively, with recall rates improving by 5.4% and 8.5% relative to the baseline YOLOv8n model. The model’s parameter size is 3.3 M. YOLO-WSD exhibits superior performance compared to other mainstream lightweight algorithms. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)
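
The mAP0.5 figures quoted in the abstract above follow the usual convention of counting a predicted box as a match when its intersection over union (IoU) with a ground-truth box is at least 0.5. The following is a minimal sketch of that overlap test only, not the authors' evaluation code; the (x1, y1, x2, y2) box layout and the names `iou` and `is_match` are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred_box, gt_box, threshold=0.5):
    """True when the prediction overlaps the ground truth enough to count."""
    return iou(pred_box, gt_box) >= threshold
```

A full mAP computation additionally ranks detections by confidence and integrates the precision-recall curve per class; that part is omitted here.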

23 pages, 3243 KiB  
Article
StarCAN-PFD: An Efficient and Simplified Multi-Scale Feature Detection Network for Small Objects in Complex Scenarios
by Zongxuan Chai, Tingting Zheng and Feixiang Lu
Electronics 2024, 13(15), 3076; https://doi.org/10.3390/electronics13153076 - 3 Aug 2024
Cited by 1 | Viewed by 1581
Abstract
Small object detection in traffic sign applications often faces challenges like complex backgrounds, blurry samples, and multi-scale variations. Existing solutions tend to complicate the algorithms. In this study, we designed an efficient and simple algorithm network called StarCAN-PFD, based on the single-stage YOLOv8 framework, to accurately recognize small objects in complex scenarios. We proposed the StarCAN feature extraction network, which was enhanced with the Context Anchor Attention (CAA). We designed the Pyramid Focus and Diffusion Network (PFDNet) to address multi-scale information loss and developed the Detail-Enhanced Conv Shared Detect (DESDetect) module to improve the recognition of complex samples while keeping the network lightweight. Experiments on the CCTSDB dataset validated the effectiveness of each module. Compared to YOLOv8, our algorithm improved mAP@0.5 by 4%, reduced the model size to less than half, and demonstrated better performance on different traffic sign datasets. It excels at detecting small traffic sign targets in complex scenes, including challenging samples such as blurry, low-light night, occluded, and overexposed conditions, showcasing strong generalization ability. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

19 pages, 8202 KiB  
Article
An Accurate and Robust Multimodal Template Matching Method Based on Center-Point Localization in Remote Sensing Imagery
by Jiansong Yang, Yongbin Zheng, Wanying Xu, Peng Sun and Shengjian Bai
Remote Sens. 2024, 16(15), 2831; https://doi.org/10.3390/rs16152831 - 1 Aug 2024
Viewed by 1264
Abstract
Deep learning-based template matching in remote sensing has received increasing research attention. Existing anchor box-based and anchor-free methods often suffer from low template localization accuracy in the presence of multimodal, nonrigid deformation and occlusion. To address this problem, we transform the template matching task into a center-point localization task for the first time and propose an end-to-end template matching method based on a novel fully convolutional Siamese network. Furthermore, we propose an adaptive shrinkage cross-correlation scheme, which improves the precision of template localization and alleviates the impact of background clutter without adding any parameters. We also design a scheme that leverages keypoint information to assist in locating the template center, thereby enhancing the precision of template localization. We construct a multimodal template matching dataset to verify the performance of the method in dealing with differences in view, scale, rotation and occlusion in practical application scenarios. Extensive experiments on a public dataset, OTB, the proposed dataset, as well as a remote sensing dataset, SEN1-2, demonstrate that our method achieves state-of-the-art performance. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

22 pages, 5295 KiB  
Article
Research on Clothing Image Retrieval Combining Topology Features with Color Texture Features
by Xu Zhang, Huadong Sun and Jian Ma
Mathematics 2024, 12(15), 2363; https://doi.org/10.3390/math12152363 - 29 Jul 2024
Viewed by 938
Abstract
Topological data analysis (TDA) is a method of feature extraction based on the topological structure of data. Image feature extraction using TDA has been shown to be superior to other feature extraction techniques in some problems, so it has recently received the attention of researchers. In this paper, clothing image retrieval based on topology features and color texture features is studied. The main work is as follows: (1) Based on the analysis of image data by persistent homology, a method for constructing a topology feature histogram is proposed, which can represent the regularities of an image's local topological data and make up for the shortcomings of traditional feature extraction methods. (2) An improvement to the Wasserstein distance is presented, and a similarity measure named topology feature histogram distance is proposed. (3) Because a single feature suffers from problems such as an incomplete description of image information and poor robustness, clothing image retrieval is realized by combining the topology feature with the color texture feature. The experimental results show that the proposed algorithm, namely topology feature histogram + corresponding distance, can effectively reduce the computation time while preserving accuracy. Compared with the method using only color texture, the top-5 retrieval rate is improved by 14.9%. Compared with the method using cubic complex + Wasserstein distance, the top-5 retrieval rate is improved by 3.8%, while saving 3.93 s of computation time. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

30 pages, 15406 KiB  
Article
Addressing Demographic Bias in Age Estimation Models through Optimized Dataset Composition
by Nenad Panić, Marina Marjanović and Timea Bezdan
Mathematics 2024, 12(15), 2358; https://doi.org/10.3390/math12152358 - 28 Jul 2024
Cited by 1 | Viewed by 947
Abstract
Bias in facial recognition systems often results in unequal performance across demographic groups. This study addresses this by investigating how dataset composition affects the performance and bias of age estimation models across ethnicities. We fine-tuned pre-trained Convolutional Neural Networks (CNNs) like VGG19 on the diverse UTKFace dataset (23,705 samples: 10,078 White, 4526 Black, 3434 Asian) and APPA-REAL (7691 samples: 6686 White, 231 Black, 674 Asian). Our approach involved adjusting dataset compositions by oversampling minority groups or reducing samples from overrepresented groups to mitigate bias. We conducted experiments to identify the optimal dataset composition that minimizes performance disparities among ethnic groups. The primary performance metric was Mean Absolute Error (MAE), measuring the average magnitude of prediction errors. We also analyzed the standard deviation of MAE across ethnic groups to assess performance consistency and equity. Our findings reveal that simple oversampling of minority groups does not ensure equitable performance. Instead, systematic adjustments, including reducing samples from overrepresented groups, led to more balanced performance and lower MAE standard deviations across ethnicities. These insights highlight the importance of tailored dataset adjustments and suggest exploring advanced data processing methods and algorithmic tweaks to enhance fairness and accuracy in facial recognition technologies. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)
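
The two metrics named in the abstract above, Mean Absolute Error per ethnic group and the standard deviation of those per-group MAEs, can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names, the toy data, and the choice of the population standard deviation (`pstdev`) are assumptions.

```python
from statistics import mean, pstdev


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted ages."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def group_mae_report(ages_true, ages_pred, groups):
    """Return ({group: MAE}, spread), where spread is the standard
    deviation of the per-group MAEs (lower means more even performance)."""
    per_group = {}
    for g in sorted(set(groups)):
        idx = [i for i, label in enumerate(groups) if label == g]
        per_group[g] = mae([ages_true[i] for i in idx],
                           [ages_pred[i] for i in idx])
    spread = pstdev(list(per_group.values()))
    return per_group, spread


# Toy example with two hypothetical groups "A" and "B".
per_group, spread = group_mae_report(
    ages_true=[30, 40, 50, 20],
    ages_pred=[32, 38, 55, 20],
    groups=["A", "A", "B", "B"],
)
```

Comparing `spread` across candidate dataset compositions is one way to express the abstract's notion of performance consistency across groups.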

15 pages, 5580 KiB  
Article
DANet: A Domain Alignment Network for Low-Light Image Enhancement
by Qiao Li, Bin Jiang, Xiaochen Bo, Chao Yang and Xu Wu
Electronics 2024, 13(15), 2954; https://doi.org/10.3390/electronics13152954 - 26 Jul 2024
Viewed by 960
Abstract
We propose restoring low-light images suffering from severe degradation using a deep-learning approach. A significant domain gap exists between low-light and real images, which previous methods have failed to address with domain alignment. To tackle this, we introduce a domain alignment network leveraging dual encoders and a domain alignment loss. Specifically, we train dual encoders to transform low-light and real images into two latent spaces and align these spaces using a domain alignment loss. Additionally, we design a Convolution-Transformer module (CTM) for the encoding process to comprehensively extract both local and global features. Experimental results on four benchmark datasets demonstrate that our proposed Domain Alignment Network (DANet) outperforms state-of-the-art methods. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

17 pages, 1884 KiB  
Review
Image-Based 3D Reconstruction in Laparoscopy: A Review Focusing on the Quantitative Evaluation by Applying the Reconstruction Error
by Birthe Göbel, Alexander Reiterer and Knut Möller
J. Imaging 2024, 10(8), 180; https://doi.org/10.3390/jimaging10080180 - 24 Jul 2024
Viewed by 1587
Abstract
Image-based 3D reconstruction enables laparoscopic applications such as image-guided navigation and (autonomous) robot-assisted interventions, which require high accuracy. The purpose of this review is to present the accuracy of different techniques in order to identify the most promising ones. A systematic literature search of PubMed and Google Scholar covering 2015 to 2023 was conducted following the framework of “Review articles: purpose, process, and structure”. Articles were considered if they presented a quantitative evaluation (root mean squared error and mean absolute error) of the reconstruction error (Euclidean distance between real and reconstructed surface). The search returned 995 articles, which were reduced to 48 after applying exclusion criteria. From these, a reconstruction error data set was generated for the techniques of stereo vision, Shape-from-Motion, Simultaneous Localization and Mapping, deep learning, and structured light. The reconstruction error varies from below one millimeter to more than ten millimeters, with deep learning and Simultaneous Localization and Mapping delivering the best results under intraoperative conditions. The high variance arises from different experimental conditions. In conclusion, submillimeter accuracy is challenging, but promising image-based 3D reconstruction techniques could be identified. For future research, we recommend computing the reconstruction error for comparison purposes and using ex/in vivo organs as reference objects for realistic experiments. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)
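
The reconstruction error used in the review above is the Euclidean distance between the real and the reconstructed surface, summarized as a mean absolute error and a root mean squared error. A minimal sketch of those two summaries follows; it assumes point correspondences are already known (in practice they usually have to be found, for example by nearest-neighbour search against the reference surface), and the function name and sample coordinates are hypothetical.

```python
import math


def reconstruction_errors(reference_pts, reconstructed_pts):
    """Return (MAE, RMSE) of point-to-point Euclidean distances.

    Both inputs are equally long sequences of 3D points in the same
    unit (e.g. millimeters), paired by index.
    """
    dists = [math.dist(p, q) for p, q in zip(reference_pts, reconstructed_pts)]
    mae = sum(dists) / len(dists)
    rmse = math.sqrt(sum(d * d for d in dists) / len(dists))
    return mae, rmse


# Two reference points and their reconstructions, off by 3 mm and 4 mm.
mae_mm, rmse_mm = reconstruction_errors(
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 3.0), (1.0, 4.0, 0.0)],
)
```

RMSE weights large deviations more heavily than MAE, so reporting both gives a rough sense of whether the error is uniform or dominated by outliers.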

14 pages, 4700 KiB  
Article
Few-Shot Conditional Learning: Automatic and Reliable Device Classification for Medical Test Equipment
by Eva Pachetti, Giulio Del Corso, Serena Bardelli and Sara Colantonio
J. Imaging 2024, 10(7), 167; https://doi.org/10.3390/jimaging10070167 - 13 Jul 2024
Cited by 1 | Viewed by 917
Abstract
The limited availability of specialized image databases (particularly in hospitals, where tools vary between providers) makes it difficult to train deep learning models. This paper presents a few-shot learning methodology that uses a pre-trained ResNet integrated with an encoder as a backbone to encode conditional shape information for the classification of neonatal resuscitation equipment from fewer than 100 natural images. The model is also strengthened by incorporating a reliability score, which enriches the prediction with an estimation of classification reliability. The model, whose performance is cross-validated, reached a median accuracy of over 99% (and a lower limit of 73.4% for the least accurate model/fold) using only 87 meta-training images. During the test phase on complex natural images, performance was slightly degraded (median accuracy 87.25%) due to a sub-optimal segmentation strategy (FastSAM) required to keep inference real-time. This methodology proves well suited to applying complex classification models to contexts (such as neonatal resuscitation) that are not covered by public databases. Improvements to the automatic segmentation strategy prior to the extraction of conditional information will allow a natural application in simulation and hospital settings. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

19 pages, 13763 KiB  
Article
AMTT: An End-to-End Anchor-Based Multi-Scale Transformer Tracking Method
by Yitao Zheng, Honggui Deng, Qiguo Xu and Ni Li
Electronics 2024, 13(14), 2710; https://doi.org/10.3390/electronics13142710 - 11 Jul 2024
Viewed by 759
Abstract
Most current trackers utilize only the highest-level features to achieve faster tracking performance, making it difficult to achieve accurate tracking of small and low-resolution objects. To address this problem, we propose an end-to-end anchor-based multi-scale transformer tracking (AMTT) approach to improve the tracking performance of the network for objects of different sizes. First, we design a multi-scale feature encoder based on the deformable transformer, which better fuses the multilayer template features and search features through the self-enhancement module and cross-enhancement module to improve the attention of the whole network to objects of different sizes. Then, to reduce the computational overhead of the decoder while further enhancing the multi-scale features, we design a feature focusing block to compress the number of coded features. Finally, we introduce a feature anchor into the traditional decoder and design an anchor-based decoder, which utilizes the feature anchor to guide the decoder to adapt to changes in object scale and achieve more accurate tracking performance. To confirm the effectiveness of our proposed method, we conduct a series of experiments on different datasets such as UAV123, OTB100 and GOT10k. The results show that our adopted method exhibits highly competitive performance compared to the state-of-the-art methods in recent years. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

12 pages, 3830 KiB  
Article
Comparative Evaluation of Convolutional Neural Network Object Detection Algorithms for Vehicle Detection
by Saieshan Reddy, Nelendran Pillay and Navin Singh
J. Imaging 2024, 10(7), 162; https://doi.org/10.3390/jimaging10070162 - 5 Jul 2024
Cited by 2 | Viewed by 1653
Abstract
The introduction of Convolutional Neural Networks (CNNs) revolutionized object detection in the field of computer vision. This article explores the architectural intricacies, methodological differences, and performance characteristics of three CNN-based object detection algorithms, namely Faster Region-Based Convolutional Neural Network (Faster R-CNN), You Only Look Once v3 (YOLO v3), and Single Shot MultiBox Detector (SSD), in the specific application domain of vehicle detection. The findings of this study indicate that the SSD object detection algorithm outperforms the other approaches in terms of both mean average precision and processing speed. The Faster R-CNN approach detected objects in images with an average detection time of 5.1 s, achieving a mean average precision of 0.76 and an average loss of 0.467. YOLO v3 detected objects with an average detection time of 1.16 s, achieving a mean average precision of 0.81 with an average loss of 1.183. In contrast, SSD detected objects with an average detection time of 0.5 s, exhibiting the highest mean average precision of 0.92 despite having a higher average loss of 2.625. Notably, all three object detectors achieved an accuracy exceeding 99%. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

18 pages, 1534 KiB  
Article
PointBLIP: Zero-Training Point Cloud Classification Network Based on BLIP-2 Model
by Yunzhe Xiao, Yong Dou and Shaowu Yang
Remote Sens. 2024, 16(13), 2453; https://doi.org/10.3390/rs16132453 - 3 Jul 2024
Viewed by 1159
Abstract
Leveraging the open-world understanding capacity of large-scale visual-language pre-trained models has become a hot spot in point cloud classification. Recent approaches rely on transferable visual-language pre-trained models, classifying point clouds by projecting them into 2D images and evaluating consistency with textual prompts. These methods benefit from the robust open-world understanding capabilities of visual-language pre-trained models and require no additional training. However, they face several challenges summarized as prompt ambiguity, image domain gap, view weight confusion, and feature deviation. In response to these challenges, we propose PointBLIP, a zero-training point cloud classification network based on the recently introduced BLIP-2 visual-language model. PointBLIP is adept at processing similarities between multi-images and multi-prompts. We separately introduce a novel method for point cloud zero-shot and few-shot classification, which involves comparing multiple features to achieve effective classification. Simultaneously, we enhance the input data quality for both the image and text sides of PointBLIP. In point cloud zero-shot classification tasks, we outperform state-of-the-art methods on three benchmark datasets. For few-shot classification tasks, to the best of our knowledge, we present the first zero-training few-shot point cloud method, surpassing previous works under the same conditions and showcasing comparable performance to full-training methods. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

24 pages, 8813 KiB  
Article
MSSD-Net: Multi-Scale SAR Ship Detection Network
by Xi Wang, Wei Xu, Pingping Huang and Weixian Tan
Remote Sens. 2024, 16(12), 2233; https://doi.org/10.3390/rs16122233 - 19 Jun 2024
Cited by 1 | Viewed by 1114
Abstract
In recent years, the development of neural networks has significantly advanced their application in Synthetic Aperture Radar (SAR) ship target detection for maritime traffic control and ship management. However, traditional neural network architectures are often complex and resource intensive, making them unsuitable for deployment on artificial satellites. To address this issue, this paper proposes a lightweight neural network: the Multi-Scale SAR Ship Detection Network (MSSD-Net). Initially, the MobileOne network module is employed to construct the backbone network for feature extraction from SAR images. Subsequently, a Multi-Scale Coordinate Attention (MSCA) module is designed to enhance the network’s capability to process contextual information. This is followed by the integration of features across different scales using an FPN + PAN structure. Lastly, an Anchor-Free approach is utilized for the rapid detection of ship targets. To evaluate the performance of MSSD-Net, we conducted extensive experiments on the Synthetic Aperture Radar Ship Detection Dataset (SSDD) and SAR-Ship-Dataset. Our experimental results demonstrate that MSSD-Net achieves a mean average precision (mAP) of 98.02% on the SSDD while maintaining a compact model size of only 1.635 million parameters. This indicates that MSSD-Net effectively reduces model complexity without compromising its ability to achieve high accuracy in object detection tasks. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

19 pages, 7213 KiB  
Article
Embedded Zero-Shot Image Classification Based on Bidirectional Feature Mapping
by Huadong Sun, Zhibin Zhen, Yinghui Liu, Xu Zhang, Xiaowei Han and Pengyi Zhang
Appl. Sci. 2024, 14(12), 5230; https://doi.org/10.3390/app14125230 - 17 Jun 2024
Viewed by 854
Abstract
Zero-shot image classification aims to explore the semantic information shared between seen and unseen classes through visual features and auxiliary information and, based on this semantic information, to transfer knowledge from seen to unseen classes in order to classify images of unseen classes. Previous zero-shot work has either not extracted enough features to express the relationship between the sample classes or has only used a single feature mapping method, which cannot fully explore the information contained in the features and the connection between visual and semantic features. To address these problems, this paper proposes an embedded zero-shot image classification model based on bidirectional feature mapping (BFM). It mainly contains a feature space mapping module, which is dominated by a bidirectional feature mapping network and supplemented with a mapping network from the visual to the category label semantic feature space. Attention mechanisms based on attribute guidance and visual guidance are further introduced to weight the features, reducing the difference between visual and semantic features and thereby alleviating the modal difference problem. A category calibration loss is then used to assign a larger weight to the unseen classes to alleviate the seen class bias problem. The proposed BFM model was evaluated on three public datasets, CUB, SUN, and AWA2, achieving accuracies of 71.9%, 62.8%, and 69.3% under the traditional zero-shot setting and 61.6%, 33.2%, and 66.6% under the generalized zero-shot setting, respectively. The experimental results verify the superiority of the BFM model in the field of zero-shot image classification. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

18 pages, 5611 KiB  
Article
A Visible and Synthetic Aperture Radar Image Fusion Algorithm Based on a Transformer and a Convolutional Neural Network
by Liushun Hu, Shaojing Su, Zhen Zuo, Junyu Wei, Siyang Huang, Zongqing Zhao, Xiaozhong Tong and Shudong Yuan
Electronics 2024, 13(12), 2365; https://doi.org/10.3390/electronics13122365 - 17 Jun 2024
Viewed by 1037
Abstract
This paper proposes a visible and Synthetic Aperture Radar (SAR) image fusion algorithm based on a Transformer and a Convolutional Neural Network (CNN). First, the Restormer Block is used to extract cross-modal shallow features. Then, we introduce an improved Transformer–CNN Feature Extractor (TCFE) with a two-branch residual structure. This includes a Transformer branch that introduces the Lite Transformer (LT) and DropKey for extracting global features and a CNN branch that introduces the Convolutional Block Attention Module (CBAM) for extracting local features. Finally, the fused image is output based on the global features extracted by the Transformer branch and the local features extracted by the CNN branch. The experiments show that the proposed algorithm can effectively extract and fuse the global and local features of visible and SAR images, so that high-quality fused images can be obtained. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

18 pages, 28354 KiB  
Article
A Hybrid Domain Color Image Watermarking Scheme Based on Hyperchaotic Mapping
by Yumin Dong, Rui Yan, Qiong Zhang and Xuesong Wu
Mathematics 2024, 12(12), 1859; https://doi.org/10.3390/math12121859 - 14 Jun 2024
Viewed by 803
Abstract
In the field of image watermarking technology, it is very important to balance imperceptibility, robustness and embedding capacity. To address this key problem, this paper proposes a new adaptive color image watermarking scheme based on discrete wavelet transform (DWT), discrete cosine transform (DCT) and singular value decomposition (SVD). To improve the security of the watermark, we use Lorenz hyperchaotic mapping to encrypt the watermark image. We adaptively determine the embedding factor by calculating the Bhattacharyya distance between the cover image and the watermark image, and combine this with the Alpha blending technique to embed the watermark image into the Y component of the YCbCr color space, which enhances the imperceptibility of the algorithm. The experimental results show that the average PSNR of our scheme is 45.9382 dB, and the SSIM is 0.9986. Extensive experiments and comparative analysis show that the scheme has good imperceptibility and robustness, indicating that it achieves a good balance between imperceptibility, robustness and embedding capacity. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)
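
Two quantities mentioned in the abstract above have standard textbook definitions that are easy to state in code: the PSNR used to report imperceptibility and the Bhattacharyya distance between two normalized histograms. The sketch below illustrates only those definitions; how the authors map the distance to an embedding factor is not described in the abstract and is not reproduced here. The function names and the flat-list image representation are assumptions.

```python
import math


def psnr(original, distorted, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized
    flat sequences of pixel values."""
    mse = sum((a - b) ** 2 for a, b in zip(original, distorted)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def bhattacharyya_distance(p, q):
    """Bhattacharyya distance between two histograms that each sum to 1."""
    coefficient = sum(math.sqrt(a * b) for a, b in zip(p, q))
    if coefficient == 0:
        return math.inf  # no overlapping bins
    return -math.log(coefficient)


# A distortion of +/-10 grey levels on every pixel gives an MSE of 100.
example_psnr = psnr([100, 100], [110, 90])
```

A higher PSNR means the watermarked image is closer to the cover image, while a smaller Bhattacharyya distance means the two histograms are more alike.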

26 pages, 59985 KiB  
Article
Depth-Guided Dehazing Network for Long-Range Aerial Scenes
by Yihu Wang, Jilin Zhao, Liangliang Yao and Changhong Fu
Remote Sens. 2024, 16(12), 2081; https://doi.org/10.3390/rs16122081 - 8 Jun 2024
Viewed by 764
Abstract
Over the past few years, the applications of unmanned aerial vehicles (UAVs) have greatly increased. However, the decrease in clarity in hazy environments is an important constraint on their further development. Current research on image dehazing mainly focuses on normal scenes at close range or mid-range, while ignoring long-range scenes such as aerial perspective. Furthermore, based on the atmospheric scattering model, the inclusion of depth information is essential for the procedure of image dehazing, especially when dealing with images that exhibit substantial variations in depth. However, most existing models neglect this important information. Consequently, these state-of-the-art (SOTA) methods perform inadequately in dehazing when applied to long-range images. For the purpose of dealing with the above challenges, we propose the construction of a depth-guided dehazing network designed specifically for long-range aerial scenes. Initially, we introduce the depth prediction subnetwork to accurately extract depth information from long-range aerial images, taking into account the substantial variance in haze density. Subsequently, we propose the depth-guided attention module, which integrates a depth map with dehazing features through the attention mechanism, guiding the dehazing process and enabling the effective removal of haze in long-range areas. Furthermore, considering the unique characteristics of long-range aerial scenes, we introduce the UAV-HAZE dataset, specifically designed for training and evaluating dehazing methods in such scenarios. Finally, we conduct extensive experiments to test our method against several SOTA dehazing methods and demonstrate its superiority over others. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)
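
The abstract above leans on the atmospheric scattering model to argue that depth matters for dehazing. In its common form the hazy intensity is I = J·t + A·(1 − t), with transmission t = exp(−β·d) for scene depth d, clear radiance J, airlight A, and scattering coefficient β. The sketch below illustrates that physical model for a single pixel value only; it is not the authors' network, and the function names, parameter values, and the `t_min` clamp are assumptions.

```python
import math


def transmission(depth, beta):
    """Fraction of scene radiance that survives scattering over `depth`."""
    return math.exp(-beta * depth)


def add_haze(clear, depth, airlight, beta):
    """Forward model: I = J * t + A * (1 - t)."""
    t = transmission(depth, beta)
    return clear * t + airlight * (1.0 - t)


def remove_haze(hazy, depth, airlight, beta, t_min=1e-3):
    """Invert the model: J = (I - A) / t + A, with t clamped away from
    zero so that very distant pixels do not blow up numerically."""
    t = max(transmission(depth, beta), t_min)
    return (hazy - airlight) / t + airlight


hazy_pixel = add_haze(clear=0.2, depth=10.0, airlight=0.9, beta=0.05)
restored_pixel = remove_haze(hazy_pixel, depth=10.0, airlight=0.9, beta=0.05)
```

Because t decays exponentially with depth, distant pixels converge to the airlight value, which is why long-range aerial scenes are hard to restore without a depth estimate.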

20 pages, 28541 KiB  
Article
IFSrNet: Multi-Scale IFS Feature-Guided Registration Network Using Multispectral Image-to-Image Translation
by Bowei Chen, Li Chen, Umara Khalid and Shuai Zhang
Electronics 2024, 13(12), 2240; https://doi.org/10.3390/electronics13122240 - 7 Jun 2024
Viewed by 890
Abstract
Multispectral image registration is the process of aligning the spatial regions of two images with different distributions. One of the main challenges it faces is to resolve the severe inconsistencies between the reference and target images. This paper presents a novel multispectral image registration network, Multi-scale Intuitionistic Fuzzy Set Feature-guided Registration Network (IFSrNet), to address multispectral image registration. IFSrNet generates pseudo-infrared images from visible images using Cycle Generative Adversarial Network (CycleGAN), which is equipped with a multi-head attention module. An end-to-end registration network encodes the input multispectral images with intuitionistic fuzzification, which employs an improved feature descriptor—Intuitionistic Fuzzy Set–Scale-Invariant Feature Transform (IFS-SIFT)—to guide its operation. The results of the image registration will be presented in a direct output. For this task we have also designed specialised loss functions. The results of the experiment demonstrate that IFSrNet outperforms existing registration methods in the Visible–IR dataset. IFSrNet has the potential to be employed as a novel image-to-image translation paradigm. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

13 pages, 19038 KiB  
Article
Multi-Scale Feature Fusion Point Cloud Object Detection Based on Original Point Cloud and Projection
by Zhikang Zhang, Zhongjie Zhu, Yongqiang Bai, Yiwen Jin and Ming Wang
Electronics 2024, 13(11), 2213; https://doi.org/10.3390/electronics13112213 - 6 Jun 2024
Viewed by 1255
Abstract
Existing point cloud object detection algorithms struggle to effectively capture spatial features across different scales, often resulting in inadequate responses to changes in object size and limited feature extraction capabilities, thereby affecting detection accuracy. To solve this problem, we present a point cloud object detection method based on multi-scale feature fusion of the original point cloud and projection, which aims to improve the multi-scale performance and completeness of feature extraction in point cloud object detection. First, we designed a 3D feature extraction module based on the 3D Swin Transformer. This module pre-processes the point cloud using a 3D Patch Partition approach and employs a self-attention mechanism within a 3D sliding window, along with a downsampling strategy, to effectively extract features at different scales. At the same time, we convert the 3D point cloud to a 2D image using projection technology and extract 2D features using the Swin Transformer. A 2D/3D feature fusion module is then built to integrate 2D and 3D features at the channel level through point-by-point addition and vector concatenation to improve feature completeness. Finally, the integrated feature maps are fed into the detection head to facilitate efficient object detection. Experimental results show that our method has improved the average precision of vehicle detection by 1.01% on the KITTI dataset over three levels of difficulty compared to Voxel-RCNN. In addition, visualization analyses show that our proposed algorithm also exhibits superior performance in object detection. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

17 pages, 14025 KiB  
Article
Point Cloud Registration Algorithm Based on Adaptive Neighborhood Eigenvalue Loading Ratio
by Zhongping Liao, Tao Peng, Ruiqi Tang and Zhiguo Hao
Appl. Sci. 2024, 14(11), 4828; https://doi.org/10.3390/app14114828 - 3 Jun 2024
Viewed by 964
Abstract
Traditional iterative closest point (ICP) registration algorithms are sensitive to initial positions and easily fall into the trap of locally optimal solutions. To address this problem, a point cloud registration algorithm is put forward in this study based on adaptive neighborhood eigenvalue loading ratios. In the algorithm, the resolution of the point cloud is first calculated and used as an adaptive basis to determine the raster widths and radii of spherical neighborhoods in the raster filtering; then, adaptive raster filtering is applied to the point cloud for denoising, while the eigenvalue loading ratios of point neighborhoods are calculated to extract and match the contour feature points; subsequently, sample consensus initial alignment (SAC-IA) is used to carry out coarse registration; and finally, a fine registration is delivered with KD-tree-accelerated ICP. The experimental results of this study demonstrate that the feature points extracted with this method are highly representative while consuming only 35.6% of the time consumed by other feature point extraction algorithms. Additionally, in noisy and low-overlap scenarios, the registration error of this method can be controlled at a level of 0.1 mm, with the registration speed improved by 56% on average over that of other algorithms. Taken together, the method in this study can not only ensure strong robustness in registration but also deliver high registration accuracy and efficiency. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)
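The fine-registration stage named in this abstract, ICP, alternates between matching each source point to its nearest target point and solving for the rigid transform that best aligns the matched pairs. Below is a minimal 2D sketch of that loop under simplifying assumptions: brute-force nearest-neighbour search stands in for the KD-tree, no coarse SAC-IA stage is included, and all function names are illustrative rather than taken from the paper.

```python
import math

def nearest(p, cloud):
    """Brute-force nearest neighbour (a KD-tree would replace this at scale)."""
    return min(cloud, key=lambda q: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def fit_rigid_2d(src, dst):
    """Least-squares rotation angle and translation for paired 2D points."""
    n = len(src)
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(q[0] for q in dst) / n
    dy = sum(q[1] for q in dst) / n
    dot = cross = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay = px - sx, py - sy   # centred source point
        bx, by = qx - dx, qy - dy   # centred target point
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    return theta, dx - (c * sx - s * sy), dy - (s * sx + c * sy)

def transform(points, theta, tx, ty):
    """Rotate by theta about the origin, then translate by (tx, ty)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]

def icp(source, target, iterations=10):
    """Repeat: match to nearest target points, fit a transform, apply it."""
    current = list(source)
    for _ in range(iterations):
        matched = [nearest(p, target) for p in current]
        theta, tx, ty = fit_rigid_2d(current, matched)
        current = transform(current, theta, tx, ty)
    return current

def rms(a, b):
    """Root-mean-square distance between index-paired point lists."""
    return math.sqrt(sum((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
                         for p, q in zip(a, b)) / len(a))

# An L-shaped toy cloud and a slightly rotated, shifted copy of it.
source = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (0.0, 1.0), (0.0, 2.0)]
target = transform(source, 0.1, 0.2, -0.1)
aligned = icp(source, target)
```

The toy misalignment is small, so nearest-neighbour matching finds the right pairs. With a large initial offset the matches can be wrong and the loop can settle in a local optimum, which is the sensitivity that a coarse registration stage is meant to reduce.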

15 pages, 5200 KiB  
Article
Few-Shot Image Classification Based on Swin Transformer + CSAM + EMD
by Huadong Sun, Pengyi Zhang, Xu Zhang and Xiaowei Han
Electronics 2024, 13(11), 2121; https://doi.org/10.3390/electronics13112121 - 29 May 2024
Cited by 1 | Viewed by 801
Abstract
In few-shot image classification (FSIC), the feature extraction module of the traditional convolutional neural networks is often constrained by the local nature of the convolutional kernel. As a result, it becomes challenging to handle global information and long-distance dependencies effectively. In order to address this problem, an innovative FSIC method is proposed in this paper, which integrates the Swin Transformer, CSAM, and Earth Mover’s Distance (EMD) technology (STCE). We utilize the Swin Transformer network for image feature extraction, and perform CSAM attention mechanism feature weighting on the output feature map, while we adopt the EMD algorithm to generate the optimal matching flow between the structural units, minimizing the matching cost. This approach allows for a more precise representation of the classification distance between images. We have conducted numerous experiments to validate the effectiveness of our algorithm. On three commonly used few-shot datasets, namely mini-ImageNet, tiered-ImageNet, and FC100, the one-shot and five-shot accuracy reaches the state of the art (SOTA) in FSIC; mini-ImageNet achieves an accuracy of 98.65 ± 0.1% for one-shot and 99.6 ± 0.2% for five-shot tasks, while tiered-ImageNet has an accuracy of 91.6 ± 0.1% for one-shot tasks and 96.55 ± 0.27% for five-shot tasks. For FC100, the accuracy is 64.1 ± 0.3% for one-shot tasks and 79.8 ± 0.69% for five-shot tasks. On two further few-shot datasets, CUB and CIFAR-FS, CUB achieves an accuracy of 83.1 ± 0.4% for one-shot and 92.88 ± 0.4% for five-shot tasks, while CIFAR-FS achieves an accuracy of 86.95 ± 0.2% for one-shot and 94 ± 0.4% for five-shot tasks. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)
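The Earth Mover's Distance used in this abstract measures the minimum cost of moving mass from one distribution to another. The paper applies it to structural units in feature space, which is a general transportation problem; the sketch below covers only the one-dimensional special case, where the optimal flow has a closed form. The function name and the unit spacing between bins are assumptions made for illustration.

```python
def emd_1d(p, q):
    """EMD between two histograms on the same unit-spaced bins.

    Sweeping left to right, any surplus mass must be carried into the next
    bin, so the total cost is the sum of absolute cumulative differences.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    if abs(sum(p) - sum(q)) > 1e-9:
        raise ValueError("histograms must have equal total mass")
    carried = 0.0   # mass currently being moved to the right (negative: left)
    cost = 0.0
    for a, b in zip(p, q):
        carried += a - b
        cost += abs(carried)
    return cost

# All mass moves two bins to the right.
far = emd_1d([1.0, 0.0, 0.0], [0.0, 0.0, 1.0])
# Half of the mass moves two bins (or, equivalently, all of it moves one bin).
near = emd_1d([0.5, 0.5, 0.0], [0.0, 0.5, 0.5])
```

Unlike a bin-by-bin difference, this distance grows with how far the mass has to travel, which is the property that makes it attractive as a matching cost.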

17 pages, 6111 KiB  
Article
Multi-Scale Target Detection in Autonomous Driving Scenarios Based on YOLOv5-AFAM
by Hang Ma, Wei Zhao, Bosi Liu and Wenbai Chen
Appl. Sci. 2024, 14(11), 4633; https://doi.org/10.3390/app14114633 - 28 May 2024
Viewed by 1052
Abstract
Multi-scale object detection is critically important in complex driving environments within the field of autonomous driving. To enhance the detection accuracy of both small-scale and large-scale targets in complex autonomous driving environments, this paper proposes an improved YOLOv5-AFAM algorithm. Firstly, the Adaptive Fusion Attention Module (AFAM) and Down-sampling Module (DownC) are introduced to increase the detection precision of small targets. Secondly, the Efficient Multi-scale Attention Module (EMA) is incorporated, enabling the model to simultaneously recognize small-scale and large-scale targets. Finally, a Minimum Point Distance IoU-based Loss Function (MPDIou-LOSS) is introduced to improve the accuracy and efficiency of object detection. Experimental validation on the KITTI dataset shows that, compared to the baseline model, the improved algorithm increased precision by 2.4%, recall by 2.6%, mAP50 by 1.5%, and mAP50-90 by an impressive 4.8%. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)
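The loss named in this abstract builds on intersection over union (IoU) and penalizes the distances between corresponding box corners. The sketch below assumes the commonly cited MPDIoU form, in which the squared distances between the top-left corners and between the bottom-right corners are each divided by the squared image diagonal; the exact variant used in the paper may differ. Boxes are assumed to be (x1, y1, x2, y2) tuples, and the function names are illustrative.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mpdiou_loss(pred, gt, img_w, img_h):
    """Assumed form: 1 - (IoU - d1^2 / (w^2 + h^2) - d2^2 / (w^2 + h^2))."""
    d1_sq = (pred[0] - gt[0]) ** 2 + (pred[1] - gt[1]) ** 2   # top-left corners
    d2_sq = (pred[2] - gt[2]) ** 2 + (pred[3] - gt[3]) ** 2   # bottom-right corners
    norm = img_w ** 2 + img_h ** 2
    return 1.0 - (iou(pred, gt) - d1_sq / norm - d2_sq / norm)

overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
perfect = mpdiou_loss((10, 10, 50, 50), (10, 10, 50, 50), 640, 480)
disjoint = mpdiou_loss((0, 0, 10, 10), (100, 100, 120, 120), 640, 480)
```

One consequence of the corner terms is that two non-overlapping boxes still produce a loss that depends on how far apart they are, whereas plain IoU is zero for every disjoint pair.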

12 pages, 5442 KiB  
Article
Image Enhancement of Steel Plate Defects Based on Generative Adversarial Networks
by Zhideng Jie, Hong Zhang, Kaixuan Li, Xiao Xie and Aopu Shi
Electronics 2024, 13(11), 2013; https://doi.org/10.3390/electronics13112013 - 22 May 2024
Cited by 1 | Viewed by 1070
Abstract
In this study, the problem of a limited number of data samples, which affects the detection accuracy, arises for the image classification task of steel plate surface defects under conditions of small sample sizes. A data enhancement method based on generative adversarial networks is proposed. The method introduces a two-way attention mechanism, which is specifically designed to improve the model’s ability to identify weak defects and optimize the model structure of the network discriminator, which augments the model’s capacity to perceive the overall details of the image and effectively improves the intricacy and authenticity of the generated images. By enhancing the two original datasets, the experimental results show that the proposed method improves the average accuracy by 8.5% across the four convolutional classification models. The results demonstrate the superior detection accuracy of the proposed method, improving the classification of steel plate surface defects. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

18 pages, 2977 KiB  
Article
Feature Maps Need More Attention: A Spatial-Channel Mutual Attention-Guided Transformer Network for Face Super-Resolution
by Zhe Zhang and Chun Qi
Appl. Sci. 2024, 14(10), 4066; https://doi.org/10.3390/app14104066 - 10 May 2024
Viewed by 1018
Abstract
Recently, transformer-based face super-resolution (FSR) approaches have achieved promising success in restoring degraded facial details due to their high capability for capturing both local and global dependencies. However, while existing methods focus on introducing sophisticated structures, they neglect the potential feature map information, limiting FSR performance. To circumvent this problem, we carefully design a pair of guiding blocks to dig for possible feature map information to enhance features before feeding them to transformer blocks. Relying on the guiding blocks, we propose a spatial-channel mutual attention-guided transformer network for FSR, for which the backbone architecture is a multi-scale connected encoder–decoder. Specifically, we devise a novel Spatial-Channel Mutual Attention-guided Transformer Module (SCATM), which is composed of a Spatial-Channel Mutual Attention Guiding Block (SCAGB) and a Channel-wise Multi-head Transformer Block (CMTB). SCATM on the top layer (SCATM-T) aims to promote both local facial details and global facial structures, while SCATM on the bottom layer (SCATM-B) seeks to optimize the encoded features. Considering that different scale features are complementary, we further develop a Multi-scale Feature Fusion Module (MFFM), which fuses features from different scales for better restoration performance. Quantitative and qualitative experimental results on various datasets indicate that the proposed method outperforms other state-of-the-art FSR methods. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

18 pages, 8838 KiB  
Article
Salient Object Detection via Fusion of Multi-Visual Perception
by Wenjun Zhou, Tianfei Wang, Xiaoqin Wu, Chenglin Zuo, Yifan Wang, Quan Zhang and Bo Peng
Appl. Sci. 2024, 14(8), 3433; https://doi.org/10.3390/app14083433 - 18 Apr 2024
Cited by 1 | Viewed by 1219
Abstract
Salient object detection aims to distinguish the most visually conspicuous regions, playing an important role in computer vision tasks. However, complex natural scenarios can challenge salient object detection, hindering accurate extraction of objects with rich morphological diversity. This paper proposes a novel method for salient object detection leveraging multi-visual perception, mirroring the human visual system’s rapid identification, and focusing on impressive objects/regions within complex scenes. First, a feature map is derived from the original image. Then, salient object detection results are obtained for each perception feature and combined via a feature fusion strategy to produce a saliency map. Finally, superpixel segmentation is employed for precise salient object extraction, removing interference areas. This multi-feature approach for salient object detection harnesses complementary features to adapt to complex scenarios. Competitive experiments on the MSRA10K and ECSSD datasets place our method in the first tier, achieving 0.1302 MAE and 0.9382 F-measure for the MSRA10K dataset and 0.0783 MAE and 0.9635 F-measure for the ECSSD dataset, demonstrating superior salient object detection performance in complex natural scenarios. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)
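The two scores quoted in this abstract, MAE and F-measure, can be computed from a saliency map and a binary ground-truth mask. The sketch below assumes saliency values in [0, 1], a fixed binarization threshold of 0.5, and the weighting β² = 0.3 that is conventional in the saliency literature; the paper's own thresholding scheme is not stated in the abstract and may differ.

```python
def mae(saliency, truth):
    """Mean absolute error between a saliency map and a binary mask."""
    diffs = [abs(s - t) for rs, rt in zip(saliency, truth) for s, t in zip(rs, rt)]
    return sum(diffs) / len(diffs)

def f_measure(saliency, truth, threshold=0.5, beta2=0.3):
    """F = (1 + beta2) * P * R / (beta2 * P + R) after thresholding."""
    tp = fp = fn = 0
    for rs, rt in zip(saliency, truth):
        for s, t in zip(rs, rt):
            predicted = s >= threshold
            if predicted and t:
                tp += 1
            elif predicted and not t:
                fp += 1
            elif t:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)

saliency = [[0.9, 0.2], [0.6, 0.1]]
truth = [[1, 0], [0, 0]]
error = mae(saliency, truth)
score = f_measure(saliency, truth)
```

Lower MAE and higher F-measure are better; β² below 1 weights precision more heavily than recall.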

13 pages, 3442 KiB  
Article
MDP-SLAM: A Visual SLAM towards a Dynamic Indoor Scene Based on Adaptive Mask Dilation and Dynamic Probability
by Xiaofeng Zhang and Zhengyang Shi
Electronics 2024, 13(8), 1497; https://doi.org/10.3390/electronics13081497 - 15 Apr 2024
Viewed by 1058
Abstract
Visual simultaneous localization and mapping (SLAM) algorithms in dynamic scenes will apply the moving feature points to the camera pose’s calculation, which will cause the continuous accumulation of errors. As a target-detection tool, mask R-CNN, which is often used in combination with the former, due to the limited training datasets, easily results in the semantic mask being incomplete and deformed, which will increase the error. In order to solve the above problems, we propose in this paper a visual SLAM algorithm based on an adaptive mask dilation strategy and the dynamic probability of the feature points, named MDP-SLAM. Firstly, we use the mask R-CNN target-detection algorithm to obtain the initial mask of the dynamic target. On this basis, an adaptive mask-dilation algorithm is used to obtain a mask that can completely cover the dynamic target and part of the surrounding scene. Then, we use the K-means clustering algorithm to segment the depth image information in the mask coverage area into absolute dynamic regions and relative dynamic regions. Combined with the epipolar constraint and the semantic constraint, the dynamic probability of the feature points is calculated, and then, the highly dynamic possible feature points are removed to solve an accurate final pose of the camera. Finally, the method is tested on the TUM RGB-D dataset. The results show that the MDP-SLAM algorithm proposed in this paper can effectively improve the accuracy of attitude estimation and has high accuracy and robustness in dynamic indoor scenes. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)
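Mask dilation, the first step described in this abstract, grows a binary segmentation mask so that it also covers pixels just outside the detected object. The sketch below uses a fixed square neighbourhood; the adaptive rule that the paper uses to choose the amount of dilation is not given in the abstract, so the `radius` parameter here is an illustrative assumption.

```python
def dilate(mask, radius=1):
    """Binary dilation with a (2 * radius + 1) square structuring element."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            # Mark every in-bounds neighbour of a foreground pixel.
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        out[ny][nx] = 1
    return out

# A single foreground pixel in the centre of a 5x5 mask.
mask = [[0] * 5 for _ in range(5)]
mask[2][2] = 1
grown = dilate(mask, radius=1)
covered = sum(map(sum, grown))
```

Feature points that fall inside the dilated mask can then be treated as candidates for removal, which compensates for masks that are incomplete at object boundaries.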

20 pages, 7943 KiB  
Article
Pushing the Boundaries of Solar Panel Inspection: Elevated Defect Detection with YOLOv7-GX Technology
by Yin Wang, Jingyong Zhao, Yihua Yan, Zhicheng Zhao and Xiao Hu
Electronics 2024, 13(8), 1467; https://doi.org/10.3390/electronics13081467 - 12 Apr 2024
Cited by 1 | Viewed by 1251
Abstract
During the maintenance and management of solar photovoltaic (PV) panels, how to efficiently solve the maintenance difficulties becomes a key challenge that restricts their performance and service life. Aiming at the multi-defect-recognition challenge in PV-panel image analysis, this study innovatively proposes a new algorithm for the defect detection of PV panels incorporating YOLOv7-GX technology. The algorithm first constructs an innovative GhostSlimFPN network architecture by introducing GSConv and depth-wise separable convolution technologies, optimizing the traditional neck network structure. Then, a customized 1 × 1 convolutional module incorporating the GAM (Global Attention Mechanism) attention mechanism is designed in this paper to improve the ELAN structure, aiming to enhance the network’s perception and representation capabilities while controlling the network complexity. In addition, the XIOU loss function is introduced in the study to replace the traditional CIOU loss function, which effectively improves the robustness and convergence efficiency of the model. In the training stage, the sample imbalance problem is effectively solved by implementing differentiated weight allocations for different images and categories, which promotes the balance of the training process. The experimental data show that the optimized model achieves 94.8% in the highest mAP value, which is 6.4% higher than the original YOLOv7 network, significantly better than other existing models, and provides solid theoretical and technical support for further research and application in the field of PV-panel defect detection. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

28 pages, 14693 KiB  
Article
Wildlife Real-Time Detection in Complex Forest Scenes Based on YOLOv5s Deep Learning Network
by Zhibin Ma, Yanqi Dong, Yi Xia, Delong Xu, Fu Xu and Feixiang Chen
Remote Sens. 2024, 16(8), 1350; https://doi.org/10.3390/rs16081350 - 11 Apr 2024
Cited by 4 | Viewed by 3084
Abstract
With the progressively deteriorating global ecological environment and the gradual escalation of human activities, the survival of wildlife has been severely impacted. Hence, a rapid, precise, and reliable method for detecting wildlife holds immense significance in safeguarding their existence and monitoring their status. However, due to the rare and concealed nature of wildlife activities, the existing wildlife detection methods face limitations in efficiently extracting features during real-time monitoring in complex forest environments. These models exhibit drawbacks such as slow speed and low accuracy. Therefore, we propose a novel real-time monitoring model called WL-YOLO, which is designed for lightweight wildlife detection in complex forest environments. This model is built upon the deep learning model YOLOv5s. In WL-YOLO, we introduce a novel and lightweight feature extraction module. This module is comprised of a deeply separable convolutional neural network integrated with compression and excitation modules in the backbone network. This design is aimed at reducing the number of model parameters and computational requirements, while simultaneously enhancing the feature representation of the network. Additionally, we introduced a CBAM attention mechanism to enhance the extraction of local key features, resulting in improved performance of WL-YOLO in the natural environment where wildlife has high concealment and complexity. This model achieved a mean accuracy (mAP) value of 97.25%, an F1-score value of 95.65%, and an accuracy value of 95.14%. These results demonstrated that this model outperforms the current mainstream deep learning models. Additionally, compared to the YOLOv5m base model, WL-YOLO reduces the number of parameters by 44.73% and shortens the detection time by 58%. This study offers technical support for detecting and protecting wildlife in intricate environments by introducing a highly efficient and advanced wildlife detection model. 
Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)
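The parameter savings mentioned in this abstract follow from replacing a standard convolution with a depthwise convolution followed by a 1 × 1 pointwise convolution. The sketch below counts weights for both layouts, ignoring biases; the channel counts and kernel size are illustrative assumptions, not values from the paper.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k):
    """Weights in a k x k depthwise conv plus a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in    # one k x k filter per input channel
    pointwise = c_in * c_out    # 1 x 1 conv that mixes channels
    return depthwise + pointwise

# Hypothetical layer: 3 x 3 kernel, 128 input channels, 256 output channels.
full = standard_conv_params(128, 256, 3)
separable = depthwise_separable_params(128, 256, 3)
ratio = separable / full
```

For this layer the ratio equals 1/c_out + 1/k², so the separable version needs roughly a ninth of the weights with a 3 × 3 kernel.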

18 pages, 687 KiB  
Article
MHDNet: A Multi-Scale Hybrid Deep Learning Model for Person Re-Identification
by Jinghui Wang and Jun Wang
Electronics 2024, 13(8), 1435; https://doi.org/10.3390/electronics13081435 - 10 Apr 2024
Cited by 2 | Viewed by 1226
Abstract
The primary objective of person re-identification is to identify individuals from surveillance videos across various scenarios. Conventional pedestrian recognition models typically employ convolutional neural network (CNN) and vision transformer (ViT) networks to extract features. While CNNs are adept at extracting local features through convolution operations, capturing global information can be challenging, especially when dealing with high-resolution images. In contrast, ViTs rely on cascaded self-attention modules to capture long-range feature dependencies, sacrificing local feature details. In light of these limitations, this paper presents the MHDNet, a hybrid network structure for pedestrian recognition that combines convolutional operations and self-attention mechanisms to enhance representation learning. The MHDNet is built around the Feature Fusion Module (FFM), which harmonizes global and local features at different resolutions. With a parallel structure, the MHDNet model maximizes the preservation of local features and global representations. Experiments on two person re-identification datasets demonstrate the superiority of the MHDNet over other state-of-the-art methods. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

11 pages, 1269 KiB  
Article
Hybrid-Margin Softmax for the Detection of Trademark Image Similarity
by Chenyang Wang, Guangyuan Zheng and Hongtao Shan
Appl. Sci. 2024, 14(7), 2865; https://doi.org/10.3390/app14072865 - 28 Mar 2024
Viewed by 973
Abstract
The detection of image similarity is critical to trademark (TM) legal registration and court judgment on infringement cases. Meanwhile, there are great challenges regarding the annotation of similar pairs and model generalization on rapidly growing data when deep learning is introduced into the task. The research idea of metric learning is naturally suited for the task where similarity of input is given instead of classification, but current methods are not targeted at the task and should be upgraded. To address these issues, loss-driven model training is introduced, and a hybrid-margin softmax (HMS) is proposed exactly based on the peculiarity of TM images. Two additive penalty margins are attached to the softmax to expand the decision boundary and develop greater tolerance for slight differences between similar TM images. With the HMS, a Siamese neural network (SNN) as the feature extractor is further penalized and the discrimination ability is improved. Experiments demonstrate that the detection model trained on HMS can make full use of small numbers of training data and has great discrimination ability on bigger quantities of test data. Meanwhile, the model can reach high performance with less depth of SNN. Extensive experiments indicate that the HMS-driven model trained completely on TM data generalized well on the face recognition (FR) task, which involves another type of image data. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

25 pages, 8266 KiB  
Article
Infrared Small Target Detection Based on Tensor Tree Decomposition and Self-Adaptive Local Prior
by Guiyu Zhang, Zhenyu Ding, Qunbo Lv, Baoyu Zhu, Wenjian Zhang, Jiaao Li and Zheng Tan
Remote Sens. 2024, 16(6), 1108; https://doi.org/10.3390/rs16061108 - 21 Mar 2024
Viewed by 1530
Abstract
Infrared small target detection plays a crucial role in both military and civilian systems. However, current detection methods face significant challenges in complex scenes, such as inaccurate background estimation, inability to distinguish targets from similar non-target points, and poor robustness across various scenes. To address these issues, this study presents a novel spatial–temporal tensor model for infrared small target detection. In our method, we introduce the tensor tree rank to capture global structure in a more balanced strategy, which helps achieve more accurate background estimation. Meanwhile, we design a novel self-adaptive local prior weight by evaluating the level of clutter and noise content in the image. It mitigates the imbalance between target enhancement and background suppression. Then, the spatial–temporal total variation (STTV) is used as a joint regularization term to help better remove noise and obtain better detection performance. Finally, the proposed model is efficiently solved by the alternating direction multiplier method (ADMM). Extensive experiments demonstrate that our method achieves superior detection performance when compared with other state-of-the-art methods in terms of target enhancement, background suppression, and robustness across various complex scenes. Furthermore, we conduct an ablation study to validate the effectiveness of each module in the proposed model. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

18 pages, 5507 KiB  
Article
Research on Coaxiality Measurement Method for Automobile Brake Piston Components Based on Machine Vision
by Qinghua Li, Weinan Ge, Hu Shi, Wanting Zhao and Shihong Zhang
Appl. Sci. 2024, 14(6), 2371; https://doi.org/10.3390/app14062371 - 11 Mar 2024
Cited by 1 | Viewed by 865
Abstract
Aiming at addressing the problem of the online detection of automobile brake piston components, a non-contact measurement method based on the combination of machine vision and image processing technology is proposed. Firstly, an industrial camera is used to capture an image, and a series of image preprocessing algorithms is used to extract a clear contour of a test piece with a unit pixel width. Secondly, based on the structural characteristics of automobile brake piston components, the region of interest is extracted, and the test piece is segmented into spring region and cylinder region. Then, based on mathematical morphology techniques, the edges of the image are optimized. We extract geometric feature points by comparing the heights of adjacent pixel points on both sides of the pixel points, so as to calculate the variation of the spring axis relative to the reference axis (centerline of the cylinder). Then, we extract the maximum variation from all images, and calculate the coaxiality error value using this maximum variation. Finally, we validate the feasibility of the proposed method and the stability of extracting geometric feature points through experiments. The experiments demonstrate the feasibility of the method in engineering practice, with the stability in extracting geometric feature points reaching 99.25%. Additionally, this method offers a new approach and perspective for coaxiality measurement of stepped shaft parts. Full article
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

19 pages, 11826 KiB  
Article
A Convolution with Transformer Attention Module Integrating Local and Global Features for Object Detection in Remote Sensing Based on YOLOv8n
by Kaiqi Lang, Jie Cui, Mingyu Yang, Hanyu Wang, Zilong Wang and Honghai Shen
Remote Sens. 2024, 16(5), 906; https://doi.org/10.3390/rs16050906 - 4 Mar 2024
Cited by 5 | Viewed by 3063
Abstract
Object detection in remote sensing scenarios, which leverages the power of convolutional neural networks (CNNs), plays an indispensable role in civilian, commercial, and military areas. Remote sensing images, captured by aircraft and satellites, exhibit unique characteristics including complicated backgrounds, limited features, distinct density, and varied scales. The contextual and comprehensive information in an image helps a detector localize and classify targets precisely, which is extremely valuable for object detection in remote sensing scenarios. However, CNNs, restricted by the nature of the convolution operation, have local receptive fields and capture little contextual information, even in large models. To address this limitation and improve detection performance by extracting global contextual information, we propose a novel plug-and-play attention module named Convolution with Transformer Attention Module (CTAM). CTAM is composed of a convolutional bottleneck block and a simplified Transformer layer, which facilitates the integration of local features and position information with long-range dependency. YOLOv8n, a superior and faster variant of the YOLO series, is selected as the baseline. To demonstrate the effectiveness and efficiency of CTAM, we incorporated it into YOLOv8n and conducted extensive experiments on the DIOR dataset. YOLOv8n-CTAM achieves 54.2 mAP@50-95, surpassing YOLOv8n (51.4) by a large margin. Notably, it outperforms the baseline by 2.7 mAP@70 and 4.4 mAP@90, showing its advantage under stricter IoU thresholds. Furthermore, experiments conducted on the TGRS-HRRSD dataset validate the generalization ability of CTAM.
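The IoU thresholds mentioned in the abstract above refer to intersection over union, the overlap ratio between a predicted box and a ground-truth box. The following is a minimal illustrative sketch, not code from the paper; the function name `iou` and the corner-based box format `(x1, y1, x2, y2)` are assumptions chosen for the example.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Boxes are (x1, y1, x2, y2) tuples with x1 < x2 and y1 < y2
    (an assumed format, chosen for this illustration).
    """
    # Corners of the overlapping rectangle, if any.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
# A detection counts as a match at a given threshold only if its IoU reaches it.
is_match_at_50 = overlap >= 0.50
is_match_at_70 = overlap >= 0.70
```

Raising the threshold from 0.50 to 0.70 or 0.90 demands tighter localization, which is why gains at the stricter settings are reported separately.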
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

26 pages, 29677 KiB  
Article
Development of a Powder Analysis Procedure Based on Imaging Techniques for Examining Aggregation and Segregation Phenomena
by Giuseppe Bonifazi, Paolo Barontini, Riccardo Gasbarrone, Davide Gattabria and Silvia Serranti
J. Imaging 2024, 10(3), 53; https://doi.org/10.3390/jimaging10030053 - 21 Feb 2024
Viewed by 1547
Abstract
This manuscript presents a method that uses classical image techniques to assess particle aggregation and segregation, with the primary goal of validating particle size distributions determined by conventional methods. The approach can serve as a supplementary tool in quality control systems for powder production processes in industries such as manufacturing and pharmaceuticals. The methodology involves the acquisition of high-resolution images, followed by their fractal and textural analysis. Fractal analysis plays a crucial role by quantitatively measuring the complexity and self-similarity of particle structures. It allows for the numerical evaluation of aggregation and segregation phenomena, providing insight into the underlying mechanisms. Textural analysis contributes to the characterization of patterns and spatial correlations observed in particle images. The examination of textural features offers an additional understanding of particle arrangement and organization and thereby aids in validating the accuracy of particle size distribution measurements. By incorporating fractal and structural analysis, a methodology that enhances the reliability and accuracy of particle size distribution validation is obtained. It enables the identification of irregularities, anomalies, and subtle variations in particle arrangements that might not be detected by traditional measurement techniques alone.
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

16 pages, 3114 KiB  
Article
Underwater Degraded Image Restoration by Joint Evaluation and Polarization Partition Fusion
by Changye Cai, Yuanyi Fan, Ronghua Li, Haotian Cao, Shenghui Zhang and Mianze Wang
Appl. Sci. 2024, 14(5), 1769; https://doi.org/10.3390/app14051769 - 21 Feb 2024
Cited by 1 | Viewed by 1166
Abstract
Images of underwater environments suffer from contrast degradation, reduced clarity, and information attenuation. The traditional approach relies on a global estimate of polarization. However, targets in water often have complex polarization properties. In low polarization regions, the polarization of the target is similar to that of the background, so traditional methods struggle to distinguish target from non-target regions. Therefore, this paper proposes a joint evaluation and partition fusion method. First, histogram stretching is used to preprocess two orthogonally polarized images, which increases the image contrast and enhances image detail. Then, the target is partitioned according to the value of each pixel of the polarization image, and the low and high polarization target regions are extracted based on these polarization values. The low polarization region is recovered using the polarization difference method, and the high polarization region is recovered using the joint estimation of multiple optimization metrics. Finally, the low polarization and high polarization regions are fused. Subjectively, the restored images are well recovered as a whole, and image information is fully retained. Our method can fully recover the low polarization region, effectively remove the scattering effect, and increase image contrast. Objectively, the evaluation indexes EME, Entropy, and Contrast show that our method performs significantly better than the other methods, which confirms the feasibility of the proposed algorithm for specific underwater scenarios.
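The histogram stretching step named in the abstract above can be illustrated with a simple linear (min-max) contrast stretch. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say which stretching variant is used, and the function name `stretch_contrast`, the nested-list image format, and the 0 to 255 output range are all chosen for the example.

```python
def stretch_contrast(image, out_min=0, out_max=255):
    """Linearly rescale pixel intensities to span [out_min, out_max].

    `image` is a list of rows of grayscale values (assumed format).
    """
    flat = [p for row in image for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # Flat image: there is no contrast to stretch, so return a copy.
        return [row[:] for row in image]
    scale = (out_max - out_min) / (hi - lo)
    # Map the darkest pixel to out_min and the brightest to out_max.
    return [[round(out_min + (p - lo) * scale) for p in row] for row in image]


dim = [[50, 60], [70, 100]]  # low-contrast toy image
stretched = stretch_contrast(dim)
```

In the toy image the intensities occupy only the range 50 to 100; after stretching they span the full output range, which is the contrast gain the preprocessing step is after.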
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

18 pages, 5785 KiB  
Article
Research on Rejoining Bone Stick Fragment Images: A Method Based on Multi-Scale Feature Fusion Siamese Network Guided by Edge Contour
by Jingjing He, Huiqin Wang, Rui Liu, Li Mao, Ke Wang, Zhan Wang and Ting Wang
Appl. Sci. 2024, 14(2), 717; https://doi.org/10.3390/app14020717 - 15 Jan 2024
Viewed by 1208
Abstract
The rejoining of bone sticks is of significant importance for studying the historical and cultural aspects of the Han Dynasty. Currently, the rejoining of bone inscriptions relies heavily on manual work by experts, which demands considerable time and energy. This paper introduces a multi-scale feature fusion Siamese network guided by edge contour (MFS-GC). Within a Siamese network framework, the model first uses a residual network to extract features of bone sticks and then computes the L2 distance for similarity measurement. During the extraction of feature vectors by the residual network, the BN layer tends to lose contour detail information, resulting in less distinctive features, especially along fractured edges. To address this issue, the Spatially Adaptive DEnormalization (SPADE) model is employed to guide the normalization with contour images of bone sticks. This ensures that the network can learn multi-scale boundary contour features at each layer. Finally, the extracted multi-scale fused features undergo similarity measurement for local matching of bone stick fragment images. Additionally, a Conjugable Bone Stick Dataset (CBSD) is constructed. In the experimental validation phase, the MFS-GC algorithm is compared with classical similarity calculation methods in terms of precision, recall, and missed detection rate. The experiments demonstrate that the MFS-GC algorithm achieves an average accuracy of 95.5% in the Top-15 on the CBSD. The findings of this research can contribute to solving the rejoining issues of bone sticks.
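The similarity step in the abstract above, computing an L2 distance between feature vectors and ranking candidates (as in a Top-15 evaluation), can be sketched as follows. This is an illustration only: the feature vectors here are hand-written stand-ins for network embeddings, and the names `l2_distance`, `top_k_matches`, and the fragment identifiers are hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k_matches(query, candidates, k):
    """Return the ids of the k candidates closest to the query embedding.

    `candidates` maps a fragment id to its feature vector; a smaller
    distance is treated as a more likely match.
    """
    ranked = sorted(candidates, key=lambda name: l2_distance(query, candidates[name]))
    return ranked[:k]


# Toy embeddings standing in for features extracted by a network.
candidates = {
    "frag_a": [0.0, 1.0],
    "frag_b": [3.0, 4.0],
    "frag_c": [0.1, 0.9],
}
best = top_k_matches([0.0, 1.0], candidates, k=2)
```

A Top-k accuracy then asks how often the true counterpart of a fragment appears among the k returned candidates.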
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

16 pages, 4061 KiB  
Article
EDF-YOLOv5: An Improved Algorithm for Power Transmission Line Defect Detection Based on YOLOv5
by Hongxing Peng, Minjun Liang, Chang Yuan and Yongqiang Ma
Electronics 2024, 13(1), 148; https://doi.org/10.3390/electronics13010148 - 29 Dec 2023
Cited by 5 | Viewed by 1407
Abstract
Detecting defects in power transmission lines through unmanned aerial inspection images is crucial for evaluating the operational status of outdoor transmission equipment. This paper presents a defect recognition method called EDF-YOLOv5, which is based on YOLOv5s, to enhance detection accuracy. Firstly, the EN-SPPFCSPC module is designed to improve the algorithm’s ability to extract information, thereby enhancing the detection performance for small target defects. Secondly, the algorithm incorporates a high-level semantic feature information extraction network, DCNv3C3, which improves its ability to generalize to defects of different shapes. Lastly, a new bounding box loss function, Focal-CIoU, is introduced to enhance the contribution of high-quality samples during training. The experimental results demonstrate that the enhanced algorithm achieves a 2.3% increase in mean average precision (mAP@0.5) for power transmission line defect detection and a 0.9% improvement in F1-score, and operates at a detection speed of 117 frames per second. These findings highlight the superior performance of EDF-YOLOv5 in detecting power transmission line defects.
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

14 pages, 7400 KiB  
Article
Non-Local Means Hole Repair Algorithm Based on Adaptive Block
by Bohu Zhao, Lebao Li and Haipeng Pan
Appl. Sci. 2024, 14(1), 159; https://doi.org/10.3390/app14010159 - 24 Dec 2023
Viewed by 963
Abstract
RGB-D cameras provide depth and color information and are widely used in 3D reconstruction and computer vision. In the majority of existing RGB-D cameras, a considerable portion of depth values is often lost due to severe occlusion or limited camera coverage, which adversely affects the precise localization and three-dimensional reconstruction of objects. In this paper, to address the poor quality of depth images captured by RGB-D cameras, a depth image hole repair algorithm based on non-local means is first proposed, leveraging the structural similarities between grayscale and depth images. Second, because choosing the size of the structural blocks in the non-local means hole repair method requires cumbersome parameter tuning, an intelligent block factor is introduced that automatically determines the optimal search and repair block sizes for various hole sizes, resulting in an adaptive block-based non-local means algorithm for repairing depth image holes. Furthermore, the proposed algorithm’s performance is evaluated on both the Middlebury stereo matching dataset and a self-constructed RGB-D dataset by comparing it against other methods using five metrics: RMSE, SSIM, PSNR, DE, and ALME. Finally, the experimental results demonstrate that the algorithm resolves the parameter tuning complexity inherent in depth image hole repair, effectively filling holes, suppressing noise within depth images, and enhancing image quality with improved precision and accuracy.
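Two of the evaluation metrics listed in the abstract above, RMSE and PSNR, have standard definitions that can be sketched briefly. This is a minimal illustration, not the paper's evaluation code; the nested-list image format, the function names, and the default peak value of 255 are assumptions made for the example.

```python
import math


def rmse(reference, repaired):
    """Root-mean-square error between two equally sized images."""
    squared = [
        (r - p) ** 2
        for ref_row, rep_row in zip(reference, repaired)
        for r, p in zip(ref_row, rep_row)
    ]
    return math.sqrt(sum(squared) / len(squared))


def psnr(reference, repaired, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    error = rmse(reference, repaired)
    if error == 0:
        return math.inf
    return 20.0 * math.log10(max_value / error)


# Toy depth maps: the repaired map differs from the reference in one pixel.
reference = [[10, 20], [30, 40]]
repaired = [[10, 20], [30, 50]]
error = rmse(reference, repaired)
quality_db = psnr(reference, repaired)
```

A lower RMSE and a higher PSNR both indicate that the repaired depth map lies closer to the reference.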
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)