U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation
Abstract
Multimodal semantic segmentation is a pivotal component of computer vision and typically surpasses unimodal methods by utilizing rich information from various sources. Current models frequently adopt modality-specific frameworks that are inherently biased toward certain modalities. Although these biases might be advantageous in specific situations, they generally limit the adaptability of the models across different multimodal contexts, thereby potentially impairing performance. To address this issue, we leverage the inherent capabilities of the model itself to discover the optimal equilibrium in multimodal fusion and introduce U3M: An Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation. Specifically, this method involves an unbiased integration of multimodal visual data. Additionally, we employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features. Experimental results demonstrate that our approach achieves superior performance across multiple datasets, verifying its efficacy in enhancing the robustness and versatility of semantic segmentation in diverse settings. Our code is available at U3M-multimodal-semantic-segmentation.
Index Terms:
Semantic Segmentation, Multi-Modality, Multi-Scale Fusion, Unbiased Modality Fusion
I Introduction
Semantic segmentation [1, 2, 3] is a crucial task within the field of computer vision, with applications spanning domains such as scene understanding [4, 5, 6, 7] and autonomous driving [8, 9]. RGB-based semantic segmentation (depicted in Fig. 1(a)), serving as a foundational task, is suitable for the analysis of most scenes and has been extensively explored by the research community, yielding many impressive works and experimental results [10, 11, 12]. However, employing only RGB channels to segment certain complex and special scenes presents challenges, particularly where RGB information is elusive [13, 5]. In contrast to RGB cameras, which depend on visible light and often falter in darkness, thermal infrared (TIR) sensors directly detect heat emissions from objects, offering substantial contrast in the absence of light [9, 7]. Other modalities such as depth [6, 14] and LiDAR [15, 16] can also provide additional visual semantic information and are increasingly being integrated into semantic segmentation efforts, as illustrated in Fig. 1(b).
In the context of multimodal semantic segmentation, multimodal feature extraction, serving as the foundation for downstream tasks, has been extensively explored in numerous studies [1, 10]. Although these models have demonstrated exceptional performance in certain scenarios and datasets, they are typically trained from scratch on specific small-scale datasets. The limited number of images and scenes in multimodal visual training datasets leads to poor generalization across different scenes and data, and it also limits feature extraction capability. In contrast to the aforementioned models trained from scratch, recent pretrained models, mostly based on Convolutional Neural Networks (CNNs) [17] and Transformers [18], have shown strong performance on downstream tasks by building models with large-scale parameters and pretraining on vast amounts of visual images. The rise of large-scale visual pretraining models has led to a series of studies exploring their application in semantic segmentation. For instance, a large-scale pretrained convolutional neural network [17] demonstrated excellent semantic segmentation capabilities and performance by pretraining on ImageNet. Building on this well-established pretrained model, [14] introduced ACNet, which refines feature extraction through asymmetric convolution blocks, enhancing network accuracy and robustness. Compared with CNN-based pretrained models, Transformers offer a larger receptive field [18] and stronger global modeling capabilities. Recent research [19, 20] using plain vision Transformers for information extraction struggles with multiscale semantic information. To address this, Transformers with moving and multiscale windows have been explored [21, 22]. Multiscale information extractors have proven more practical. To balance speed and accuracy, we use the hierarchical multiscale vision Transformer SegFormer [3] as the modal information extractor.
Although extracted by large-scale pretrained models with strong visual information extraction capabilities, multimodal visual semantic information still presents some domain gaps. To bridge these gaps, novel fine-tuning strategies using prompting techniques have been proposed [23], efficiently adapting pretrained models to multimodal inputs with minimal updates. Another approach involves using pretrained models as modality extractors, followed by fine-tuning with limited multimodal data, as shown in Fig. 1(c). However, this often leads to catastrophic forgetting. Therefore, designing an additional feature fusion module is receiving increasing attention. For example, [8] designed an additional dual attention network that complements the pretrained feature extractor only at the low-resolution feature map (Fig. 1(d)), which does not affect the original pretrained model. However, using single-scale information presents problems such as weak local spatial information and low input resolution. To address these issues, FuseNet [6] integrates multiscale RGB-Depth multimodal features at every encoder stage, effectively improving performance on the SUN RGB-D benchmark. Similarly, CACFNet [7] incorporates a cross-modal attention fusion module that extracts multiscale RGB-thermal information. In addition, extensive research [15, 24] has demonstrated that multiscale modal fusion modules (as shown in Fig. 1(d)) enhance the adaptive integration of multimodal information.
Despite the excellent results achieved by existing work, particularly [13, 24], most of the multimodal fusion methods discussed above, notably CMNeXt [15], exhibit some modality bias. Specifically, these methods typically prioritize one modality as dominant and treat others as auxiliary (generally in the format of RGB+Xs) [25]. However, these approaches overlook the dynamic dominant correlation within multimodal data [25, 26], which hampers the ability to fully utilize complementary multimodal information in complex scenarios (as illustrated in Fig. 2), thus limiting performance.
To avoid the inherent modal bias in model design and facilitate the fusion of multi-scale modal information, we first introduce an Unbiased Multiscale Modal Fusion Model (U3M) by treating all modalities equitably and performing multiscale fusion, which is different from the existing technology. This strategy enables the model to autonomously generate modal preferences applicable to various segmentation scenarios. Secondly, based on the fusion paradigm, we develop two multiscale fusion modules utilizing multiscale pooling and convolution, which effectively integrate and fuse global and local information across different scales for multimodal information. Finally, the model’s efficacy is validated by outstanding results across multiple datasets. Our contributions are summarized as follows:
• We demonstrate the presence of modal bias in model design and develop an unbiased modal fusion methodology. This approach leverages the inherent properties of the model to autonomously generate modality preferences, substantially reducing biases introduced by manual design interventions.
• We design a multiscale model fusion layer that incorporates multiscale convolution and pooling. In this way, the fusion layer enhances the capability for multiscale modal fusion throughout various feature encoding stages.
II Related Works
II-A Semantic Segmentation
The field of semantic segmentation has seen significant advancements, particularly with the advent of fully convolutional networks that revolutionized pixel classification [10]. Based on this, some improvements such as multi-scale feature extraction and fusion have proven effective [1, 27]. For instance, Chen et al. further developed DeepLab by integrating an encoder-decoder architecture for efficient multi-scale context aggregation [2]. Similarly, Zhao et al. introduced a pyramid pooling module to aggregate context from varied regions at multiple scales [12].
Channel and self-attention mechanisms have evolved to capture more global semantic information. Lin et al. developed RefineNet, employing multi-path refinement networks with channel attention for high-resolution segmentation [28]. Choi et al. proposed CARS, emphasizing channel-wise attention for region-based segmentation [29]. Another paradigm, context-based refinement, integrates extensive background contextual information [11, 30]. Other context-based models refine segmentation through adaptive feature recalibration [31], while [32] developed OCNet to enhance semantic understanding by aggregating object context.
Edge detection techniques also serve as complementary cues for semantic segmentation. For example, Li et al. improved edge detection using deep learning [33], and Borse et al. proposed InverseForm to enhance edge detection accuracy in complex images [34]. The recent adoption of vision transformers for recognition tasks has led to the development of dense prediction transformers specialized for semantic segmentation. Notable examples include CSWin Transformer, which captures long-range dependencies using cross-shaped windows [21], and HRFormer, which integrates high-resolution representations for detailed segmentation [22]. These advancements have facilitated the segmentation of discrete objects and amorphous regions [35, 36], with transformers now incorporating token mixing via attention mechanisms [37], multi-layer perceptron elements [38, 39], and pooling and convolutional blocks [40].
Despite setting new benchmarks in image segmentation, challenges persist, particularly under real-life conditions where RGB images are inadequate, such as in low-light environments or when capturing fast-moving subjects. Consequently, multimodal semantic segmentation is garnering increasing attention.
II-B Multimodality Semantic Segmentation
Multimodal semantic segmentation is increasingly recognized for its ability to integrate diverse modal data, effectively compensating for the inherent limitations of each modality [15, 13, 16]. Zhou et al. [41] integrated edge-aware features for enhanced semantic segmentation of RGB and thermal images, improving object boundary delineation in multimodal scenarios. Deng et al. [42] incorporated an attention mechanism that boosts feature representation for real-time RGB-thermal semantic segmentation. Zhao et al. [43] utilized deep image decomposition to fuse infrared and visible images, preserving critical features from both modalities. Huang et al. [44] employed a recurrent network to iteratively refine multi-modality image fusion, enhancing detail preservation and reducing artifacts. Additionally, in RGB-depth modality fusion, ACNet [14] employs an attention mechanism to optimize the usage of RGB and depth data for improved semantic segmentation accuracy, especially in scenarios with complex visibility. FuseNet [6] integrates RGB and depth data using a dual-stream CNN, leveraging depth as an auxiliary input to enhance segmentation accuracy. The evolution of model frameworks has progressed from those based on CNN [1, 2] to those founded on Transformers [3, 40, 21]. This transition facilitates a more nuanced analysis of the interplay between global semantics and local features, enhancing feature extraction. For modal fusion, some techniques employ attention mechanisms to integrate different modalities [14]. For example, CACFNet [7] utilizes cross-attention mechanisms to selectively enhance the integration of contextual information from different modalities to improve semantic correlation and feature extraction efficiency. Other approaches employ convolution as a feature fusion extraction module [45]; Reza et al. [26] introduced specialized convolution layers designed to process and merge information from multiple modalities effectively. 
EGFNet [5] uses gated convolution to selectively fuse the most relevant features from diverse modalities, improving fusion effectiveness. Apart from that, Zhang et al. [15] innovated by adapting pooling strategies to multimodal contexts, optimizing feature reduction and abstraction processes to better accommodate the diverse characteristics of different data types. Despite these advancements, many existing modal fusion models rely excessively on RGB images, often involving specialized feature extractors for the RGB channels. This strategy risks neglecting the variable significance of different modalities across various scenarios. To address this, we propose the Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
III Method
In this section, we first present the problem definition, followed by a detailed presentation of the proposed U3M.
III-A Problem Definition
Multimodal semantic segmentation seeks to assign a semantic label to each pixel in an image from predefined categories, such as ’car’, ’tree’, or ’road’. Distinct from traditional semantic segmentation which depends on a single modality, typically RGB images, multimodal semantic segmentation leverages data from various sources or sensors to improve the accuracy and robustness of segmentation.
Consider a set of images $\{I_m\}_{m=1}^{M}$, where $I_m \in \mathbb{R}^{H \times W \times C}$ represents the $m$-th modality out of $M$ total modalities. Here, $H$, $W$, and $C$ denote the height, width, and channel size of the image, respectively. The objective of multimodal semantic segmentation is to compute a segmentation map $Y \in \mathcal{L}^{H \times W}$. Here, $\mathcal{L}$ denotes the set of semantic labels.
Each encoder () processes the corresponding modality , yielding feature maps . These are fused by to produce a combined representation (= in this paper). Finally, is obtained through a decoder .
The effectiveness of a multimodal semantic segmentation model is typically evaluated based on its ability to accurately segment objects and scenes under varying conditions, making use of the additional information provided by the various modalities to overcome the limitations of single-modality segmentation.
III-B Overall model Architecture
The schema of our model’s architecture is presented in Fig. 3, accommodating $M$ discrete modalities. The architecture’s sophistication lies in its utilization of modality-specific encoders, which are tailored to distill unique feature hierarchies from each modality, encapsulated by the equation:
$$F_m = \mathrm{Encoder}_m(I_m) \tag{1}$$
where $I_m$ represents the modality-specific input imagery for $m \in \{1, \dots, M\}$, and $\mathrm{Encoder}_m$ is the corresponding encoder. These encoders are adept at generating a spectrum of feature maps at diminished resolutions, precisely $\{1/4, 1/8, 1/16, 1/32\}$ of the initial resolution, collectively represented as $F_m = \{F_m^1, F_m^2, F_m^3, F_m^4\}$. For the sake of brevity, the feature map’s dimensions at the $i$-th encoding stage are denoted as $H_i \times W_i \times C_i$, with $i$ traversing {1, 2, 3, 4}. A quartet of fusion blocks, each allied to a corresponding encoder stage, is orchestrated to merge the features from each encoding cascade. The prominent features from every modality are assimilated within the $i$-th fusion module:
$$F^i = \mathrm{FusionBlock}^i\left(\{F_m^i\}_{m=1}^{M}\right) \tag{2}$$
This integrative process yields a synthesized feature composite $F = \{F^1, F^2, F^3, F^4\}$, with $F^i$ signifying the fused feature at the $i$-th stage. Culminating the process, these aggregated features are input into an MLP decoder, as elucidated in [3], to extrapolate the segmentation contours.
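As a quick illustration of the stage layout described above, the following minimal sketch computes the per-stage feature-map shapes for a given input size. The strides 4, 8, 16, and 32 correspond to the four reduced resolutions of the hierarchical encoder in [3]; the channel widths and the helper name `stage_shapes` are assumptions chosen only for illustration and are not taken from the paper.

```python
def stage_shapes(height, width, channels=(64, 128, 320, 512)):
    """Return (H_i, W_i, C_i) for the four encoder stages.

    Strides 4, 8, 16, 32 give feature maps at 1/4, 1/8, 1/16 and 1/32
    of the input resolution. The channel widths are an assumed example
    configuration, not values reported in the paper.
    """
    strides = (4, 8, 16, 32)
    return [(height // s, width // s, c) for s, c in zip(strides, channels)]


if __name__ == "__main__":
    # One fusion block is attached to each of these four stages.
    for i, shape in enumerate(stage_shapes(512, 512), start=1):
        print(f"stage {i}: {shape}")
```

Every modality passes through its own encoder, so each fusion block receives one feature map of the listed shape per modality.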
III-C Modality Feature Encoder
Utilizing a mix transformer encoder as referenced in [3], our system effectively extracts hierarchical features from various input data types. Every image undergoes a patch embedding process, getting segmented into patches as per the method described in [3], before being processed by the mix transformer encoder units. Illustrated in Fig. 3 is the mix transformer unit’s structure, with $X_m^i$ denoting the $m$-th modal input for the $i$-th mix transformer, which has the shape of $H_i \times W_i \times C_i$. This input is then reshaped into an $N_i \times C_i$ matrix, where $N_i$ equates to $H_i W_i$, for utilization as the query $Q$, the key $K$, and the value $V$. In an effort to curtail computational demands, we employ spatial reduction as suggested by [3], leveraging a reduction ratio $R$. The matrices $K$ and $V$ undergo an initial transformation into $\frac{N_i}{R} \times (C_i R)$ matrices, followed by a remapping into $\frac{N_i}{R} \times C_i$ matrices via a linear transformation process. Subsequently, a conventional Multi-Head Self-attention Mechanism (MHSA) is employed to map $Q$, $K$, and $V$ into intermediate representations as delineated by:
$$\mathrm{MHSA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O \tag{3}$$
$$\mathrm{head}_j = \mathrm{Attention}(Q W_j^Q, K W_j^K, V W_j^V) \tag{4}$$
In this context, $h$ signifies the total number of attention heads, with $W_j^Q$, $W_j^K$, $W_j^V$, and $W^O$ serving as the respective projection matrices within the spaces $\mathbb{R}^{C_i \times d_K}$, $\mathbb{R}^{C_i \times d_K}$, $\mathbb{R}^{C_i \times d_V}$, and $\mathbb{R}^{h d_V \times C_i}$, and $d_K$, $d_V$ denoting the dimensions of $K$ and $V$. The Attention function is characterized as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{Q K^{\top}}{\sqrt{d_K}}\right) V \tag{5}$$
Here, $Q$, $K$, and $V$ correspond to the input query, key, and value matrices, respectively. This MHSA phase is succeeded by a mixing layer composed of two MLPs and a $3 \times 3$ convolutional layer, which provides the necessary positional encoding within the transformer encoder to maximize segmentation efficacy as noted in [3]. The computation within this layer is expressed as:
$$\hat{X}_{in}^i = \mathrm{MHSA}(Q, K, V) \tag{6}$$
$$\hat{X}_{out}^i = \mathrm{MLP}\left(\mathrm{GELU}\left(\mathrm{Conv}_{3 \times 3}\left(\mathrm{MLP}(\hat{X}_{in}^i)\right)\right)\right) + \hat{X}_{in}^i \tag{7}$$
To conclude, an overlapping patch merging step is applied, following [3], to produce the final output of the block.
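The spatial-reduction attention described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration under stated assumptions, not the authors’ implementation: the reduction ratio is called `R` here, the weight names (`w_q`, `w_k`, `w_v`, `w_sr`) are hypothetical, and multi-head concatenation, the mixing layer, and normalization are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def spatial_reduction_attention(x, w_q, w_k, w_v, w_sr, R):
    """Single-head attention whose keys/values use a shortened sequence.

    x    : (N, C) flattened feature map, N = H * W
    w_sr : (C * R, C) maps the reshaped (N / R, C * R) sequence back to C channels
    w_q, w_k, w_v : (C, d) projection matrices (illustrative names)
    """
    N, C = x.shape
    assert N % R == 0, "sequence length must be divisible by the reduction ratio"
    x_red = x.reshape(N // R, C * R) @ w_sr   # (N / R, C)
    q = x @ w_q                                # (N, d)
    k = x_red @ w_k                            # (N / R, d)
    v = x_red @ w_v                            # (N / R, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (N, N / R)
    return softmax(scores, axis=-1) @ v        # (N, d)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, C, d, R = 16, 8, 4, 4
    out = spatial_reduction_attention(
        rng.normal(size=(N, C)),
        rng.normal(size=(C, d)), rng.normal(size=(C, d)), rng.normal(size=(C, d)),
        rng.normal(size=(C * R, C)), R,
    )
    print(out.shape)
```

The attention matrix has shape N by N/R instead of N by N, which is where the computational saving comes from.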
III-D Pyramidal Multiscale Modal Fusion Layer
Feature fusion post-hierarchical extraction is performed via a designated fusion block depicted in Fig. 4. This block integrates features from modality-specific encoders across all four stages. Considering $F_m^i$ as the input features for the $i$-th block, where $F_m^i \in \mathbb{R}^{H_i \times W_i \times C_i}$, sourced from the $m$-th modality, we first merge these along the channel axis, obtaining $F_{\mathrm{cat}}^i \in \mathbb{R}^{H_i \times W_i \times M C_i}$. Subsequent reduction and combination of channels is achieved through a linear layer, outputting $\hat{F}^i$ with a reduced channel count of $C_i$. This process is mathematically formulated as:
$$\hat{F}^i = \mathrm{Linear}\left(F_1^i \,\|\, \cdots \,\|\, F_M^i\right) \tag{8}$$
In this equation, $\|$ signifies the concatenation of the modal features within the channel space, and the linear layer scales down an $M C_i$-dimensional input to a $C_i$-dimensional output.
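The concatenate-then-project step above treats every modality symmetrically: no modality is singled out as the primary stream before the linear layer. A minimal NumPy sketch follows, assuming hypothetical names (`linear_fuse`, `weight`, `bias`) and randomly chosen inputs. As a sanity check, weights made of stacked identity blocks scaled by 1/M reduce the fusion to a plain average of the modalities.

```python
import numpy as np


def linear_fuse(features, weight, bias):
    """Concatenate M modality features along channels and project back to C.

    features : list of M arrays, each of shape (H, W, C)
    weight   : (M * C, C) projection matrix
    bias     : (C,) bias vector
    """
    stacked = np.concatenate(features, axis=-1)   # (H, W, M * C)
    return stacked @ weight + bias                # (H, W, C)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M, H, W, C = 4, 8, 8, 16
    feats = [rng.normal(size=(H, W, C)) for _ in range(M)]

    # Special case: identity blocks scaled by 1/M give the mean over modalities.
    w_mean = np.concatenate([np.eye(C) / M] * M, axis=0)
    fused = linear_fuse(feats, w_mean, np.zeros(C))
    print(fused.shape, np.allclose(fused, np.mean(feats, axis=0)))
```

In a trained model the projection weights are learned, so the relative contribution of each modality is determined by the data instead of being fixed by the architecture.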
To enhance the extraction of information across multiple scales, a multiscale feature extractor is proposed. Specifically, we have designed two modules: the Pyramidal Convolution-Based Multimodal Information Fusion Layer and the Pyramidal Pooling-Based Multimodal Information Fusion Layer. They respectively utilize convolution and pooling operations across various dimensions to discern information of different granularities within the multimodal fusion features.
1) Pyramidal Pooling-Based Multimodal Information Fusion Layer: The th linear fusion feature is refined using a multiscale feature fusion module. This unit is structured with a pair of convolutional projection layers, enclosing average pooling operations, to facilitate dimensional interlacing. Employing convolutions of sizes , , , and , the module adeptly captures and consolidates features across scales, enhancing them with residual data. Sandwiching the average pooling, the dual projection layers are instrumental in the comprehensive integration of the feature landscape.
(9) |
(10) |
(11) |
(12) |
2) Pyramidal Convolution-Based Multimodal Information Fusion Layer: Similar to the pooling fusion layer, the enhancement of the th linear fusion feature is conducted through a dedicated multiscale feature fusion apparatus. This configuration entails a sequence of convolutional projection layers that enwrap average pooling operations, thereby facilitating sophisticated feature interweaving. With convolutional kernels of dimensions , , and , analogous to those employed by [26],
(13) |
(14) |
(15) |
(16) |
The two varieties of fine-grained features extracted are fused via addition,
(17) |
Subsequently, passes through a linear layer and a Channel Attention Mechanism (CA) [46] for feature refinement.
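To make the idea of a pooling-based multiscale branch concrete, here is a simplified NumPy sketch: the fused feature is average-pooled at several window sizes (stride 1, same spatial size) and the results are added back to the input as a residual. The kernel sizes, the edge padding, and the function names are assumptions made for illustration. The convolutional projection layers, the convolution branch, and the channel attention described above are deliberately omitted, so this is not the layer defined in the paper.

```python
import numpy as np


def avg_pool_same(x, k):
    """Average pooling with a k x k window, stride 1, output size equal to input.

    x : (H, W, C) feature map; borders are handled with edge padding (an assumption).
    """
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    H, W, _ = x.shape
    out = np.zeros(x.shape, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += xp[dy:dy + H, dx:dx + W]
    return out / (k * k)


def pyramid_pool_residual(x, kernel_sizes=(3, 5, 7)):
    """Sum of multiscale average-pooled features plus the residual input."""
    out = x.astype(float)
    for k in kernel_sizes:
        out = out + avg_pool_same(x, k)
    return out


if __name__ == "__main__":
    x = np.random.default_rng(0).normal(size=(16, 16, 8))
    print(pyramid_pool_residual(x).shape)
```

Larger windows summarize broader context while the residual term preserves local detail, which is the general motivation for combining several scales.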
III-E Shared Segmentation Head
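The fused features generated from all the 4 fusion blocks are sent to the shared MLP decoder. We use the decoder design proposed in [3]. The segmentation head shown in Fig. 3 can be represented as the following equations:
$$\hat{F}^i = \mathrm{Linear}(C_i, C)(F^i), \quad \forall i \tag{18}$$
$$\hat{F}^i = \mathrm{Upsample}\left(\tfrac{H}{4} \times \tfrac{W}{4}\right)(\hat{F}^i), \quad \forall i \tag{19}$$
$$\bar{F} = \mathrm{Linear}(4C, C)\left(\mathrm{Concat}(\hat{F}^1, \dots, \hat{F}^4)\right) \tag{20}$$
$$P = \mathrm{Linear}(C, N_{cls})(\bar{F}) \tag{21}$$
Here, $C$ is the shared channel dimension and $N_{cls}$ is the number of classes. The first linear layers take the fused features of different shapes and generate features having the same channel dimension. Then the features are up-sampled to $1/4$ of the original input shape, concatenated along the channel dimension and passed through another linear layer to generate the final fused feature $\bar{F}$. Finally, $\bar{F}$ is passed through the last linear layer to generate the predicted segmentation map $P$.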
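The decoder steps described above (project each stage to a common width, upsample to the 1/4-resolution grid, concatenate, fuse, classify) can be sketched in NumPy as follows. This is an illustrative sketch under assumptions: all weight names are hypothetical, biases and nonlinearities are left out, and nearest-neighbor upsampling stands in for whatever interpolation the actual decoder of [3] uses.

```python
import numpy as np


def upsample_nearest(x, out_h, out_w):
    """Nearest-neighbor upsampling of an (H, W, C) map to (out_h, out_w, C)."""
    H, W, _ = x.shape
    rows = (np.arange(out_h) * H) // out_h
    cols = (np.arange(out_w) * W) // out_w
    return x[rows][:, cols]


def mlp_head(fused, proj_weights, w_fuse, w_cls):
    """All-MLP segmentation head over four fused feature maps.

    fused        : list of 4 arrays (H_i, W_i, C_i); fused[0] is the 1/4-resolution map
    proj_weights : list of 4 matrices (C_i, C) unifying the channel width
    w_fuse       : (4 * C, C) matrix applied after concatenation
    w_cls        : (C, n_classes) classification matrix
    """
    target_h, target_w = fused[0].shape[:2]
    unified = [
        upsample_nearest(f @ w, target_h, target_w)
        for f, w in zip(fused, proj_weights)
    ]
    merged = np.concatenate(unified, axis=-1) @ w_fuse   # (H/4, W/4, C)
    return merged @ w_cls                                 # (H/4, W/4, n_classes)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sizes = [(32, 32, 8), (16, 16, 16), (8, 8, 32), (4, 4, 64)]
    C, n_classes = 16, 5
    fused = [rng.normal(size=s) for s in sizes]
    proj = [rng.normal(size=(s[2], C)) for s in sizes]
    logits = mlp_head(fused, proj, rng.normal(size=(4 * C, C)), rng.normal(size=(C, n_classes)))
    print(logits.shape)
```

Taking the argmax over the last axis of the returned logits yields a per-pixel label map at 1/4 of the input resolution.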
IV Experiments
IV-A Experimental Setup and Parameters
Experiment Configuration. All experiments were conducted on a computational platform equipped with 4 NVIDIA GeForce RTX 4090 GPUs. To ensure the reproducibility of results, the experiments were performed under consistent hardware and software configurations.
Parameters. Our model utilized uniform training parameters across all datasets. The learning rate was initially set at , with the Adam optimizer used for adjustments. The batch size was established at 4, with a total training epoch of 400 epochs for the Mcubes dataset and 120 for the FMB dataset. Cross-entropy loss function was employed. Data augmentation techniques, including random rotations, scaling, and horizontal flipping, were applied to enhance the model’s generalization capabilities.
IV-B Dataset
Mcubes. [16] The MCubeS dataset contains sets of RGB, Near-Infrared (NIR), Degree of Linear Polarization (DoLP), and Angle of Linear Polarization (AoLP) pairs. It is designed for researching semantic material segmentation across 20 categories. The dataset comprises 302/96/102 image pairs allocated for training/validation/testing, all sized at 1224×1024.
FMB. [9] The FMB dataset is a comprehensive multi-modality benchmark designed for image fusion and segmentation. It contains 1,500 well-registered pairs of infrared and visible images, annotated with 15 pixel-level categories. The training and test sets contain 1,220 and 280 image pairs, respectively. These images encompass a variety of real-world conditions such as dense fog and low-light scenarios, making them particularly suitable for autonomous driving and semantic understanding applications. The dataset aims to improve the generalization capabilities of fusion and segmentation models across diverse environmental conditions.
IV-C Experimental Results
1) Results on MCubeS Dataset. Table I compares different methods based on mean Intersection over Union (mIoU) percentage, which is a common metric for evaluating the accuracy of segmentation models. The methods use the RGB-A-D-N modality combination, where A, D, and N represent angle of linear polarization (AoLP), degree of linear polarization (DoLP), and near-infrared (NIR), respectively. Our model achieves the highest mIoU of 51.69, outperforming all other listed methods. The closest competitor, CMNeXt, achieves a mIoU of 51.54, so our model’s lead of 0.15 points is narrow. This suggests that U3M handles multimodal inputs effectively without designating a dominant modality. The performance of U3M also suggests its potential applicability in real-world scenarios where precise material identification is crucial, such as in autonomous driving environments or quality control in manufacturing.
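For readers less familiar with the metric, the sketch below computes mIoU from a confusion matrix in plain Python. The function name and the toy matrix are illustrative, and the convention of skipping classes that never appear in either prediction or ground truth is an assumption. Evaluation toolkits differ on this detail.

```python
def mean_iou(confusion):
    """Mean Intersection over Union from a square confusion matrix.

    confusion[r][c] counts pixels whose true class is r and predicted class is c.
    Classes absent from both ground truth and prediction are skipped (assumed convention).
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fp = sum(confusion[r][c] for r in range(n)) - tp   # predicted c, actually another class
        fn = sum(confusion[c]) - tp                         # actually c, predicted another class
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious)


if __name__ == "__main__":
    toy = [[3, 1],
           [1, 5]]
    # Class 0: 3 / (3 + 1 + 1); class 1: 5 / (5 + 1 + 1)
    print(mean_iou(toy))
```

Because every class contributes equally to the mean, a rare class with poor IoU lowers mIoU as much as a frequent one.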
The per-class IoU results on the MCubeS dataset provide a nuanced view of the segmentation performance across diverse material categories, as shown in Table II. Our proposed model demonstrates robust performance, often surpassing state-of-the-art models in per-class IoU metrics. Notably, our model achieves superior results in classes such as ’Leaf’ (76.4), ’Water’ (63.8), and ’Cobblestone’ (73.5), indicating its proficiency in handling complex textures and varied lighting conditions that these materials present. However, some categories like ’Human’ (12.2) and ’Plastic’ (26.3) exhibit weaker performance, suggesting potential areas for further model refinement. This underperformance could be attributed to the challenges associated with the high variability in human appearances and the often subtle differences in plastic materials’ visual characteristics under different conditions. Comparatively, our model’s performance in the ’Gravel’ (71.7) and ’Leaf’ (76.4) categories is particularly noteworthy, underscoring its effectiveness in segmenting materials with distinct textural properties. The overall mean IoU of 51.7 places our model competitively within the landscape of current methods, narrowly ahead of the CMNeXt model, which exhibits a slightly lower mean IoU of 51.5.
Table II: Per-class IoU (%) on the MCubeS dataset.
Methods | Asphalt | Concrete | Metal | Road marking | Fabric | Glass | Plaster | Plastic | Rubber | Sand | Gravel | Ceramic | Cobblestone | Brick | Grass | Wood | Leaf | Water | Human | Sky | Mean |
MCubeSNet [16] | 85.7 | 42.6 | 47.0 | 59.2 | 12.5 | 44.3 | 3.0 | 10.6 | 12.7 | 66.8 | 67.1 | 27.8 | 65.8 | 36.8 | 54.8 | 39.4 | 73.0 | 13.3 | 0.0 | 94.8 | 42.9 |
CMNeXt [15]* | 84.3 | 44.9 | 53.9 | 74.5 | 32.3 | 54.0 | 0.8 | 28.3 | 29.7 | 67.7 | 66.5 | 27.7 | 68.5 | 42.9 | 58.7 | 49.7 | 75.4 | 55.7 | 18.9 | 96.5 | 51.5 |
U3M (Ours) | 86.2 | 44.5 | 55.1 | 68.0 | 33.4 | 53.9 | 1.2 | 26.3 | 26.9 | 68.2 | 71.7 | 25.0 | 73.5 | 44.1 | 59.7 | 47.4 | 76.4 | 63.8 | 12.2 | 96.4 | 51.7 |
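As a worked check of how the Mean column relates to the per-class values, the short script below averages the 20 class IoUs from the U3M and CMNeXt rows of the per-class MCubeS table above. The numbers are copied from those rows; the variable names are illustrative.

```python
# Per-class IoU (%) in table order: Asphalt ... Sky (20 material classes).
u3m = [86.2, 44.5, 55.1, 68.0, 33.4, 53.9, 1.2, 26.3, 26.9, 68.2,
       71.7, 25.0, 73.5, 44.1, 59.7, 47.4, 76.4, 63.8, 12.2, 96.4]
cmnext = [84.3, 44.9, 53.9, 74.5, 32.3, 54.0, 0.8, 28.3, 29.7, 67.7,
          66.5, 27.7, 68.5, 42.9, 58.7, 49.7, 75.4, 55.7, 18.9, 96.5]

u3m_mean = sum(u3m) / len(u3m)
cmnext_mean = sum(cmnext) / len(cmnext)

if __name__ == "__main__":
    print(f"U3M mean IoU:    {u3m_mean:.2f}")
    print(f"CMNeXt mean IoU: {cmnext_mean:.2f}")
    print(f"difference:      {u3m_mean - cmnext_mean:.2f}")
```

Both averages agree with the reported Mean column after rounding to one decimal place, which confirms that the headline mIoU is the unweighted mean over the 20 material classes.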
Table III: Performance comparison on the FMB dataset.
Methods | Modalities | % mIoU |
GMNet [51] | RGB-Infrared | 49.2 |
LASNet [52] | RGB-Infrared | 42.5 |
EGFNet [41] | RGB-Infrared | 47.3 |
FEANet [42] | RGB-Infrared | 46.8 |
DIDFuse [43] | RGB-Infrared | 50.6 |
ReCoNet [44] | RGB-Infrared | 50.9 |
U2Fusion [53] | RGB-Infrared | 47.9 |
TarDAL [54] | RGB-Infrared | 48.1 |
SegMiF [9] | RGB-Infrared | 54.8 |
U3M (Ours) | RGB-Infrared | 60.8 |
Table IV: Per-class IoU (%) comparison on the FMB dataset.
Methods | Modalities | Car | Person | Truck | T-Lamp | T-Sign | Building | Vegetation | Pole | % mIoU |
GMNet [51] | RGB-Infrared | 79.3 | 60.1 | 22.2 | 21.6 | 69.0 | 79.1 | 83.8 | 39.8 | 49.2 |
LASNet [52] | RGB-Infrared | 72.6 | 48.6 | 14.8 | 2.9 | 59.0 | 75.4 | 81.6 | 36.7 | 42.5 |
EGFNet [41] | RGB-Infrared | 77.4 | 63.0 | 17.1 | 25.2 | 66.6 | 77.2 | 83.5 | 41.5 | 47.3 |
FEANet [42] | RGB-Infrared | 73.9 | 60.7 | 32.3 | 13.5 | 55.6 | 79.4 | 81.2 | 36.8 | 46.8 |
DIDFuse [43] | RGB-Infrared | 77.7 | 64.4 | 28.8 | 29.2 | 64.4 | 78.4 | 82.4 | 41.8 | 50.6 |
ReCoNet [44] | RGB-Infrared | 75.9 | 65.8 | 14.9 | 34.7 | 66.6 | 79.2 | 81.3 | 44.9 | 50.9 |
U2Fusion [53] | RGB-Infrared | 76.6 | 61.9 | 14.4 | 6.8 | 68.9 | 78.8 | 82.2 | 42.2 | 47.9 |
TarDAL [54] | RGB-Infrared | 74.2 | 56.0 | 18.3 | 7.8 | 69.0 | 79.1 | 81.7 | 41.9 | 48.1 |
SegMiF [9] | RGB-Infrared | 78.3 | 65.4 | 18.8 | 6.5 | 64.8 | 78.0 | 85.0 | 49.8 | 54.8 |
U3M (Ours) | RGB | 83.3 | 57.6 | 41.6 | 42.7 | 78.3 | 81.7 | 85.6 | 49.5 | 60.5 |
U3M (Ours) | RGB-Infrared | 82.3 | 66.0 | 41.9 | 46.2 | 81.0 | 81.3 | 86.8 | 48.8 | 60.8 |
Table VI: Ablation of fusion module combinations on the MCubeS dataset.
Methods | Modalities | % mIoU |
Linear Layer Only | RGB-ADN | 49.89 |
Linear Layer + ChannelAttention | RGB-ADN | 50.34 |
Linear Layer + PSPModule | RGB-ADN | 50.62 |
U3M | RGB-ADN | 51.69 |
2) Results on FMB Dataset. The results presented in Table III for the performance comparison on the FMB dataset highlight the capabilities of various semantic segmentation models utilizing RGB-Infrared modalities, a combination critical for robust scene understanding under varying illumination conditions. Notably, our model, U3M, achieves an mIoU score of 60.8, which surpasses all other models listed. This superior performance can be attributed to the effective integration of RGB and infrared data, which allows U3M to robustly capture and utilize the complementary information provided by these modalities. Infrared imaging, known for its utility in low-light conditions and its ability to differentiate objects based on thermal properties, combined with the rich detail available in RGB images, provides a more comprehensive understanding of the scene, enhancing segmentation accuracy. The closest competitors, SegMiF and ReCoNet, achieve mIoU scores of 54.8 and 50.9, respectively, which indicates that while these models are effective, there remains a gap of at least 6.0 points compared to U3M. We attribute this gap to the unbiased, multiscale fusion strategy that U3M applies at every encoder stage.
Table IV provides a detailed per-class IoU analysis for models tested on the FMB dataset, utilizing both RGB and RGB-Infrared modalities. The data demonstrates the superior performance of our model across the majority of categories, with particularly high IoU scores in ’Building’ (81.3) and ’Vegetation’ (86.8), illustrating its robustness in identifying and segmenting structural and natural elements in urban environments. However, ’Truck’ remains the weakest listed category for our model (41.9), even though this score is still the highest among the compared methods, indicating potential challenges in distinguishing larger vehicles, possibly due to their similar spectral signatures with other objects or insufficient training data representing this category. In the ’Pole’ category, our model (48.8) trails SegMiF (49.8) slightly. Our model’s overall mIoU of 60.8 is the highest among the listed methods, confirming its efficacy in integrating RGB and infrared data to enhance segmentation accuracy. The integration of infrared data is particularly beneficial in improving the model’s performance under varying lighting conditions, as infrared provides consistent recognition capabilities that are less susceptible to variations in visible light. The segmentation performance in the ’Traffic Lamp’ (46.2) and ’Traffic Sign’ (81.0) classes is the highest among the listed methods, suggesting that the model effectively utilizes the infrared component to detect these objects, which are typically characterized by distinct material properties not always apparent in RGB imagery.
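To see where the infrared modality helps most, the following script subtracts the U3M RGB-only row from the U3M RGB-Infrared row of the per-class FMB table. The values are copied from those two rows; the dictionary names are illustrative.

```python
classes = ["Car", "Person", "Truck", "T-Lamp", "T-Sign", "Building", "Vegetation", "Pole"]
rgb_only = [83.3, 57.6, 41.6, 42.7, 78.3, 81.7, 85.6, 49.5]
rgb_infrared = [82.3, 66.0, 41.9, 46.2, 81.0, 81.3, 86.8, 48.8]

# Per-class change in IoU (percentage points) when infrared is added.
gains = {
    name: round(fused - rgb, 1)
    for name, rgb, fused in zip(classes, rgb_only, rgb_infrared)
}
largest_gain = max(gains, key=gains.get)

if __name__ == "__main__":
    for name, delta in gains.items():
        print(f"{name:<11}{delta:+.1f}")
    print("largest gain:", largest_gain)
```

Among the listed classes, the largest gain is for ’Person’, while ’Car’, ’Building’, and ’Pole’ decrease slightly. Note that the table lists only a subset of the 15 FMB categories, so these deltas do not sum to the overall mIoU difference.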
IV-D Visualization
1) Segmentation result visualization on FMB dataset. The visualization in Figure 6 compellingly illustrates the effectiveness of multimodal semantic segmentation using RGB and thermal (RGB-T) data on the FMB dataset. The comparative analysis of RGB-only versus RGB-T segmentation underscores the limitations of relying solely on visible light data, particularly under adverse lighting conditions. The enhanced segmentation accuracy achieved with RGB-T highlights the thermal modality’s critical role in distinguishing and classifying various elements within urban scenes, reinforcing the necessity for incorporating multimodal data in segmentation tasks to tackle diverse environmental challenges. This analysis confirms the hypothesis that thermal data significantly boosts the segmentation capability, particularly in detecting living entities and differentiating them from inanimate backgrounds, thereby providing a more reliable and robust segmentation framework for real-world applications.
2) t-SNE visualization on FMB dataset. The t-SNE visualizations in Fig. 8 underscore the significant impact of integrating thermal imagery with RGB data on the feature separability and overall effectiveness of semantic segmentation models. The enhanced separation and definition of clusters with RGB-T data corroborate the hypothesis that thermal information aids in resolving ambiguities encountered with RGB-only data, especially in complex segmentation scenarios. This analysis not only highlights the value of multimodal approaches in improving semantic segmentation tasks but also emphasizes the need for models that can effectively leverage diverse data types to achieve more accurate and robust segmentation results. Such insights are critical for advancing the development of segmentation technologies for applications in areas such as autonomous driving, environmental monitoring, and urban planning, where accurate material and object differentiation are paramount.
3) Segmentation result visualization on MCubeS dataset. The segmentation visualizations in Fig. 7 underscore the incremental benefits of integrating multiple data modalities into semantic segmentation tasks on the MCubeS dataset. Each added modality (AoLP, DoLP, and NIR) contributes uniquely to the segmentation accuracy, addressing specific failure cases encountered with the RGB-only model. The results show that while RGB data provides a foundational layer of visual information, the incorporation of polarization and near-infrared data enriches the feature set available for segmentation, thereby enabling more precise and contextually aware delineations of urban scene elements. This multimodal approach illustrates the potential for such technologies to be applied in real-world scenarios where diverse environmental factors and varied object interactions complicate the accurate interpretation of urban spaces.
IV-E Ablation Experiment
1) Ablation experiment on different modalities. Table V reports the mIoU (%) on the Multimodal Material Segmentation (MCubeS) dataset for models using different modality combinations: RGB, RGB-A, RGB-A-D, and RGB-A-D-N. Our model improves steadily as modalities are added. With RGB alone, it achieves an mIoU of 49.22, compared with 33.70 for MCubeSNet and 48.16 for CMNeXt. With AoLP added (RGB-A), our model scores 49.89 against 48.42 for CMNeXt, the largest margin over CMNeXt among the four configurations (1.47 points). With both AoLP and DoLP (RGB-A-D), our model achieves 50.26, compared with 49.48 for CMNeXt. The full combination of RGB, AoLP, DoLP, and NIR (RGB-A-D-N) gives the peak mIoU of 51.69, well above MCubeSNet's 42.86 and marginally above CMNeXt's 51.54. The improvement observed with each additional modality supports the value of multimodal data in capturing a richer feature set for material segmentation. Adding NIR produces the largest single gain for our model (from 50.26 to 51.69), likely because NIR offers material cues that are less dependent on visible-light conditions.
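As a rough check on the figures quoted above, the following Python sketch recomputes the margin of our model over CMNeXt and the gain contributed by each added modality. The dictionary layout and the function names are illustrative assumptions; only the mIoU values are taken from the text.

```python
# mIoU values (%) on MCubeS as quoted in the text (Table V).
# "U3M" denotes our model; the data layout is an assumption for illustration.
miou = {
    "RGB":       {"MCubeSNet": 33.70, "CMNeXt": 48.16, "U3M": 49.22},
    "RGB-A":     {"CMNeXt": 48.42, "U3M": 49.89},
    "RGB-A-D":   {"CMNeXt": 49.48, "U3M": 50.26},
    "RGB-A-D-N": {"MCubeSNet": 42.86, "CMNeXt": 51.54, "U3M": 51.69},
}


def margins_over(baseline, scores=miou):
    """Margin of U3M over `baseline` for every configuration that reports both."""
    return {
        config: round(values["U3M"] - values[baseline], 2)
        for config, values in scores.items()
        if baseline in values
    }


def incremental_gains(model, scores=miou):
    """mIoU gain of `model` between consecutive modality configurations."""
    configs = [c for c in scores if model in scores[c]]
    return {
        f"{prev} -> {curr}": round(scores[curr][model] - scores[prev][model], 2)
        for prev, curr in zip(configs, configs[1:])
    }


if __name__ == "__main__":
    print("Margin over CMNeXt:", margins_over("CMNeXt"))
    print("Gain per added modality (U3M):", incremental_gains("U3M"))
```

On these values the margin over CMNeXt ranges from 0.15 points (RGB-A-D-N) to 1.47 points (RGB-A), and the step from RGB-A-D to RGB-A-D-N is the largest single gain for our model (1.43 points).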
2) Ablation experiment on different module combinations. Table VI compares segmentation performance on the MCubeS dataset for different module configurations within the same framework, showing how each module affects the mIoU. The baseline configuration, which uses a Linear Layer only, achieves an mIoU of 49.39 and serves as the benchmark for evaluating the additional modules. Incorporating a Channel Attention mechanism improves the mIoU to 50.34. Channel Attention likely helps the model focus on relevant features by re-weighting channel-specific features, thus providing a more refined feature map for segmentation. Adding a Pyramid Scene Parsing (PSP) module brings the mIoU to 50.62, which is 1.23 points above the baseline. The PSP module aggregates context information at different scales, and this result suggests that it enhances the model's ability to capture and integrate the multi-scale context that accurate segmentation requires. Our full model, U3M, evaluated on the RGB-ADN input (RGB together with the AoLP, DoLP, and NIR modalities), achieves the highest mIoU of 51.69. This indicates that U3M makes effective use of the combination of RGB with the three additional modalities, and it reflects the model's ability to discriminate between material types and conditions in a complex dataset such as MCubeS.
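To make the channel re-weighting idea concrete, here is a minimal NumPy sketch of a squeeze-and-excitation style channel attention [46]. It is an illustration under stated assumptions, not the module used in U3M: the function name, the reduction ratio r, and the random weights w1 and w2 (standing in for learned parameters) are all hypothetical.

```python
import numpy as np


def channel_attention(features, w1, w2):
    """SE-style channel re-weighting for a (C, H, W) feature map.

    w1 has shape (C // r, C) and w2 has shape (C, C // r). Both are
    hypothetical weights standing in for learned parameters.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = features.mean(axis=(1, 2))            # shape (C,)
    # Excitation: bottleneck MLP with ReLU, followed by a sigmoid gate.
    hidden = np.maximum(w1 @ squeezed, 0.0)          # shape (C // r,)
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # shape (C,), values in (0, 1)
    # Re-weight: scale every channel by its gate value via broadcasting.
    return features * weights[:, None, None], weights


# Toy input: 8 channels on a 4x4 grid, reduction ratio r = 2 (assumed values).
rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
features = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1

reweighted, weights = channel_attention(features, w1, w2)
print(reweighted.shape)  # (8, 4, 4)
```

The output keeps the input shape; only the relative magnitude of each channel changes, which is the sense in which the mechanism emphasizes some channels over others.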
V Conclusion and Limitation
1) Conclusion. The proposed U3M advances multimodal semantic segmentation through its modality integration and feature fusion design. It addresses modal bias with an unbiased multiscale modal fusion methodology that treats all modalities equally, thereby reducing the bias introduced by hand-designed, modality-specific architectures. The model uses multiscale fusion modules that combine convolutional and pooling strategies, integrating modalities at various scales and improving adaptability across diverse environments. Extensive experiments on challenging datasets show that U3M consistently surpasses existing models in accuracy and robustness, indicating its suitability for real-world applications such as autonomous driving and urban planning.
2) Limitation and future work. Future work could focus on optimizing the model’s architecture for greater efficiency. Additionally, integrating more varied modalities and extensive real-world testing could broaden its application scope, ensuring it meets the evolving demands of practical implementations in areas like autonomous driving and urban planning.
References
- [1] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2017.
- [2] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 801–818.
- [3] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “Segformer: Simple and efficient design for semantic segmentation with transformers,” Advances in neural information processing systems, vol. 34, pp. 12 077–12 090, 2021.
- [4] Y. Liao, J. Xie, and A. Geiger, “Kitti-360: A novel dataset and benchmarks for urban scene understanding in 2d and 3d,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 3, pp. 3292–3310, 2022.
- [5] S. Dong, W. Zhou, C. Xu, and W. Yan, “Egfnet: Edge-aware guidance fusion network for rgb–thermal urban scene parsing,” IEEE Transactions on Intelligent Transportation Systems, 2023.
- [6] C. Hazirbas, L. Ma, C. Domokos, and D. Cremers, “Fusenet: Incorporating depth into semantic segmentation via fusion-based cnn architecture,” in Computer Vision–ACCV 2016: 13th Asian Conference on Computer Vision, Taipei, Taiwan, November 20-24, 2016, Revised Selected Papers, Part I 13. Springer, 2017, pp. 213–228.
- [7] W. Zhou, S. Dong, M. Fang, and L. Yu, “Cacfnet: Cross-modal attention cascaded fusion network for rgb-t urban scene parsing,” IEEE Transactions on Intelligent Vehicles, 2023.
- [8] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu, “Dual attention network for scene segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 3146–3154.
- [9] J. Liu, Z. Liu, G. Wu, L. Ma, R. Liu, W. Zhong, Z. Luo, and X. Fan, “Multi-interactive feature learning and a full-time multi-modality benchmark for image fusion and segmentation,” in Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 8115–8124.
- [10] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
- [11] Z. Jin, T. Gong, D. Yu, Q. Chu, J. Wang, C. Wang, and J. Shao, “Mining contextual information beyond image for semantic segmentation,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 7231–7241.
- [12] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2881–2890.
- [13] J. Zhang, H. Liu, K. Yang, X. Hu, R. Liu, and R. Stiefelhagen, “Cmx: Cross-modal fusion for rgb-x semantic segmentation with transformers,” IEEE Transactions on Intelligent Transportation Systems, 2023.
- [14] X. Hu, K. Yang, L. Fei, and K. Wang, “Acnet: Attention based network to exploit complementary features for rgbd semantic segmentation,” in 2019 IEEE International conference on image processing (ICIP). IEEE, 2019, pp. 1440–1444.
- [15] J. Zhang, R. Liu, H. Shi, K. Yang, S. Reiß, K. Peng, H. Fu, K. Wang, and R. Stiefelhagen, “Delivering arbitrary-modal semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1136–1147.
- [16] Y. Liang, R. Wakaki, S. Nobuhara, and K. Nishino, “Multimodal material segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 19 800–19 808.
- [17] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- [18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
- [19] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 10 012–10 022.
- [20] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
- [21] X. Dong, J. Bao, D. Chen, W. Zhang, N. Yu, L. Yuan, D. Chen, and B. Guo, “Cswin transformer: A general vision transformer backbone with cross-shaped windows,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 12 124–12 134.
- [22] Y. Yuan, R. Fu, L. Huang, W. Lin, C. Zhang, X. Chen, and J. Wang, “Hrformer: High-resolution transformer for dense prediction,” arXiv preprint arXiv:2110.09408, 2021.
- [23] Q. He, “Prompting multi-modal image segmentation with semantic grouping,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 3, 2024, pp. 2094–2102.
- [24] M. K. Reza, A. Prater-Bennette, and M. S. Asif, “Multimodal transformer for material segmentation,” arXiv preprint arXiv:2309.04001, 2023.
- [25] B. Cao, J. Guo, P. Zhu, and Q. Hu, “Bi-directional adapter for multimodal tracking,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 2, 2024, pp. 927–935.
- [26] M. K. Reza, A. Prater-Bennette, and M. S. Asif, “Multimodal transformer for material segmentation,” arXiv preprint arXiv:2309.04001, 2023.
- [27] Q. Hou, L. Zhang, M.-M. Cheng, and J. Feng, “Strip pooling: Rethinking spatial pooling for scene parsing,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 4003–4012.
- [28] G. Lin, A. Milan, C. Shen, and I. Reid, “Refinenet: Multi-path refinement networks for high-resolution semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1925–1934.
- [29] S. Choi, J. T. Kim, and J. Choo, “Cars can’t fly up in the sky: Improving urban-scene segmentation via height-driven attention networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9373–9383.
- [30] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal, “Context encoding for semantic segmentation,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2018, pp. 7151–7160.
- [31] C. Yu, J. Wang, C. Gao, G. Yu, C. Shen, and N. Sang, “Context prior for scene segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 12 416–12 425.
- [32] Y. Yuan, L. Huang, J. Guo, C. Zhang, X. Chen, and J. Wang, “Ocnet: Object context for semantic segmentation,” International Journal of Computer Vision, vol. 129, no. 8, pp. 2375–2398, 2021.
- [33] X. Li, X. Li, L. Zhang, G. Cheng, J. Shi, Z. Lin, S. Tan, and Y. Tong, “Improving semantic segmentation via decoupled body and edge supervision,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVII 16. Springer, 2020, pp. 435–452.
- [34] S. Borse, Y. Wang, Y. Zhang, and F. Porikli, “Inverseform: A loss function for structured boundary-aware segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 5901–5911.
- [35] B. Cheng, A. Schwing, and A. Kirillov, “Per-pixel classification is not all you need for semantic segmentation,” Advances in neural information processing systems, vol. 34, pp. 17 864–17 875, 2021.
- [36] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, “Masked-attention mask transformer for universal image segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 1290–1299.
- [37] M.-H. Guo, C.-Z. Lu, Z.-N. Liu, M.-M. Cheng, and S.-M. Hu, “Visual attention network,” Computational Visual Media, vol. 9, no. 4, pp. 733–752, 2023.
- [38] Q. Hou, Z. Jiang, L. Yuan, M.-M. Cheng, S. Yan, and J. Feng, “Vision permutator: A permutable mlp-like architecture for visual recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 45, no. 1, pp. 1328–1334, 2022.
- [39] D. Lian, Z. Yu, X. Sun, and S. Gao, “As-mlp: An axial shifted mlp architecture for vision,” arXiv preprint arXiv:2107.08391, 2021.
- [40] M.-H. Guo, C.-Z. Lu, Q. Hou, Z. Liu, M.-M. Cheng, and S.-M. Hu, “Segnext: Rethinking convolutional attention design for semantic segmentation,” Advances in Neural Information Processing Systems, vol. 35, pp. 1140–1156, 2022.
- [41] W. Zhou, S. Dong, C. Xu, and Y. Qian, “Edge-aware guidance fusion network for rgb–thermal scene parsing,” in Proceedings of the AAAI conference on artificial intelligence, vol. 36, no. 3, 2022, pp. 3571–3579.
- [42] F. Deng, H. Feng, M. Liang, H. Wang, Y. Yang, Y. Gao, J. Chen, J. Hu, X. Guo, and T. L. Lam, “Feanet: Feature-enhanced attention network for rgb-thermal real-time semantic segmentation,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 4467–4473.
- [43] Z. Zhao, S. Xu, C. Zhang, J. Liu, P. Li, and J. Zhang, “Didfuse: Deep image decomposition for infrared and visible image fusion,” arXiv preprint arXiv:2003.09210, 2020.
- [44] Z. Huang, J. Liu, X. Fan, R. Liu, W. Zhong, and Z. Luo, “Reconet: Recurrent correction network for fast and efficient multi-modality image fusion,” in European Conference on Computer Vision. Springer, 2022, pp. 539–555.
- [45] P. Li, J. Chen, B. Lin, and X. Xu, “Residual spatial fusion network for rgb-thermal semantic segmentation,” arXiv preprint arXiv:2306.10364, 2023.
- [46] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
- [47] J. Chen, X. Wang, Z. Guo, X. Zhang, and J. Sun, “Dynamic region-aware convolution,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 8064–8073.
- [48] J. Zhou, V. Jampani, Z. Pi, Q. Liu, and M.-H. Yang, “Decoupled dynamic filter networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6647–6656.
- [49] A. Prakash, K. Chitta, and A. Geiger, “Multi-modal fusion transformer for end-to-end autonomous driving,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 7077–7087.
- [50] H. R. V. Joze, A. Shaban, M. L. Iuzzolino, and K. Koishida, “Mmtm: Multimodal transfer module for cnn fusion,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 13 289–13 299.
- [51] W. Zhou, J. Liu, J. Lei, L. Yu, and J.-N. Hwang, “Gmnet: Graded-feature multilabel-learning network for rgb-thermal urban scene semantic segmentation,” IEEE Transactions on Image Processing, vol. 30, pp. 7790–7802, 2021.
- [52] G. Li, Y. Wang, Z. Liu, X. Zhang, and D. Zeng, “Rgb-t semantic segmentation with location, activation, and sharpening,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 3, pp. 1223–1235, 2022.
- [53] H. Xu, J. Ma, J. Jiang, X. Guo, and H. Ling, “U2fusion: A unified unsupervised image fusion network,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 1, pp. 502–518, 2020.
- [54] J. Liu, X. Fan, Z. Huang, G. Wu, R. Liu, W. Zhong, and Z. Luo, “Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 5802–5811.