U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation

Bingyu Li†, Da Zhang†, Zhiyuan Zhao, Junyu Gao, and Xuelong Li

†: Equal Contribution; *Corresponding author: Xuelong Li. Bingyu Li, Zhiyuan Zhao and Xuelong Li are with the Institute of Artificial Intelligence (TeleAI), China Telecom, P. R. China. (E-mail: [email protected]; [email protected]; [email protected]). Da Zhang and Junyu Gao are with the School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Northwestern Polytechnical University, Xi’an 710072, China and with the Institute of Artificial Intelligence (TeleAI), China Telecom, P. R. China. (E-mail: [email protected]; [email protected]).
Abstract

Multimodal semantic segmentation is a pivotal component of computer vision and typically surpasses unimodal methods by utilizing rich information from various sources. Current models frequently adopt modality-specific frameworks that are inherently biased toward certain modalities. Although these biases might be advantageous in specific situations, they generally limit the adaptability of the models across different multimodal contexts, thereby potentially impairing performance. To address this issue, we leverage the inherent capabilities of the model itself to discover the optimal equilibrium in multimodal fusion and introduce U3M: An Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation. Specifically, this method involves an unbiased integration of multimodal visual data. Additionally, we employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features. Experimental results demonstrate that our approach achieves superior performance across multiple datasets, verifying its efficacy in enhancing the robustness and versatility of semantic segmentation in diverse settings. Our code is available at U3M-multimodal-semantic-segmentation.

Index Terms:
Semantic Segmentation, Multi-Modality, Multi-Scale Fusion, Unbiased Modality Fusion

I Introduction

Semantic segmentation [1, 2, 3] is a crucial task within the field of computer vision, with applications spanning various domains such as scene understanding [4, 5, 6, 7] and autonomous driving [8, 9]. RGB-based semantic segmentation (depicted in Fig. 1(a)), serving as a foundational task, is suitable for the analysis of most scenes and has been extensively explored by the research community, yielding many impressive works and experimental results [10, 11, 12]. However, employing only RGB channels to segment certain complex and special scenes presents challenges, particularly where RGB information is elusive [13, 5]. In contrast to RGB cameras, which depend on visible light and often falter in darkness, thermal infrared (TIR) sensors directly detect heat emissions from objects, offering substantial contrast in the absence of light [9, 7]. Additionally, other modalities such as depth [6, 14] and LiDAR [15, 16] can also provide additional visual semantic information and are increasingly being integrated into semantic segmentation efforts, as illustrated in Fig. 1(b).

Figure 1: The Evolution of Multimodal Semantic Segmentation Model Architectures. (a) Training a feature extractor using only RGB images. (b) Sharing a trainable feature extractor between RGB images and other modalities. (c) Up: Sharing a fine-tunable pre-trained feature extractor between RGB images and other modalities. Down: Fine-tuning adapters for different modalities with one frozen feature extractor. (d) Up: Single-scale feature fusion within the model. Down: Multiscale feature fusion.

In the context of multimodal semantic segmentation, multimodal feature extraction, serving as the foundation for downstream tasks, has been extensively explored in numerous studies [1, 10]. Although these models have demonstrated exceptional performance in certain scenarios and datasets, they are typically trained from scratch on specific small-scale datasets. The limited number of images and scenes in multimodal visual training datasets leads to poor generalization across different scenes and data, let alone better feature extraction capabilities. In contrast to the aforementioned models trained from scratch, recent pretrained models, mostly based on Convolutional Neural Networks (CNNs) [17] and Transformers [18], have shown strong performance on downstream tasks by building models with large-scale parameters and pretraining on vast amounts of visual images. The rise of large-scale visual pretraining models has led to a series of studies exploring their applications in semantic segmentation. For instance, a large-scale pretrained convolutional neural network [17] demonstrated excellent semantic segmentation capabilities and performance by pretraining on ImageNet. Building on this well-established pretrained model, [14] introduced ACNet, which refines feature extraction through asymmetric convolution blocks, enhancing network accuracy and robustness. Compared with CNN-based pretrained models, Transformers offer a larger receptive field [18] and stronger global modeling capabilities. Recent research [19, 20] using plain vision Transformers for information extraction struggles with multiscale semantic information. To address this, Transformers with moving and multiscale windows have been explored [21, 22]. Multiscale information extractors have proven more practical. To balance speed and accuracy, we use the hierarchical multiscale vision Transformer Segformer [3] as the modal information extractor.

Although extracted by large-scale pretrained models with strong visual information extraction capabilities, visual multimodal semantic information still presents some domain gaps. To bridge this, novel fine-tuning strategies using prompting techniques have been proposed [23], efficiently adapting pretrained models to multimodal settings with minimal updates. Another approach involves using pretrained models as modality extractors, followed by fine-tuning with limited multimodal data, as shown in Fig. 1(c). However, this often leads to catastrophic forgetting. Therefore, designing an additional feature fusion module is receiving increasing attention. For example, [8] designed an additional dual attention network that complements the pretrained feature extractor only at the low-resolution feature map (Fig. 1(d)), which does not affect the original pretrained model. However, using single-scale information presents problems such as weak local spatial information and low input resolution. To address these issues, FuseNet [6] integrates multiscale RGB-Depth multimodal features at every encoder stage, effectively improving performance on the SUN RGB-D benchmark. Similarly, CACFNet [7] incorporates a cross-modal attention fusion module that extracts multiscale RGB-thermal information. In addition, extensive research [15, 24] has demonstrated that multiscale modal fusion modules (as shown in Fig. 1(d)) enhance the adaptive integration of multimodal information.

Despite the excellent results achieved by existing work, particularly [13, 24], most of the multimodal fusion methods discussed above, notably CMNEXT [15], exhibit some modality bias. Specifically, these methods typically prioritize one modality as dominant and treat others as auxiliary (generally in the format of RGB+Xs) [25]. However, these approaches overlook the dynamic dominant correlation within multimodal data [25, 26], which hampers the ability to fully utilize complementary multimodal information in complex scenarios (as illustrated in Fig. 2), thus limiting performance.

To avoid the inherent modal bias in model design and facilitate the fusion of multiscale modal information, we first introduce an Unbiased Multiscale Modal Fusion Model (U3M), which, unlike existing methods, treats all modalities equitably and performs multiscale fusion. This strategy enables the model to autonomously generate modal preferences applicable to various segmentation scenarios. Secondly, based on this fusion paradigm, we develop two multiscale fusion modules utilizing multiscale pooling and convolution, which effectively integrate global and local multimodal information across different scales. Finally, the model’s efficacy is validated by outstanding results across multiple datasets. Our contributions are summarized as follows:

  • We demonstrate the presence of modal bias in model design and develop an unbiased modal fusion methodology. This approach leverages the inherent properties of the model to autonomously generate modality preferences, substantially reducing biases introduced by manual design interventions.

  • We design a multiscale modal fusion layer that incorporates multiscale convolution and pooling. In this way, the fusion layer enhances the capability for multiscale modal fusion throughout various feature encoding stages.

  • Comprehensive experiments are conducted on two challenging multimodal semantic segmentation benchmarks (i.e., FMB [9], Mcubes [16]), demonstrating that our method achieves state-of-the-art performance.

Figure 2: The dynamic dominant correlation of multimodal data in different scenes. Left: Under conditions of insufficient light, infrared images can capture more intricate details than RGB images. Middle: In outdoor situations where light is abundant, infrared images tend to lose more details, while RGB images showcase their superiority. Right: In certain common instances, the detailed information in infrared images and RGB images can serve as a complement to each other.

II Related Works

II-A Semantic Segmentation

The field of semantic segmentation has seen significant advancements, particularly with the advent of fully convolutional networks that revolutionized pixel classification [10]. Based on this, some improvements such as multi-scale feature extraction and fusion have proven effective [1, 27]. For instance, Chen et al. further developed DeepLab by integrating an encoder-decoder architecture for efficient multi-scale context aggregation [2]. Similarly, Zhao et al. introduced a pyramid pooling module to aggregate context from varied regions at multiple scales [12].

Channel and self-attention mechanisms have evolved to capture more global semantic information. Lin et al. developed RefineNet, employing multi-path refinement networks with channel attention for high-resolution segmentation [28]. Choi et al. proposed CARS, emphasizing channel-wise attention for region-based segmentation [29]. Another paradigm, context-based refinement, integrates extensive background contextual information [11, 30]. Some other context-based models leverage context to refine segmentation through adaptive feature recalibration [31], while [32] developed OCNet to enhance semantic understanding by aggregating object context.

Edge detection techniques also serve as complementary cues for semantic segmentation. For example, Li et al. improved edge detection using deep learning [33], and Borse et al. proposed InverseForm to enhance edge detection accuracy in complex images [34]. The recent adoption of vision transformers for recognition tasks has led to the development of dense prediction transformers specialized for semantic segmentation. Notable examples include CSWin Transformer, which captures long-range dependencies using cross-shaped windows [21], and HRFormer, which integrates high-resolution representations for detailed segmentation [22]. These advancements have facilitated the segmentation of discrete objects and amorphous regions [35, 36], with transformers now incorporating token mixing via attention mechanisms [37], multi-layer perceptron elements [38, 39], and pooling and convolutional blocks [40].

Despite setting new benchmarks in image segmentation, challenges persist, particularly under real-life conditions where RGB images are inadequate, such as in low-light environments or when capturing fast-moving subjects. Consequently, multimodal semantic segmentation is garnering increasing attention.

II-B Multimodality Semantic Segmentation

Multimodal semantic segmentation is increasingly recognized for its ability to integrate diverse modal data, effectively compensating for the inherent limitations of each modality [15, 13, 16]. Zhou et al. [41] integrated edge-aware features for enhanced semantic segmentation of RGB and thermal images, improving object boundary delineation in multimodal scenarios. Deng et al. [42] incorporated an attention mechanism that boosts feature representation for real-time RGB-thermal semantic segmentation. Zhao et al. [43] utilized deep image decomposition to fuse infrared and visible images, preserving critical features from both modalities. Huang et al. [44] employed a recurrent network to iteratively refine multi-modality image fusion, enhancing detail preservation and reducing artifacts. Additionally, in RGB-depth modality fusion, ACNet [14] employs an attention mechanism to optimize the usage of RGB and depth data for improved semantic segmentation accuracy, especially in scenarios with complex visibility. FuseNet [6] integrates RGB and depth data using a dual-stream CNN, leveraging depth as an auxiliary input to enhance segmentation accuracy. The evolution of model frameworks has progressed from those based on CNN [1, 2] to those founded on Transformers [3, 40, 21]. This transition facilitates a more nuanced analysis of the interplay between global semantics and local features, enhancing feature extraction. For modal fusion, some techniques employ attention mechanisms to integrate different modalities [14]. For example, CACFNet [7] utilizes cross-attention mechanisms to selectively enhance the integration of contextual information from different modalities to improve semantic correlation and feature extraction efficiency. Other approaches employ convolution as a feature fusion extraction module [45]; Reza et al. [26] introduced specialized convolution layers designed to process and merge information from multiple modalities effectively. 
EGFNet [5] uses gated convolution to selectively fuse the most relevant features from diverse modalities, improving fusion effectiveness. Apart from that, Zhang et al. [15] innovated by adapting pooling strategies to multimodal contexts, optimizing feature reduction and abstraction processes to better accommodate the diverse characteristics of different data types. Despite these advancements, many existing modal fusion models rely excessively on RGB images, often involving specialized feature extractors for the RGB channels. This strategy risks neglecting the variable significance of different modalities across various scenarios. To address this, we propose the Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.

III Method

Figure 3: Unbiased Multiscale Modal Fusion Model. Segformer [3] with frozen parameters serves as the feature extractor. Each modality’s information is fed into its respective feature extractor and divided into four distinct scales for unbiased fusion of multiscale information. Each feature fusion layer comprises two modules based on multiscale pooling and convolution, adaptively extracting features at varied scales. In the end, the multiscale information is concatenated and fed into a shared semantic segmentation head to generate segmentation results.

In this section, we first present the problem definition, followed by a detailed presentation of the proposed U3M.

III-A Problem Definition

Multimodal semantic segmentation seeks to assign a semantic label to each pixel in an image from predefined categories, such as ’car’, ’tree’, or ’road’. Distinct from traditional semantic segmentation which depends on a single modality, typically RGB images, multimodal semantic segmentation leverages data from various sources or sensors to improve the accuracy and robustness of segmentation.

We are given a set $\mathcal{I}=\{I_{1},I_{2},\dots,I_{M}\}$ of images, where $I_{m}\in\mathbb{R}^{H\times W\times C_{m}}$ represents the $m$-th modality out of $M$ total modalities. Here, $H$, $W$, and $C_{m}$ denote the height, width, and channel size of the image, respectively. The objective of multimodal semantic segmentation is to compute a segmentation map $S\in\mathcal{P}^{H\times W}$, where $\mathcal{P}=\{1,2,\dots,N\}$ denotes the set of $N$ semantic labels.

Each encoder $\text{Enc}_{m}:\mathbb{R}^{H\times W\times C_{m}}\rightarrow\mathbb{R}^{H'\times W'\times C_{f}}$ processes the corresponding modality $I_{m}$, yielding feature maps $F_{m}$. These are fused by $\text{Fusion}:(\mathbb{R}^{H'\times W'\times C_{f}})^{M}\rightarrow\mathbb{R}^{H'\times W'\times F'}$ to produce a combined representation $F$ ($C_{f}=F'$ in this paper). Finally, $S$ is obtained through a decoder $\text{Dec}:\mathbb{R}^{H'\times W'\times F'}\rightarrow\mathcal{P}^{H\times W}$.
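The three mappings above can be traced at the level of array shapes with a small NumPy sketch. Everything inside the functions is a toy stand-in chosen for illustration (average pooling plus a random projection for the encoder, a plain mean for the fusion, a linear classifier with nearest upsampling for the decoder); none of it is the architecture described in this paper, and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_encoder(image, weight, stride=4):
    """Stand-in for Enc_m: average-pool by `stride`, then project channels."""
    h, w, c = image.shape
    hp, wp = h // stride, w // stride
    # Average over non-overlapping stride x stride windows.
    pooled = image[:hp * stride, :wp * stride].reshape(hp, stride, wp, stride, c).mean(axis=(1, 3))
    return pooled @ weight  # (H', W', C_f)

def toy_fusion(features):
    """Stand-in for Fusion: every modality is treated identically (plain mean)."""
    return np.mean(np.stack(features, axis=0), axis=0)  # (H', W', F') with F' = C_f

def toy_decoder(fused, weight, out_hw):
    """Stand-in for Dec: per-pixel class scores, nearest upsampling, argmax."""
    scores = fused @ weight  # (H', W', N)
    h, w = out_hw
    rows = np.arange(h) * fused.shape[0] // h
    cols = np.arange(w) * fused.shape[1] // w
    upsampled = scores[rows][:, cols]  # (H, W, N)
    return upsampled.argmax(axis=-1) + 1  # labels in {1, ..., N}

H, W, C_f, N = 32, 48, 8, 5
channels = [3, 1]  # assumed example: RGB plus one single-channel modality (M = 2)
images = [rng.random((H, W, c)) for c in channels]
enc_weights = [rng.standard_normal((c, C_f)) for c in channels]
dec_weight = rng.standard_normal((C_f, N))

features = [toy_encoder(img, wgt) for img, wgt in zip(images, enc_weights)]
fused = toy_fusion(features)
S = toy_decoder(fused, dec_weight, (H, W))
print(features[0].shape, fused.shape, S.shape)
```

The point of the sketch is only the data flow: modalities with different channel counts $C_m$ are mapped to a common feature shape, fused into one tensor of the same shape, and decoded back to a label map at the input resolution.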

The effectiveness of a multimodal semantic segmentation model is typically evaluated based on its ability to accurately segment objects and scenes under varying conditions, making use of the additional information provided by the various modalities to overcome the limitations of single-modality segmentation.

III-B Overall Model Architecture

The schema of our model’s architecture, which accommodates $M$ discrete modalities, is presented in Fig. 3. The architecture uses modality-specific encoders, which are tailored to distill unique feature hierarchies from each modality, encapsulated by the equation:

$$F_{m}=\text{Enc}_{m}(I_{m}), \tag{1}$$

where $I_{m}\in\mathbb{R}^{H\times W\times C_{m}}$ represents the modality-specific input imagery for $m\in\{1,2,\ldots,M\}$, and $\text{Enc}_{m}(\cdot)$ is the corresponding encoder. These encoders generate feature maps at reduced resolutions, namely $\frac{1}{4},\frac{1}{8},\frac{1}{16},\frac{1}{32}$ of the initial resolution, collectively represented as $F_{m}=\{F_{m}^{1},F_{m}^{2},F_{m}^{3},F_{m}^{4}\}$. For brevity, the feature map’s dimensions at the $i$-th encoding stage are denoted as $(H'_{i}\times W'_{i}\times C_{fi})$, with $i\in\{1,2,3,4\}$. Four fusion blocks, one per encoder stage, merge the features from each encoding stage. The features $F^{i}_{m}$ from every modality are combined within the $i$-th fusion module:

$$F^{i}=\text{FusionBlock}^{i}(\{F^{i}_{m}\}_{m}). \tag{2}$$

This integrative process yields a fused feature set $F=\{F^{1},F^{2},F^{3},F^{4}\}$, with $F^{i}$ signifying the fused feature at the $i$-th stage. Finally, these aggregated features $F$ are input into an MLP decoder, as described in [3], to predict the segmentation map.
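As a quick shape-level illustration of Eqs. (1) and (2), the sketch below computes the per-stage feature sizes for a hypothetical 480×640 input and checks that each stage receives identically shaped features from every modality before fusion. The channel widths and the input size are illustrative placeholders, not values taken from this paper, and `fuse_stage` only tracks shapes.

```python
def stage_shapes(height, width, channels=(32, 64, 160, 256), strides=(4, 8, 16, 32)):
    """Return (H'_i, W'_i, C_fi) for each of the four encoder stages."""
    return [(height // s, width // s, c) for s, c in zip(strides, channels)]

def fuse_stage(per_modality_shapes):
    """Shape-level stand-in for FusionBlock^i: M same-shaped inputs, one output."""
    first = per_modality_shapes[0]
    assert all(shape == first for shape in per_modality_shapes), "modalities must align"
    return first

M = 2  # assumed example: RGB plus one extra modality
per_modality = [stage_shapes(480, 640) for _ in range(M)]  # F_m for m = 1..M
fused = [fuse_stage([per_modality[m][i] for m in range(M)]) for i in range(4)]
for i, shape in enumerate(fused, start=1):
    print(f"F^{i}: {shape}")
```

Because every fusion block maps $M$ inputs of one shape to a single output of that same shape, the decoder sees the same four-scale pyramid it would receive from a single-modality encoder.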

III-C Modality Feature Encoder

Using the mix transformer encoder of [3], our system extracts hierarchical features from various input data types. Every image $I_{m}$ undergoes a patch embedding process, being divided into $4\times 4$ patches as per the method described in [3], before being processed by the mix transformer encoder units. Fig. 3 illustrates the mix transformer unit’s structure, with $X^{i}_{m}$ denoting the $m$-th modal input for the $i$-th mix transformer, which lies in $\mathbb{R}^{H_{i}\times W_{i}\times C_{i}}$. This input is then reshaped into an $N_{i}\times C_{i}$ matrix, where $N_{i}=H_{i}\times W_{i}$, for use as the query $Q$, the key $K$, and the value $V$. To curtail computational demands, we employ spatial reduction as suggested by [3], with a reduction ratio $R$.
The matrices $K$ and $V$ are first reshaped into $\frac{N_{i}}{R}\times(C_{i}\cdot R)$ matrices and then mapped into $\frac{N_{i}}{R}\times C_{i}$ matrices via a linear transformation, following [3]. Subsequently, a conventional Multi-Head Self-Attention (MHSA) mechanism is employed to map $Q,K,V$ into intermediate representations as delineated by:

$$\text{MHSA}(Q,K,V)=\text{Concatenate}(\text{head}_{1},\ldots,\text{head}_{h})W^{O}, \tag{3}$$
$$\text{head}_{j}=\text{Attention}(QW_{j}^{Q},KW_{j}^{K},VW_{j}^{V}). \tag{4}$$

In this context, $h$ signifies the total number of attention heads, with $W_{j}^{Q}$, $W_{j}^{K}$, $W_{j}^{V}$, and $W^{O}$ serving as the respective projection matrices within the spaces $\mathbb{R}^{C_{i}\times d_{k}}$ and $\mathbb{R}^{hd_{v}\times C_{i}}$, and $d_{k}$, $d_{v}$ denoting the dimensions of $K$ and $V$. The Attention function is characterized as:

$$\text{Attention}(Q,K,V)=\text{Softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V. \tag{5}$$

Here, $Q$, $K$, and $V$ correspond to the input query, key, and value matrices, respectively. This MHSA phase is followed by a mixing layer composed of two MLPs and a $3\times 3$ convolutional layer, which provides the necessary positional encoding within the transformer encoder to maximize segmentation efficacy as noted in [3]. The computation within this layer is expressed as:

$$X_{\text{in}}=\text{MHSA}(Q,K,V), \tag{6}$$
$$X_{\text{out}}=\text{MLP}(\text{GELU}(\text{Conv}_{3\times 3}(\text{MLP}(X_{\text{in}}))))+X_{\text{in}}. \tag{7}$$
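Eq. (7) can be read as a short NumPy sketch. Several details are assumptions made for illustration only: each MLP is modelled as a single bias-free linear map, the $3\times 3$ convolution is taken to be depthwise with zero padding, GELU uses the tanh approximation, and the hidden width `C_hidden` is a hypothetical parameter.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def depthwise_conv3x3(x, kernels):
    """Zero-padded depthwise 3x3 convolution. x: (H, W, C), kernels: (3, 3, C)."""
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w] * kernels[dy, dx]
    return out

def mixing_layer(x_in, hw, w1, kernels, w2):
    """Eq. (7): X_out = MLP(GELU(Conv3x3(MLP(X_in)))) + X_in, with X_in of shape (N, C)."""
    h, w = hw
    hidden = x_in @ w1                                  # first MLP: (N, C) -> (N, C_hidden)
    hidden = hidden.reshape(h, w, -1)                   # restore the 2-D grid for the convolution
    hidden = gelu(depthwise_conv3x3(hidden, kernels))   # 3x3 convolution, then GELU
    return hidden.reshape(h * w, -1) @ w2 + x_in        # second MLP plus the residual connection

rng = np.random.default_rng(0)
H_i, W_i, C_i, C_hidden = 6, 8, 16, 64
x_in = rng.standard_normal((H_i * W_i, C_i))
w1 = 0.1 * rng.standard_normal((C_i, C_hidden))
kernels = 0.1 * rng.standard_normal((3, 3, C_hidden))
w2 = 0.1 * rng.standard_normal((C_hidden, C_i))
x_out = mixing_layer(x_in, (H_i, W_i), w1, kernels, w2)
print(x_out.shape)
```

The convolution is the only step that looks at neighbouring tokens, which is how this layer can carry positional information without an explicit positional embedding.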

To conclude, an overlapping patch merging technique is applied to $X_{\text{out}}$ following [3], producing the final output.
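The attention computation of Eqs. (3) to (5), combined with the spatial reduction step, can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: the reduction groups $R$ consecutive tokens into $C_i \cdot R$ channels before a linear projection (the formulation of [3]), one reduction matrix is shared by the keys and values, biases and normalization are omitted, and all weights are random placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Eq. (5): Softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d_k)) @ v

def spatial_reduction(x, ratio, w_reduce):
    """Group `ratio` consecutive tokens, then project back to C channels."""
    n, c = x.shape
    return x.reshape(n // ratio, c * ratio) @ w_reduce  # (N/R, C*R) -> (N/R, C)

def mhsa(x, ratio, w_reduce, w_q, w_k, w_v, w_o):
    """Eqs. (3)-(4) with spatially reduced keys and values."""
    kv = spatial_reduction(x, ratio, w_reduce)
    heads = [attention(x @ w_q[j], kv @ w_k[j], kv @ w_v[j]) for j in range(len(w_q))]
    return np.concatenate(heads, axis=-1) @ w_o  # (N, h*d_v) -> (N, C)

rng = np.random.default_rng(0)
N_i, C_i, R, h = 64, 32, 4, 2
d_k = d_v = C_i // h
x = rng.standard_normal((N_i, C_i))
out = mhsa(
    x, R,
    0.1 * rng.standard_normal((C_i * R, C_i)),
    0.1 * rng.standard_normal((h, C_i, d_k)),
    0.1 * rng.standard_normal((h, C_i, d_k)),
    0.1 * rng.standard_normal((h, C_i, d_v)),
    0.1 * rng.standard_normal((h * d_v, C_i)),
)
print(out.shape)
```

With the reduction in place, the attention matrix has shape $N_i \times \frac{N_i}{R}$ instead of $N_i \times N_i$, which is where the computational saving comes from.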

III-D Pyramidal Multiscale Modal Fusion Layer

Feature fusion after hierarchical extraction is performed via a designated fusion block depicted in Fig. 4. This block integrates features from modality-specific encoders across all four stages. Considering $F^{i}_{m}$ as the input features for the $i$-th block, where $F^{i}_{m}\in\mathbb{R}^{H_{i}\times W_{i}\times C_{i}}$ is sourced from the $m$-th modality, we first concatenate these along the channel axis, obtaining $F'^{i}$. Subsequent reduction and combination of channels is achieved through a linear layer, outputting $\tilde{F}^{i}$ with a reduced channel count of $C_{i}$. This process is mathematically formulated as:

$$\tilde{F}^{i} = \text{Linear}\Big(\big\|_{m=1}^{M} F^{i}_{m}\Big). \tag{8}$$

In this equation, $\big\|$ denotes the concatenation of the modal features along the channel dimension, and the linear layer maps the $MC_i$-dimensional input to a $C_i$-dimensional output.
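The concatenate-then-project step of Eq. (8) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: feature maps are nested lists of shape H x W x C, and `averaging_weight` is a hypothetical stand-in for the learned linear weights, chosen so that each modality contributes equally.

```python
def concat_channels(features):
    """Concatenate M per-modality maps (each H x W x C) along the channel axis."""
    height, width = len(features[0]), len(features[0][0])
    return [
        [sum((f[y][x] for f in features), []) for x in range(width)]
        for y in range(height)
    ]


def linear(fmap, weight, bias):
    """Apply the same linear map (weight: C_out x C_in) to every pixel."""
    return [
        [
            [sum(w * v for w, v in zip(row, pixel)) + b for row, b in zip(weight, bias)]
            for pixel in line
        ]
        for line in fmap
    ]


def averaging_weight(num_modalities, channels):
    """Hypothetical weights (learned in practice): average channel c over all modalities."""
    return [
        [1.0 / num_modalities if j % channels == c else 0.0
         for j in range(num_modalities * channels)]
        for c in range(channels)
    ]


# Two toy modalities with H=1, W=2, C=2.
rgb = [[[1.0, 2.0], [3.0, 4.0]]]
nir = [[[3.0, 6.0], [5.0, 8.0]]]

stacked = concat_channels([rgb, nir])                        # 1 x 2 x (M*C) = 1 x 2 x 4
fused = linear(stacked, averaging_weight(2, 2), [0.0, 0.0])  # 1 x 2 x C   = 1 x 2 x 2
```

Because the projection sees all modalities side by side in one vector, no modality is structurally privileged; any preference between them has to be learned through the weights.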

Figure 4: Multiscale Feature Fusion Module. To enhance the extraction of information across multiple scales, a multiscale feature extractor is proposed.

To enhance the extraction of information across multiple scales, a multiscale feature extractor is proposed. Specifically, we have designed two modules: the Pyramidal Convolution-Based Multimodal Information Fusion Layer and the Pyramidal Pooling-Based Multimodal Information Fusion Layer. They respectively utilize convolution and pooling operations across various dimensions to discern information of different granularities within the multimodal fusion features.

1) Pyramidal Pooling-Based Multimodal Information Fusion Layer: The $i$th linearly fused feature $\tilde{F}^{i}$ is refined by a multiscale feature fusion module. The module consists of two $1\times 1$ convolutional projection layers that enclose average pooling operations. Using pooling at scales $1\times 1$, $2\times 2$, $3\times 3$, and $6\times 6$, the module captures and consolidates features across scales and augments them with residual information. The two projection layers on either side of the average pooling mix the channels before and after the multiscale aggregation.

$$F_{pooling}^{i} = \text{Conv}_{1\times 1}(\tilde{F}^{i}), \tag{9}$$
$$F_{pooling}^{i,k} = \text{Conv}_{1\times 1}(\text{AvgPooling}_{k\times k}(F_{pooling}^{i})), \tag{10}$$
$$F_{pooling}^{i} = \sum_{k\in\{1,2,3,6\}} \text{Upsample}(F_{pooling}^{i,k}), \tag{11}$$
$$F_{pooling}^{i} = \text{Conv}_{1\times 1}(F_{pooling}^{i}). \tag{12}$$
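The pooling branch can be sketched on a single-channel map in plain Python. Several points here are assumptions rather than details from the paper: $\text{AvgPooling}_{k\times k}$ is read as adaptive pooling onto a $k \times k$ grid (as in pyramid scene parsing), the upsampling is nearest-neighbour, the scales are the ones listed in the text (1, 2, 3, 6), and the learned $1\times 1$ projections are omitted.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b."""
    return -(-a // b)


def adaptive_avg_pool(fmap, bins):
    """Average an H x W map onto a bins x bins grid (assumed pooling behaviour)."""
    height, width = len(fmap), len(fmap[0])
    pooled = []
    for by in range(bins):
        y0, y1 = (by * height) // bins, ceil_div((by + 1) * height, bins)
        row = []
        for bx in range(bins):
            x0, x1 = (bx * width) // bins, ceil_div((bx + 1) * width, bins)
            cells = [fmap[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


def upsample_nearest(grid, height, width):
    """Resize a small grid back to H x W by nearest-neighbour lookup."""
    bins_y, bins_x = len(grid), len(grid[0])
    return [
        [grid[y * bins_y // height][x * bins_x // width] for x in range(width)]
        for y in range(height)
    ]


def pyramid_pool(fmap, scales=(1, 2, 3, 6)):
    """Sum the upsampled pooled maps over all scales, as in Eq. (11)."""
    height, width = len(fmap), len(fmap[0])
    out = [[0.0] * width for _ in range(height)]
    for k in scales:
        up = upsample_nearest(adaptive_avg_pool(fmap, k), height, width)
        for y in range(height):
            for x in range(width):
                out[y][x] += up[y][x]
    return out


ramp = [[y * 6 + x for x in range(6)] for y in range(6)]  # toy 6 x 6 feature map
context = pyramid_pool(ramp)
```

The coarsest scale (`k = 1`) contributes the global mean to every pixel, while the finest scale on this toy map (`k = 6`) returns the input unchanged, so each output pixel mixes global and local statistics.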

2) Pyramidal Convolution-Based Multimodal Information Fusion Layer: As in the pooling fusion layer, the $i$th linearly fused feature $\tilde{F}^{i}$ is refined by a dedicated multiscale feature fusion module. Here, two $1\times 1$ convolutional projection layers enclose parallel convolutions with kernels of sizes $3\times 3$, $5\times 5$, and $7\times 7$, analogous to those employed by [26]. The layer is formulated in Eqs. (13)-(16).

Figure 5: Multiscale pooling and convolution. Pooling and convolution at different scales are capable of capturing local and global features across multiple levels, thereby complementing the global attention mechanisms integrated within the backbone architecture effectively. This approach ensures that the resultant fused features encompass a comprehensive focus on both local and global dimensions.
$$F_{conv}^{i} = \text{Conv}_{1\times 1}(\tilde{F}^{i}), \tag{13}$$
$$F_{conv}^{i,k} = \text{Conv}_{k\times k}(F_{conv}^{i}), \tag{14}$$
$$F_{conv}^{i} = \sum_{k\in\{3,5,7\}} \left(F^{i} + F_{conv}^{i,k}\right), \tag{15}$$
$$F_{conv}^{i} = \text{Conv}_{1\times 1}(F_{conv}^{i}). \tag{16}$$
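A single-channel sketch of the multiscale convolution in Eqs. (14)-(15) follows. It rests on assumptions: the learned kernels are replaced by hypothetical uniform (box) kernels, zero padding keeps the spatial size, the residual term written $F^{i}$ in Eq. (15) is taken to be the input of the branch, and the $1\times 1$ projections are omitted.

```python
def box_conv_same(fmap, k):
    """k x k convolution with zero padding and a hypothetical uniform kernel (1 / k^2)."""
    height, width, radius = len(fmap), len(fmap[0]), k // 2
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            total = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < height and 0 <= xx < width:
                        total += fmap[yy][xx]
            row.append(total / (k * k))
        out.append(row)
    return out


def pyramid_conv(fmap, kernels=(3, 5, 7)):
    """Eq. (15) as written: sum over k of (input + k x k response)."""
    height, width = len(fmap), len(fmap[0])
    out = [[0.0] * width for _ in range(height)]
    for k in kernels:
        response = box_conv_same(fmap, k)
        for y in range(height):
            for x in range(width):
                out[y][x] += fmap[y][x] + response[y][x]
    return out


ones_map = [[1.0] * 9 for _ in range(9)]  # toy 9 x 9 feature map
multi = pyramid_conv(ones_map)
```

Larger kernels widen the receptive field of each output pixel; near the border the zero padding lowers the response, which is why the corner value of `multi` is smaller than the centre value.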

The two types of multiscale features are then fused by addition:

$$F_{fusion}^{i} = F_{pooling}^{i} + F_{conv}^{i}. \tag{17}$$

Subsequently, $F_{fusion}^{i}$ passes through a linear layer and a Channel Attention (CA) mechanism [46] for feature refinement.

III-E Shared Segmentation Head

The fused features generated from all the 4 fusion blocks are sent to the shared MLP decoder. We use the decoder design proposed in [3]. The segmentation head shown in Fig. 3 can be represented as the following equations:

$$\hat{F}^{i} = \text{Linear}(F_{out}^{i}), \quad \forall i \in \{1,2,3,4\} \tag{18}$$
$$\hat{F}^{i} = \text{Upsample}(\hat{F}^{i}), \quad \forall i \in \{1,2,3,4\} \tag{19}$$
$$F = \text{Linear}(\hat{F}^{1} \parallel \ldots \parallel \hat{F}^{4}), \tag{20}$$
$$P = \text{Linear}(F). \tag{21}$$

The first linear layers take the fused features of different shapes and project them to a common channel dimension. The features are then up-sampled to $\frac{1}{4}$ of the original input resolution, concatenated along the channel dimension, and passed through another linear layer to generate the final fused feature $F$. Finally, $F$ is passed through the last linear layer to generate the predicted segmentation map $P$.
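The four decoder steps of Eqs. (18)-(21) can be traced with a shape-level sketch in plain Python. Everything numeric here is illustrative: `mean_weight` is a hypothetical stand-in for learned weights, the channel widths are made up, nearest-neighbour upsampling is an assumption, and the stage strides (4, 8, 16, 32) follow the SegFormer-style backbone of [3].

```python
def linear(fmap, weight):
    """Apply one linear map (weight: out_dim x in_dim) to every pixel of an H x W x C map."""
    return [
        [[sum(w * v for w, v in zip(row, pixel)) for row in weight] for pixel in line]
        for line in fmap
    ]


def upsample_nearest(fmap, height, width):
    """Nearest-neighbour resize of an H x W x C map (assumed interpolation mode)."""
    src_h, src_w = len(fmap), len(fmap[0])
    return [
        [fmap[y * src_h // height][x * src_w // width] for x in range(width)]
        for y in range(height)
    ]


def concat(maps):
    """Concatenate same-sized maps along the channel axis."""
    return [
        [sum((m[y][x] for m in maps), []) for x in range(len(maps[0][0]))]
        for y in range(len(maps[0]))
    ]


def mean_weight(out_dim, in_dim):
    """Hypothetical stand-in for learned weights: every output averages its inputs."""
    return [[1.0 / in_dim] * in_dim for _ in range(out_dim)]


def segmentation_head(stage_feats, embed_dim, num_classes):
    height, width = len(stage_feats[0]), len(stage_feats[0][0])  # stage 1 is at 1/4 resolution
    projected = [linear(f, mean_weight(embed_dim, len(f[0][0]))) for f in stage_feats]  # Eq. (18)
    resized = [upsample_nearest(f, height, width) for f in projected]                   # Eq. (19)
    fused = linear(concat(resized), mean_weight(embed_dim, embed_dim * len(resized)))   # Eq. (20)
    return linear(fused, mean_weight(num_classes, embed_dim))                           # Eq. (21)


def ones(height, width, channels):
    return [[[1.0] * channels for _ in range(width)] for _ in range(height)]


# Toy pyramid for a 32 x 32 input: strides 4, 8, 16, 32 with illustrative channel widths.
stages = [ones(8, 8, 2), ones(4, 4, 4), ones(2, 2, 8), ones(1, 1, 16)]
logits = segmentation_head(stages, embed_dim=4, num_classes=3)  # 8 x 8 x 3
```

The output has one score per class at 1/4 of the input resolution; a full pipeline would still resize these scores to the input size and take the per-pixel argmax.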

IV Experiments

IV-A Experimental Setup and Parameters

Figure 6: Visualization on FMB dataset. This figure presents a comprehensive visualization of semantic segmentation results on the FMB dataset, juxtaposing RGB images with their thermal counterparts and the corresponding segmentation outcomes from two different methods: Results with RGB and Results with RGB-T. This figure exemplifies the pivotal role of integrating multimodal data in enhancing segmentation accuracy under varied environmental and lighting conditions.

Experiment Configuration. All experiments were conducted on a computational platform equipped with 4 NVIDIA GeForce RTX 4090 GPUs. To ensure the reproducibility of results, the experiments were performed under consistent hardware and software configurations.

Parameters. Our model used the same training parameters across all datasets. The initial learning rate was set to $6\times 10^{-5}$, and the Adam optimizer was used. The batch size was 4, and the model was trained for 400 epochs on the MCubeS dataset and 120 epochs on the FMB dataset. The cross-entropy loss function was employed. Data augmentation techniques, including random rotations, scaling, and horizontal flipping, were applied to enhance the model's generalization capabilities.
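For reference, the stated hyperparameters can be collected in a small configuration object. This is a sketch, not the released training script: the class and field names are hypothetical, the training-set sizes (302 and 1220) are taken from the dataset description in Section IV-B, and the step counts assume that the batch size of 4 is the global batch size and that the last partial batch is kept.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    dataset: str
    train_images: int          # size of the training split
    epochs: int
    learning_rate: float = 6e-5
    batch_size: int = 4        # assumed to be the global batch size
    optimizer: str = "Adam"
    loss: str = "cross_entropy"
    augmentations: tuple = ("random_rotation", "random_scaling", "horizontal_flip")
    num_gpus: int = 4

    def steps_per_epoch(self) -> int:
        """Optimizer steps per epoch, keeping the last partial batch."""
        return math.ceil(self.train_images / self.batch_size)

    def total_steps(self) -> int:
        return self.epochs * self.steps_per_epoch()


CONFIGS = {
    "MCubeS": TrainConfig(dataset="MCubeS", train_images=302, epochs=400),
    "FMB": TrainConfig(dataset="FMB", train_images=1220, epochs=120),
}
```

Under these assumptions, the two schedules amount to 30,400 optimizer steps on MCubeS and 36,600 on FMB, so the longer epoch count on MCubeS mostly compensates for its smaller training set.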

IV-B Dataset

MCubeS [16]. The MCubeS dataset contains sets of RGB, Near-Infrared (NIR), Degree of Linear Polarization (DoLP), and Angle of Linear Polarization (AoLP) images. It is designed for research on semantic material segmentation across 20 categories. The dataset comprises 302/96/102 image sets for training/validation/testing, all sized 1224×1024.

FMB [9]. The FMB dataset is a comprehensive multi-modality benchmark designed for image fusion and segmentation. It contains 1,500 well-registered pairs of infrared and visible images, annotated with 15 pixel-level categories. The training and test sets contain 1,220 and 280 image pairs, respectively. These images cover a variety of real-world conditions such as dense fog and low-light scenarios, making them particularly suitable for autonomous driving and semantic understanding applications. The dataset aims to improve the generalization capabilities of fusion and segmentation models across diverse environmental conditions.

IV-C Experimental Results

TABLE I: Performance comparison on Multimodal Material Segmentation (MCubeS) dataset [16]. Here A, D, and N represent angle of linear polarization (AoLP), degree of linear polarization (DoLP), and near-infrared (NIR) respectively.
Method Modalities % mIoU
DRConv [47] RGB-A-D-N 34.63
DDF [48] RGB-A-D-N 36.16
TransFuser [49] RGB-A-D-N 37.66
DeepLabv3+ [2] RGB-A-D-N 38.13
MMTM [50] RGB-A-D-N 39.71
FuseNet [6] RGB-A-D-N 40.58
MCubeSNet [16] RGB-A-D-N 42.86
CMNeXt [15] RGB-A-D-N 51.54
U3M (Ours) RGB-A-D-N 51.69

1) Results on MCubeS Dataset. Table I compares different methods by mean Intersection over Union (mIoU), a common metric for evaluating the accuracy of segmentation models. All methods use the RGB-A-D-N modalities, where A, D, and N represent angle of linear polarization (AoLP), degree of linear polarization (DoLP), and near-infrared (NIR), respectively. Our model achieves the highest mIoU of 51.69, outperforming all other listed methods. The closest competitor, CMNeXt, achieves an mIoU of 51.54, so the lead over it is narrow (0.15 points), whereas the margin over the remaining methods is considerably larger. This suggests that U3M handles multimodal inputs effectively, and it indicates potential applicability in real-world scenarios where precise material identification is crucial, such as autonomous driving or quality control in manufacturing.
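As a reminder of how the metric is obtained, the sketch below computes per-class IoU and mIoU from a confusion matrix. It is a generic illustration with hypothetical function names, not the evaluation code used for the tables; in particular, skipping classes that never occur is one common convention and is an assumption here.

```python
def per_class_iou(confusion):
    """confusion[t][p] = number of pixels of true class t predicted as class p."""
    num_classes = len(confusion)
    ious = []
    for c in range(num_classes):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[t][c] for t in range(num_classes)) - tp
        denom = tp + fp + fn
        ious.append(tp / denom if denom else float("nan"))  # NaN marks an absent class
    return ious


def mean_iou(confusion):
    """Average IoU over the classes that occur in the labels or predictions."""
    valid = [v for v in per_class_iou(confusion) if v == v]  # NaN != NaN, so this drops NaN
    return sum(valid) / len(valid)


# Toy 2-class example: 8 pixels, 3 correct per class, 1 confused in each direction.
toy_confusion = [[3, 1],
                 [1, 3]]
toy_miou = mean_iou(toy_confusion)
```

In the toy example each class has 3 true positives, 1 false positive, and 1 false negative, so both IoUs are 3/5 and the mIoU is 0.6. Because every class counts equally, rare classes weigh as much as frequent ones in the reported mean.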

The per-class IoU results on the MCubeS dataset give a more detailed view of segmentation performance across material categories, as shown in Table II. Our proposed model performs robustly and surpasses the compared models in many per-class IoU scores. Notably, it achieves the best results in classes such as 'Leaf' (76.4), 'Water' (63.8), and 'Cobblestone' (73.5), indicating its proficiency in handling the complex textures and varied lighting conditions that these materials present. However, some categories such as 'Human' (12.2) and 'Plastic' (26.3) exhibit weaker performance, suggesting areas for further model refinement. This underperformance could be attributed to the high variability in human appearance and to the often subtle differences in the visual characteristics of plastic materials under different conditions. The performance in the 'Gravel' (71.7) and 'Leaf' (76.4) categories is particularly noteworthy, underscoring the model's effectiveness in segmenting materials with distinct textural properties. The overall mean IoU of 51.7 places our model narrowly ahead of the CMNeXt model, which reaches a mean IoU of 51.5.

TABLE II: Per-class % IoU comparison on MCubeS dataset. Our proposed model shows better performance in detecting most of the classes compared to the current state-of-the-art models. * indicates that the code and pretrained model from the authors were used to generate the results.
Methods | Asphalt | Concrete | Metal | Road marking | Fabric | Glass | Plaster | Plastic | Rubber | Sand | Gravel | Ceramic | Cobblestone | Brick | Grass | Wood | Leaf | Water | Human | Sky | Mean
MCubeSNet [16] | 85.7 | 42.6 | 47.0 | 59.2 | 12.5 | 44.3 | 3.0 | 10.6 | 12.7 | 66.8 | 67.1 | 27.8 | 65.8 | 36.8 | 54.8 | 39.4 | 73.0 | 13.3 | 0.0 | 94.8 | 42.9
CMNeXt [15]* | 84.3 | 44.9 | 53.9 | 74.5 | 32.3 | 54.0 | 0.8 | 28.3 | 29.7 | 67.7 | 66.5 | 27.7 | 68.5 | 42.9 | 58.7 | 49.7 | 75.4 | 55.7 | 18.9 | 96.5 | 51.5
(Ours) | 86.2 | 44.5 | 55.1 | 68.0 | 33.4 | 53.9 | 1.2 | 26.3 | 26.9 | 68.2 | 71.7 | 25.0 | 73.5 | 44.1 | 59.7 | 47.4 | 76.4 | 63.8 | 12.2 | 96.4 | 51.7
TABLE III: Performance comparison on FMB dataset [9]. We show performance for different methods from already published works.
Methods Modalities % mIoU
GMNet [51] RGB-Infrared 49.2
LASNet [52] RGB-Infrared 42.5
EGFNet [41] RGB-Infrared 47.3
FEANet [42] RGB-Infrared 46.8
DIDFuse [43] RGB-Infrared 50.6
ReCoNet [44] RGB-Infrared 50.9
U2Fusion [53] RGB-Infrared 47.9
TarDAL [54] RGB-Infrared 48.1
SegMiF [9] RGB-Infrared 54.8
U3M (Ours) RGB-Infrared 60.8
TABLE IV: Per-class % IoU comparison on FMB [9] dataset for both RGB only and RGB-infrared modalities. T-Lamp and T-Sign stand for Traffic Lamp and Traffic Sign respectively. Our model outperforms all the methods for all the classes except for the truck class.
Methods Modalities Car Person Truck T-Lamp T-Sign Building Vegetation Pole % mIoU
GMNet [51] RGB-Infrared 79.3 60.1 22.2 21.6 69.0 79.1 83.8 39.8 49.2
LASNet [52] RGB-Infrared 72.6 48.6 14.8 2.9 59.0 75.4 81.6 36.7 42.5
EGFNet [41] RGB-Infrared 77.4 63.0 17.1 25.2 66.6 77.2 83.5 41.5 47.3
FEANet [42] RGB-Infrared 73.9 60.7 32.3 13.5 55.6 79.4 81.2 36.8 46.8
DIDFuse [43] RGB-Infrared 77.7 64.4 28.8 29.2 64.4 78.4 82.4 41.8 50.6
ReCoNet [44] RGB-Infrared 75.9 65.8 14.9 34.7 66.6 79.2 81.3 44.9 50.9
U2Fusion [53] RGB-Infrared 76.6 61.9 14.4 6.8 68.9 78.8 82.2 42.2 47.9
TarDAL [54] RGB-Infrared 74.2 56.0 18.3 7.8 69.0 79.1 81.7 41.9 48.1
SegMiF [9] RGB-Infrared 78.3 65.4 18.8 6.5 64.8 78.0 85.0 49.8 54.8
(Ours) RGB 83.3 57.6 41.6 42.7 78.3 81.7 85.6 49.5 60.5
(Ours) RGB-Infrared 82.3 66.0 41.9 46.2 81.0 81.3 86.8 48.8 60.8
TABLE V: Performance evaluation (measured in % mIoU) on the Multimodal Material Segmentation (MCubeS) dataset [16] across various modality pairings is presented. The modalities A, D, and N correspond to angle of linear polarization (AoLP), degree of linear polarization (DoLP), and near-infrared (NIR), respectively.
Modalities MCubeSNet [16] CMNeXt [15] (Ours)
RGB 33.70 48.16 49.22
RGB-A 39.10 48.42 49.89
RGB-A-D 42.00 49.48 50.26
RGB-A-D-N 42.86 51.54 51.69
TABLE VI: Performance comparison on MCubeS dataset with different combinations of proposed modules [9].
Methods Modalities % mIoU
Linear Layer Only RGB-ADN 49.89
Linear Layer + ChannelAttention RGB-ADN 50.34
Linear Layer + PSPModule RGB-ADN 50.62
U3M RGB-ADN 51.69

2) Results on FMB Dataset. Table III compares semantic segmentation models on the FMB dataset using RGB-Infrared inputs, a combination that helps differentiate objects under varying illumination conditions. Our model, U3M, achieves an mIoU of 60.8, which surpasses all other models listed. This performance can be attributed to the effective integration of RGB and infrared data, which allows U3M to capture and use the complementary information provided by these modalities. Infrared imaging, known for its utility in low-light conditions and its ability to differentiate materials based on thermal properties, combined with the rich detail available in RGB images, provides a more comprehensive understanding of the scene and improves segmentation accuracy. The closest competitors, SegMiF and ReCoNet, achieve mIoU scores of 54.8 and 50.9, respectively, leaving a gap of at least 6.0 points to U3M. This gap suggests that the fusion strategy in U3M integrates the two modalities more effectively than the compared methods.

Table IV provides a per-class IoU analysis for models tested on the FMB dataset, using both RGB-only and RGB-infrared inputs. Our model performs best in the majority of categories, with particularly high IoU scores in 'Building' (81.3) and 'Vegetation' (86.8), illustrating its robustness in identifying and segmenting structural and natural elements in urban environments. However, a notable exception is the 'Truck' category, where our model's performance (41.9) lags behind other models such as HDNet (60.7) and EGNet (63.0), indicating potential challenges in distinguishing larger vehicles, possibly because their spectral signatures resemble those of other objects or because this category is underrepresented in the training data. Our model's overall mIoU of 60.8 is the highest among the listed methods, confirming its efficacy in integrating RGB and infrared data. The infrared input is particularly beneficial under varying lighting conditions, as it is less susceptible to variations in visible light. The IoU scores in the 'Traffic Lamp' (46.2) and 'Traffic Sign' (81.0) classes are the highest among the methods listed in Table IV, suggesting that the model effectively uses the infrared component to detect these objects, whose distinct material properties are not always apparent in RGB imagery.

Figure 7: Segmentation results visualization on Mcubes dataset. The figure provides a vivid visualization of semantic segmentation results on the Mcubes dataset, showcasing the effectiveness of different modalities: RGB, RGB-A, RGB-A-D, and RGB-A-D-N.

IV-D Visualization

1) Segmentation result visualization on FMB dataset. The visualization in Fig. 6 illustrates the effectiveness of multimodal semantic segmentation using RGB and thermal (RGB-T) data on the FMB dataset. The comparison of RGB-only and RGB-T segmentation shows the limitations of relying solely on visible light, particularly under adverse lighting conditions. The improved segmentation achieved with RGB-T highlights the role of the thermal modality in distinguishing and classifying elements within urban scenes, and it reinforces the case for incorporating multimodal data to handle diverse environmental conditions. This analysis supports the hypothesis that thermal data improves segmentation, particularly in detecting living entities and separating them from inanimate backgrounds, thereby providing a more reliable segmentation framework for real-world applications.

Figure 8: t-SNE visualization on FMB dataset. The visualization comprises multiple panels; each row corresponds to a different combination of input modalities, labeled RGB and RGB-T. The panels illustrate the distribution and separation of features in a two-dimensional space, giving insight into the discriminative power of the features extracted under different input conditions.

2) t-SNE visualization on FMB dataset. The t-SNE visualizations in Fig. 8 underscore the significant impact of integrating thermal imagery with RGB data on the feature separability and overall effectiveness of semantic segmentation models. The enhanced separation and definition of clusters with RGB-T data corroborate the hypothesis that thermal information aids in resolving ambiguities encountered with RGB-only data, especially in complex segmentation scenarios. This analysis not only highlights the value of multimodal approaches in improving semantic segmentation tasks but also emphasizes the need for models that can effectively leverage diverse data types to achieve more accurate and robust segmentation results. Such insights are critical for advancing the development of segmentation technologies for applications in areas such as autonomous driving, environmental monitoring, and urban planning, where accurate material and object differentiation are paramount.

3) Segmentation result visualization on MCubeS dataset. The segmentation visualizations in Fig. 7 show the incremental benefits of integrating multiple data modalities on the MCubeS dataset. Each added modality (AoLP, DoLP, and NIR) contributes to the segmentation accuracy and addresses specific failure cases of the RGB-only model. The results show that while RGB data provides a foundational layer of visual information, the incorporation of polarization and near-infrared data enriches the feature set available for segmentation, thereby enabling more precise and contextually aware delineation of urban scene elements. This multimodal approach illustrates the potential of such technologies in real-world scenarios where diverse environmental factors and varied object interactions complicate the interpretation of urban spaces.

IV-E Ablation Experiment

1) Ablation experiment on different modalities. Table V reports the mIoU of semantic segmentation on the Multimodal Material Segmentation (MCubeS) dataset for models using different combinations of modalities: RGB, RGB-A, RGB-A-D, and RGB-A-D-N. Our model improves progressively as additional modalities are integrated. With RGB alone, it achieves an mIoU of 49.22, compared to MCubeSNet's 33.70 and CMNeXt's 48.16. With AoLP added (RGB-A), our model scores 49.89, ahead of CMNeXt's 48.42. With both AoLP and DoLP (RGB-A-D), it achieves 50.26, compared to CMNeXt's 49.48. With the most comprehensive combination of RGB, AoLP, DoLP, and NIR (RGB-A-D-N), our model reaches its peak mIoU of 51.69, narrowly ahead of CMNeXt's 51.54 and well above MCubeSNet's 42.86. The incremental improvements observed with each additional modality underscore the value of integrating multimodal data to capture a richer feature set for material segmentation. The integration of NIR yields the largest single gain (1.43 points), likely because it offers material cues that are less dependent on visible light conditions.
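The margins discussed above follow directly from Table V. The short script below recomputes them; the numbers are copied from the table, while the dictionary layout and function names are hypothetical.

```python
# mIoU (%) per modality combination, copied from Table V.
TABLE_V = {
    "RGB":       {"MCubeSNet": 33.70, "CMNeXt": 48.16, "U3M": 49.22},
    "RGB-A":     {"MCubeSNet": 39.10, "CMNeXt": 48.42, "U3M": 49.89},
    "RGB-A-D":   {"MCubeSNet": 42.00, "CMNeXt": 49.48, "U3M": 50.26},
    "RGB-A-D-N": {"MCubeSNet": 42.86, "CMNeXt": 51.54, "U3M": 51.69},
}


def margins(table, ours="U3M", baseline="CMNeXt"):
    """Lead of `ours` over `baseline` for every modality combination."""
    return {name: round(row[ours] - row[baseline], 2) for name, row in table.items()}


def gains_per_added_modality(table, method="U3M"):
    """mIoU gained by `method` each time one more modality is added."""
    names = list(table)
    return {
        f"{prev} -> {cur}": round(table[cur][method] - table[prev][method], 2)
        for prev, cur in zip(names, names[1:])
    }


lead_over_cmnext = margins(TABLE_V)
u3m_gains = gains_per_added_modality(TABLE_V)
```

The lead over CMNeXt is largest for RGB-A (1.47 points) and smallest for the full RGB-A-D-N input (0.15 points), while the largest gain from adding a single modality comes from NIR (1.43 points).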

2) Ablation experiment on different module combinations. Table VI compares semantic segmentation performance on the MCubeS dataset for different combinations of the proposed modules, all using the RGB-A-D-N input. With a baseline configuration that uses a linear layer only, the model achieves an mIoU of 49.89. This setup serves as the reference for evaluating the additional modules. Incorporating a Channel Attention mechanism gives a modest increase, with the mIoU improving to 50.34. Channel Attention likely helps the model focus on relevant features by re-weighting channels, thus providing a more refined feature map for segmentation. A larger improvement is observed when a Pyramid Scene Parsing (PSP) module is added, resulting in an mIoU of 50.62. The PSP module, known for aggregating context at different scales, evidently improves the model's ability to capture and integrate multiscale contextual information, which is crucial for accurate segmentation. The full U3M, which combines the proposed modules, achieves the highest mIoU of 51.69. This indicates that the modules are complementary and that their combination contributes to the segmentation accuracy, improving the model's ability to discriminate between material types and conditions in a complex dataset like MCubeS.

V Conclusion and Limitation

1) Conclusion. The proposed U3M advances multimodal semantic segmentation through its modality integration and feature fusion techniques. It addresses modal bias with an unbiased multiscale modal fusion methodology that treats all modalities equally, thereby reducing hand-designed bias. The model uses multiscale fusion modules that combine convolutional and pooling strategies, integrating modalities at various scales and improving adaptability across diverse environments. Experiments on the MCubeS and FMB datasets show that U3M surpasses the compared models in mIoU, indicating its potential for real-world applications such as autonomous driving and urban planning.

2) Limitation and future work. Future work could focus on optimizing the model’s architecture for greater efficiency. Additionally, integrating more varied modalities and extensive real-world testing could broaden its application scope, ensuring it meets the evolving demands of practical implementations in areas like autonomous driving and urban planning.

References

  • [1] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2017.
  • [2] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 801–818.
  • [3] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “Segformer: Simple and efficient design for semantic segmentation with transformers,” Advances in neural information processing systems, vol. 34, pp. 12 077–12 090, 2021.
  • [4] Y. Liao, J. Xie, and A. Geiger, “Kitti-360: A novel dataset and benchmarks for urban scene understanding in 2d and 3d,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 3, pp. 3292–3310, 2022.
  • [5] S. Dong, W. Zhou, C. Xu, and W. Yan, “Egfnet: Edge-aware guidance fusion network for rgb–thermal urban scene parsing,” IEEE Transactions on Intelligent Transportation Systems, 2023.
  • [6] C. Hazirbas, L. Ma, C. Domokos, and D. Cremers, “Fusenet: Incorporating depth into semantic segmentation via fusion-based cnn architecture,” in Computer Vision–ACCV 2016: 13th Asian Conference on Computer Vision, Taipei, Taiwan, November 20-24, 2016, Revised Selected Papers, Part I 13.   Springer, 2017, pp. 213–228.
  • [7] W. Zhou, S. Dong, M. Fang, and L. Yu, “Cacfnet: Cross-modal attention cascaded fusion network for rgb-t urban scene parsing,” IEEE Transactions on Intelligent Vehicles, 2023.
  • [8] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu, “Dual attention network for scene segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 3146–3154.
  • [9] J. Liu, Z. Liu, G. Wu, L. Ma, R. Liu, W. Zhong, Z. Luo, and X. Fan, “Multi-interactive feature learning and a full-time multi-modality benchmark for image fusion and segmentation,” in Proceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 8115–8124.
  • [10] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
  • [11] Z. Jin, T. Gong, D. Yu, Q. Chu, J. Wang, C. Wang, and J. Shao, “Mining contextual information beyond image for semantic segmentation,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 7231–7241.
  • [12] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2881–2890.
  • [13] J. Zhang, H. Liu, K. Yang, X. Hu, R. Liu, and R. Stiefelhagen, “Cmx: Cross-modal fusion for rgb-x semantic segmentation with transformers,” IEEE Transactions on Intelligent Transportation Systems, 2023.
  • [14] X. Hu, K. Yang, L. Fei, and K. Wang, “Acnet: Attention based network to exploit complementary features for rgbd semantic segmentation,” in 2019 IEEE International conference on image processing (ICIP). IEEE, 2019, pp. 1440–1444.
  • [15] J. Zhang, R. Liu, H. Shi, K. Yang, S. Reiß, K. Peng, H. Fu, K. Wang, and R. Stiefelhagen, “Delivering arbitrary-modal semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1136–1147.
  • [16] Y. Liang, R. Wakaki, S. Nobuhara, and K. Nishino, “Multimodal material segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 19800–19808.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [18] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [19] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 10012–10022.
  • [20] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
  • [21] X. Dong, J. Bao, D. Chen, W. Zhang, N. Yu, L. Yuan, D. Chen, and B. Guo, “Cswin transformer: A general vision transformer backbone with cross-shaped windows,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 12124–12134.
  • [22] Y. Yuan, R. Fu, L. Huang, W. Lin, C. Zhang, X. Chen, and J. Wang, “Hrformer: High-resolution transformer for dense prediction,” arXiv preprint arXiv:2110.09408, 2021.
  • [23] Q. He, “Prompting multi-modal image segmentation with semantic grouping,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 3, 2024, pp. 2094–2102.
  • [24] M. K. Reza, A. Prater-Bennette, and M. S. Asif, “Multimodal transformer for material segmentation,” arXiv preprint arXiv:2309.04001, 2023.
  • [25] B. Cao, J. Guo, P. Zhu, and Q. Hu, “Bi-directional adapter for multimodal tracking,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 2, 2024, pp. 927–935.
  • [26] M. K. Reza, A. Prater-Bennette, and M. S. Asif, “Multimodal transformer for material segmentation,” arXiv preprint arXiv:2309.04001, 2023.
  • [27] Q. Hou, L. Zhang, M.-M. Cheng, and J. Feng, “Strip pooling: Rethinking spatial pooling for scene parsing,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 4003–4012.
  • [28] G. Lin, A. Milan, C. Shen, and I. Reid, “Refinenet: Multi-path refinement networks for high-resolution semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1925–1934.
  • [29] S. Choi, J. T. Kim, and J. Choo, “Cars can’t fly up in the sky: Improving urban-scene segmentation via height-driven attention networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9373–9383.
  • [30] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal, “Context encoding for semantic segmentation,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2018, pp. 7151–7160.
  • [31] C. Yu, J. Wang, C. Gao, G. Yu, C. Shen, and N. Sang, “Context prior for scene segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 12416–12425.
  • [32] Y. Yuan, L. Huang, J. Guo, C. Zhang, X. Chen, and J. Wang, “Ocnet: Object context for semantic segmentation,” International Journal of Computer Vision, vol. 129, no. 8, pp. 2375–2398, 2021.
  • [33] X. Li, X. Li, L. Zhang, G. Cheng, J. Shi, Z. Lin, S. Tan, and Y. Tong, “Improving semantic segmentation via decoupled body and edge supervision,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVII 16. Springer, 2020, pp. 435–452.
  • [34] S. Borse, Y. Wang, Y. Zhang, and F. Porikli, “Inverseform: A loss function for structured boundary-aware segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 5901–5911.
  • [35] B. Cheng, A. Schwing, and A. Kirillov, “Per-pixel classification is not all you need for semantic segmentation,” Advances in neural information processing systems, vol. 34, pp. 17864–17875, 2021.
  • [36] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, “Masked-attention mask transformer for universal image segmentation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 1290–1299.
  • [37] M.-H. Guo, C.-Z. Lu, Z.-N. Liu, M.-M. Cheng, and S.-M. Hu, “Visual attention network,” Computational Visual Media, vol. 9, no. 4, pp. 733–752, 2023.
  • [38] Q. Hou, Z. Jiang, L. Yuan, M.-M. Cheng, S. Yan, and J. Feng, “Vision permutator: A permutable mlp-like architecture for visual recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 45, no. 1, pp. 1328–1334, 2022.
  • [39] D. Lian, Z. Yu, X. Sun, and S. Gao, “As-mlp: An axial shifted mlp architecture for vision,” arXiv preprint arXiv:2107.08391, 2021.
  • [40] M.-H. Guo, C.-Z. Lu, Q. Hou, Z. Liu, M.-M. Cheng, and S.-M. Hu, “Segnext: Rethinking convolutional attention design for semantic segmentation,” Advances in Neural Information Processing Systems, vol. 35, pp. 1140–1156, 2022.
  • [41] W. Zhou, S. Dong, C. Xu, and Y. Qian, “Edge-aware guidance fusion network for rgb–thermal scene parsing,” in Proceedings of the AAAI conference on artificial intelligence, vol. 36, no. 3, 2022, pp. 3571–3579.
  • [42] F. Deng, H. Feng, M. Liang, H. Wang, Y. Yang, Y. Gao, J. Chen, J. Hu, X. Guo, and T. L. Lam, “Feanet: Feature-enhanced attention network for rgb-thermal real-time semantic segmentation,” in 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021, pp. 4467–4473.
  • [43] Z. Zhao, S. Xu, C. Zhang, J. Liu, P. Li, and J. Zhang, “Didfuse: Deep image decomposition for infrared and visible image fusion,” arXiv preprint arXiv:2003.09210, 2020.
  • [44] Z. Huang, J. Liu, X. Fan, R. Liu, W. Zhong, and Z. Luo, “Reconet: Recurrent correction network for fast and efficient multi-modality image fusion,” in European Conference on Computer Vision. Springer, 2022, pp. 539–555.
  • [45] P. Li, J. Chen, B. Lin, and X. Xu, “Residual spatial fusion network for rgb-thermal semantic segmentation,” arXiv preprint arXiv:2306.10364, 2023.
  • [46] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
  • [47] J. Chen, X. Wang, Z. Guo, X. Zhang, and J. Sun, “Dynamic region-aware convolution,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 8064–8073.
  • [48] J. Zhou, V. Jampani, Z. Pi, Q. Liu, and M.-H. Yang, “Decoupled dynamic filter networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6647–6656.
  • [49] A. Prakash, K. Chitta, and A. Geiger, “Multi-modal fusion transformer for end-to-end autonomous driving,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 7077–7087.
  • [50] H. R. V. Joze, A. Shaban, M. L. Iuzzolino, and K. Koishida, “Mmtm: Multimodal transfer module for cnn fusion,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 13289–13299.
  • [51] W. Zhou, J. Liu, J. Lei, L. Yu, and J.-N. Hwang, “Gmnet: Graded-feature multilabel-learning network for rgb-thermal urban scene semantic segmentation,” IEEE Transactions on Image Processing, vol. 30, pp. 7790–7802, 2021.
  • [52] G. Li, Y. Wang, Z. Liu, X. Zhang, and D. Zeng, “Rgb-t semantic segmentation with location, activation, and sharpening,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 3, pp. 1223–1235, 2022.
  • [53] H. Xu, J. Ma, J. Jiang, X. Guo, and H. Ling, “U2fusion: A unified unsupervised image fusion network,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 1, pp. 502–518, 2020.
  • [54] J. Liu, X. Fan, Z. Huang, G. Wu, R. Liu, W. Zhong, and Z. Luo, “Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 5802–5811.