
Noise-aware infrared polarization image fusion based on salient prior with attention-guided filtering network

Open Access

Abstract

Infrared polarization image fusion integrates intensity and polarization information, producing a fused image that enhances visibility and captures crucial details. However, in complex environments, polarization imaging is susceptible to noise interference. Existing fusion methods typically use the infrared intensity (S0) and degree of linear polarization (DoLP) images for fusion but fail to consider the noise interference, leading to reduced performance. To cope with this problem, we propose a fusion method based on a polarization salient prior, which extends DoLP with the angle of polarization (AoP) and introduces the polarization distance (PD) to obtain salient target features. Moreover, according to the distribution difference between S0 and DoLP features, we construct a fusion network based on attention-guided filtering, utilizing cross-attention to generate filter kernels for fusion. The quantitative and qualitative experimental results validate the effectiveness of our approach. Compared with other fusion methods, our method can effectively suppress noise interference and preserve salient target features.

© 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Infrared imaging technology utilizes infrared radiation intensity to detect targets. When the difference in infrared radiation between the target and the background is slight [1], it is difficult for infrared imaging to achieve effective target detection. In addition, with the development of infrared camouflage and interference technology, the difficulty of infrared detection further increases. Polarization is an inherent property of light that provides unique information about objects, such as surface smoothness, 3D normal [2], and material composition [3]. Differences in surface material, roughness, and physicochemical properties cause a target to exhibit distinct infrared polarization characteristics. Therefore, infrared polarization images can effectively identify artificial targets hidden in natural backgrounds. Research on infrared polarization image fusion, combining infrared intensity (S0) and degree of linear polarization (DoLP), has excellent potential in many fields of application, such as space exploration, military reconnaissance, and disaster relief search. Owing to their distinct imaging mechanisms, infrared intensity and polarization images reflect different features, and each contains complementary and discriminative information about the same scene. The infrared details from the intensity image and the salient target information from the polarization image should be preserved or even enhanced in the fused result.

Traditional image fusion algorithms usually perform activity-level measurements in the spatial or transformed domain and design fusion rules manually to achieve image fusion. Multi-scale transformations [4], saliency-based methods [5], low-rank representations [6,7], and sparse representations [8] are representative approaches. These methods generally use the same transformation or representation to extract features from the source images, without accounting for the essential differences between them. Moreover, the manually designed activity-level measurements and fusion rules cannot adapt to complex fusion scenarios, and algorithm complexity keeps increasing in pursuit of better fusion performance.

With the wide application of deep learning, more deep neural network fusion methods have been applied to image fusion, and CNNs have become the backbone for image fusion [9–19], addressing the shortcomings above. Well-designed network structures such as auto-encoder-based [9,10], ResNet-based [11], and generative adversarial network-based [12–15] frameworks have achieved good results in specific fusion tasks. In recent years, as an alternative to CNNs, the Transformer [20] has utilized a self-attention mechanism to capture long-range information and shows promising performance in natural language processing and computer vision tasks. In particular, the image fusion community has also introduced the Transformer to model global dependence, with competitive fused results [21,22].

Although current fusion methods have yielded promising results, they still face challenges in the fusion of infrared polarization images. (1) The complex acquisition environment and lighting conditions can introduce noise into polarization imaging. Polarization parameters are derived from the measured intensity by nonlinear operators, which may amplify the noise of the intensity measurement [23]. As shown in Fig. 1, despite possessing more salient target characteristics than S0, the DoLP is more susceptible to noise interference [24]. (2) Existing methods combine S0 and DoLP without considering the noise interference during fusion. Advances in neural networks allow more information to be captured, but these networks do not effectively distinguish the contribution of each image, leading to noise interference in the fused image. (3) Compared with improvements to network architecture, few methods analyze the distribution characteristics of infrared polarization images from the imaging mechanism in order to obtain corresponding prior information for the fusion.

Fig. 1. Infrared polarization images in noisy scenes.

Different infrared polarization features are analyzed and compared to address the above issues in this paper. We utilize the angle of polarization (AoP) to extend the original DoLP and introduce the concept of polarization distance [25] (PD) to better reflect the contrast between the target and the background. Combining this polarization prior information, we propose a multi-scale polarization salient feature fusion (PSFF) method, which highlights salient targets and suppresses background interference more effectively than DoLP alone. In addition, we observe that even though the infrared intensity S0 has low contrast, its background contains rich details and is less affected by noise than the polarization images. Therefore, we construct an attention-guided filtering (AGF) based fusion network, which utilizes S0 as the guidance image and adaptively generates the filtering kernel through cross-attention. With this weighted filtering operation, the network can effectively integrate the background details of the infrared intensity image with the polarization salient target, further reducing noise interference while preserving salient targets. In summary, our contributions are as follows:

  • To address the noise interference problem in infrared polarization image fusion, we refine the input of DoLP and propose an input framework based on polarization salient feature fusion to reduce background interference.
  • Based on the differences in the distribution of infrared intensity and polarization images, we construct a fusion network based on attention-guided filtering, further suppressing background noise and preserving salient targets.
  • The quantitative and qualitative results demonstrate the effectiveness of the proposed method. Our approach performs favorably in noisy scenes compared to the existing image fusion methods. The code is available in Code 1 [26].

2. Related work

2.1 Traditional fusion methods

According to their main ideas, traditional fusion methods can be divided into several categories, including methods based on multi-scale transformation [4], sparse representation [8], subspaces [27], saliency [5], and other fusion methods [28–30]. The multi-scale transform-based fusion method includes multi-scale decomposition, fusion, and reconstruction; wavelet transformation [31], pyramid transformation [4], and curvelet transformation are common forms of decomposition and reconstruction. Methods based on sparse representation learn an over-complete dictionary from a set of training images and use it for image fusion, and constructing the dictionary is the key to this type of method. Subspace-based methods are similar to sparse representation but process information in a complete basis space; principal component analysis [27], independent component analysis, and non-negative matrix factorization [32] are all subspace-based methods. Saliency-based methods capture regions of interest by exploiting the sensitivity of the human visual system to pixel intensity. Other methods include those based on total variation [28], entropy [29], and polarization statistical characteristics. For example, Mo et al. [30] improve fusion performance by weighting and accumulating polarization images based on multi-angle orthogonal differential polarization characteristics.

2.2 Deep learning-based image fusion

Recently, many CNN-based image fusion methods have been proposed. For instance, Li et al. [33] use the VGG-19 network to extract deep features for image fusion. However, their fusion strategy is relatively simple, and the feature extraction ability of the VGG network is limited, leading to information loss. Since convolutional autoencoder networks have strong image representation ability, many researchers have migrated this structure to image fusion tasks. Prabhakar et al. [34] first use this architecture for multi-exposure image fusion, but the network structure is relatively simple, and the extracted features are prone to losing important information. Inspired by this work, Li and Wu [9] propose a new image fusion architecture, DenseFuse, which consists of an encoder, a fusion layer, and a decoder. The fusion layer combines features by weighted addition or an L1-norm strategy, and this structure has guided the design of subsequent image fusion networks. However, the final fusion quality depends on the manually designed fusion strategy. To avoid hand-crafted fusion strategies, some researchers have proposed end-to-end neural networks that fuse images directly. The GAN-based fusion framework FusionGAN [35] uses the generator to fuse the features of the input source images, while the discriminator constrains how much detail the generated image can draw from the visible image; that is, the fused image acquires as much visible-image detail as possible without becoming overly similar to the visible image. The loss function of FusionGAN consists of a content loss and a discriminator loss: the content loss makes the fused image retain salient target information from the infrared image, whereas the discriminator loss makes it preserve detail such as the texture and edges of the visible image. For the polarization image fusion task, Liu et al. [15] propose a semantic-guided dual-discriminator GAN, which constrains the generator to extract features of both modalities and reduce information loss. More recently, influenced by natural language processing, researchers have also proposed Transformer-based image fusion frameworks [36,37] in addition to the CNN-based ones above. In particular, Ma et al. [21] build a general fusion network on the mainstream SwinTransformer, involving intra- and inter-domain feature fusion and a combined SSIM, texture, and intensity loss, which effectively improves the fusion of various images such as infrared-visible, multi-focus, and multi-exposure pairs.

Previous studies have shown promising results. Nevertheless, the existing methods are primarily designed for infrared-visible, multi-focus, and multi-exposure image fusion, which may not be optimal for infrared polarization image fusion, especially in noisy scenes. In addition, compared with the improvement of network structure, few studies have been done to improve image fusion performance based on infrared polarization characteristics. Therefore, by analyzing the characteristic distribution of the infrared polarization image, this paper exploits the polarization salient feature priors to improve network input, suppress background noise interference, and highlight the salient target. At the same time, according to the distribution differences between infrared intensity and polarization images, we construct a fusion network based on the AGF using the cross-attention mechanism of SwinTransformer. We utilize the infrared intensity image as a guide map to achieve fusion filtering, which has the advantage of preserving edge contour information of salient targets while reducing background interference. Our approach is expected to provide improved accuracy and reliability in object detection.

3. Proposed method

This section introduces the proposed infrared polarization image fusion method in noisy scenes. We first improve the traditional input method based on salient prior from polarization imaging. After that, we present the overall network structure, polarization salient feature fusion network, attention-guided filtering fusion module, and the corresponding loss function.

3.1 Infrared polarization input with salient prior

Generally, optical intensity images $I_{0}$, $I_{45}$, $I_{90}$, and $I_{135}$ at polarizer orientations of 0, 45, 90, and 135 degrees can be obtained with a polarization camera. Equation (1) relates them to the Stokes components (S0, S1, and S2) through the polarization ellipse.

$$\left[ {\begin{array}{c} {{\rm{S0}}}\\ {{\rm{S1}}}\\ {{\rm{S2}}} \end{array}} \right] = \left[ {\begin{array}{c} {{a^2}}\\ {{a^2}\cos (2\chi )\cos (2\psi )}\\ {{a^2}\cos (2\chi )\sin (2\psi )} \end{array}} \right] = \left[ {\begin{array}{c} {\frac{{\rm{1}}}{{\rm{2}}}({I_0} + {I_{45}} + {I_{90}} + {I_{135}})}\\ {{I_0} - {I_{90}}}\\ {{I_{45}} - {I_{135}}} \end{array}} \right]$$

In Eq. (1), $\psi$ represents the AoP defined in Eq. (2), $\chi$ denotes the angle of ellipticity, and $a$ is the radiation intensity. From Eq. (3), DoLP can be expressed as $\left | {\cos (2{\rm {\chi }})} \right |$, where $\left | \cdot \right |$ represents the absolute value operator.

$$AoP = \psi = \frac{1}{2}{\tan ^{ - 1}}\left( {\frac{{{\rm{S2}}}}{{{\rm{S1}}}}} \right)$$
$$DoLP = \frac{{\sqrt {{{({\rm{S1}})}^2} + {{({\rm{S2}})}^2}} }}{{{\rm{S0}}}} = \left| {\cos (2{\rm{\chi }})} \right|$$

Combining DoLP with AoP according to [38] yields the feature $\rho$, a DoLP with sign information, as shown in Eq. (4).

$$\rho = \frac{{{\rm{S2}}}}{{{\rm{S0}}\sin (2AoP)}}{\rm{ = }}\cos (2{\rm{\chi }})$$

We can obtain the corresponding infrared polarization features using the above formulas, as shown in Fig. 2. It can be seen that feature $\rho$ can suppress noise background interference in DoLP.
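For concreteness, the NumPy sketch below evaluates Eqs. (1)–(4) on the four directional intensity images; the function name and the small `eps` term guarding against division by zero are our additions and do not appear in the paper.

```python
import numpy as np

def polarization_features(i0, i45, i90, i135, eps=1e-8):
    """Stokes components, AoP, DoLP, and the signed extension rho (Eqs. 1-4).
    Inputs are the 0/45/90/135 degree intensity images as float arrays."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)
    s1 = i0 - i90
    s2 = i45 - i135
    aop = 0.5 * np.arctan(s2 / (s1 + eps))           # Eq. (2), principal-value arctan
    dolp = np.sqrt(s1 ** 2 + s2 ** 2) / (s0 + eps)   # Eq. (3)
    rho = s2 / (s0 * np.sin(2.0 * aop) + eps)        # Eq. (4), DoLP with sign information
    return s0, dolp, aop, rho
```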

Fig. 2. Comparison between DoLP and its extension $\rho$.

We further introduce the polarization distance model [25] to obtain salient target regions in infrared polarization images. Polarization distance is a biologically inspired model that simulates the polarization information processing mechanisms of biological vision. The concept of "distance" is derived from the visual model of color distance, and polarization distance can be used to analyze the contrast between the target and the background.

$$R(\phi ,\rho ) = \left[ {1 + \left( {\frac{{({S_P} - 1)\rho }}{{{S_P} + 1}}} \right)\cos (2\phi )} \right]$$
$$P{T_1} = \ln \frac{{R(0,\rho )}}{{R(\pi /2,\rho )}},P{T_2} = \ln \frac{{P{T_1}}}{{R(\pi /4,\rho )}},P{T_3} = \ln \frac{{P{T_2}}}{{R(3\pi /4,\rho )}}$$
$${R_B}(\phi ,\rho ) = R(\phi ,\rho )(1 - \rho )$$
$$P{B_1} = \ln \frac{{{R_B}(0,\rho )}}{{{R_B}(\pi /2,\rho )}},P{B_2} = \ln \frac{{P{B_1}}}{{{R_B}(\pi /4,\rho )}},P{B_3} = \ln \frac{{P{B_2}}}{{{R_B}(3\pi /4,\rho )}}$$

Here, we use the ${0^ \circ }$, ${45^ \circ }$, ${90^ \circ }$, and ${135^ \circ }$ directional infrared polarization images as input to compute the four-channel polarization distance [39]. Equation (5) computes the sensitivity $R(\phi,\rho )$ for linearly polarized light in different directions, where $\phi \in \{0, \pi/4, \pi/2, 3\pi/4\}$. Instead of the DoLP, we employ the feature $\rho$ to highlight salient targets. ${S_P}$ denotes the polarization sensitivity level. We then combine the sensitivities of the different polarization directions and use Eq. (6) to calculate the activation values $P{T_1}$, $P{T_2}$, and $P{T_3}$ in the target region. Similarly, through Eqs. (7) and (8), we obtain the corresponding sensitivity ${R_B}(\phi,\rho )$ and activation values $P{B_1}$, $P{B_2}$, and $P{B_3}$ in the background region. The polarization distance PD, reflecting the contrast between the target and the background, is as follows:

$$PD = \frac{{\left| {P{T_3} - P{B_3}} \right|}}{{2\ln {S_P}}}$$

In Fig. 3, the infrared polarization distance further highlights the salient target area and suppresses the interference from the background. Based on the infrared polarization salient prior mentioned above, we combine the extension $\rho$ of DoLP with PD to enhance the fusion with S0.
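A direct NumPy transcription of Eqs. (5)–(9) is sketched below. The sensitivity level $S_P$ is left as a parameter whose default value of 10 is purely illustrative and not taken from the paper; in practice $\rho$ and the intermediate activations may need clipping so that the logarithm arguments stay positive.

```python
import numpy as np

def polarization_distance(rho, s_p=10.0):
    """Polarization distance PD from the signed feature rho (Eqs. 5-9)."""
    def r(phi):                                       # target sensitivity, Eq. (5)
        return 1.0 + ((s_p - 1.0) * rho / (s_p + 1.0)) * np.cos(2.0 * phi)

    def r_b(phi):                                     # background sensitivity, Eq. (7)
        return r(phi) * (1.0 - rho)

    # Target-region activations, Eq. (6)
    pt1 = np.log(r(0.0) / r(np.pi / 2))
    pt2 = np.log(pt1 / r(np.pi / 4))
    pt3 = np.log(pt2 / r(3 * np.pi / 4))
    # Background-region activations, Eq. (8)
    pb1 = np.log(r_b(0.0) / r_b(np.pi / 2))
    pb2 = np.log(pb1 / r_b(np.pi / 4))
    pb3 = np.log(pb2 / r_b(3 * np.pi / 4))
    return np.abs(pt3 - pb3) / (2.0 * np.log(s_p))    # Eq. (9)
```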

Fig. 3. Infrared polarization distance PD with salient target features. To better highlight salient features, the pseudo-color map is used to visualize PD.

3.2 Overall framework

Figure 4 shows the proposed infrared polarization fusion network utilizing SwinTransformer Layer [40] (STL) as the basic module. STL employs a hierarchical architecture with shifted windows to capture rich spatial information at different scales. STL first reshapes the input of size $H \times W \times C$ to the tensor of size $\frac {{HW}}{{{M^2}}} \times {M^2} \times C$ and partitions it into non-overlapping windows of size $M \times M$. According to Eq. (10), the self-attention feature is calculated within each non-overlapping window.

$$Attention\left( {{F_Q},{F_K},{F_V}} \right) = SoftMax\left( {\frac{{{F_Q}F_K^T}}{{\sqrt d }} + B} \right){F_V}$$

Fig. 4. The overall network structure of our approach.

For each window feature $F$, $F_Q$, $F_K$, and $F_V$ are defined as

$${F_Q} = F{W_Q},{F_K} = F{W_K},{F_V} = F{W_V},({F_Q},{F_K},{F_V} \in {{\mathbb{R}}^{{M^2} \times d}})$$

$W_Q$, $W_K$, and $W_V$ are projection matrices, $d$ is the query/key dimension, and $B$ is the relative positional encoding. The attention score, computed by multiplying $F_Q$ and $F_K^T$, represents the relationship between every pair of pixels within the window. It is scaled by $1/\sqrt d$, combined with the positional encoding $B$, and normalized with SoftMax; the normalized scores are then multiplied with $F_V$ to obtain the weighted attention features.

We perform window multi-head self-attention (WMSA) by executing the attention function three times in parallel and combining the outputs. The resulting features are transformed using a multi-layer perceptron (MLP) with Gaussian error linear units, following the operation in [40]. The above process is formulated as

$$F = {\rm{WMSA}}({\rm{LN}}(F)) + F,F = {\rm{MLP}}({\rm{LN}}(F)) + F.$$

The LayerNorm (LN) is added before both WMSA and MLP, and the residual connection is employed for both modules. We shift the obtained window attention feature tensor by $\left \lfloor {\frac {M}{2}} \right \rfloor$ and perform WMSA again to achieve shifted window multi-head attention (SWMSA), which enables information interaction between windows. It is followed by repeating the above feature transformation and normalization to obtain the output of STL.
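The PyTorch sketch below illustrates one windowed attention block in the spirit of Eqs. (10)–(12). It leans on `torch.nn.MultiheadAttention`, omits the relative position bias $B$ and the window-shifting step for brevity, and its window size and head count are illustrative choices rather than the paper's settings.

```python
import torch
import torch.nn as nn

class WindowSelfAttention(nn.Module):
    """One WMSA + MLP block over non-overlapping M x M windows (Eqs. 10-12)."""
    def __init__(self, dim, window_size=8, num_heads=3):
        super().__init__()          # dim must be divisible by num_heads
        self.m = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        # x: (B, H, W, C) with H and W divisible by the window size M
        b, h, w, c = x.shape
        m = self.m
        xw = (x.view(b, h // m, m, w // m, m, c)
                .permute(0, 1, 3, 2, 4, 5)
                .reshape(-1, m * m, c))               # (B*HW/M^2, M^2, C)
        y = self.norm1(xw)
        y, _ = self.attn(y, y, y)                     # self-attention within each window
        xw = xw + y                                   # residual connection, Eq. (12)
        xw = xw + self.mlp(self.norm2(xw))
        return (xw.view(b, h // m, w // m, m, m, c)
                  .permute(0, 1, 3, 2, 4, 5)
                  .reshape(b, h, w, c))
```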

Based on the STL, we perform multi-scale feature extraction on both infrared intensity and polarization images separately. Then, we design the polarization salient feature fusion network PSFF for the fusion of $\rho$ and PD. Next, we construct the attention-guided filtering module AGF to fuse the infrared intensity and polarization features at different scales. Finally, according to the operation in [41], we reconstruct the fused infrared polarization features with up-sampling STL to obtain the final fusion image $I_f$.

3.3 Polarization salient feature fusion network

PSFF is designed to extract infrared polarization salient target features. Specifically, we use STL to extract the infrared polarization features of $\rho$ and PD at different scales. Here, STL shares weights to obtain common salient target regions. Moreover, we introduce the attentional feature fusion module (AFF) [42] to merge polarization salient target features from $\rho$ and PD.

As shown in Fig. 5, AFF can adapt to both large and small targets based on multi-scale channel attention (MSCA). MSCA applies channel attention at multiple scales by adjusting the spatial pooling sizes. On top of global channel attention, it uses point-wise convolutions to introduce local channel attention features, which enables the fusion of infrared polarization features at different scales. In our work, we adopt the AFF with two-stage iterative feature fusion, as proposed in [42]. The mathematical representation of AFF is:

$$\begin{array}{l} z' = {\rm{MSCA(}}x + y{\rm{)}} \otimes x + (1 - {\rm{MSCA(}}x + y{\rm{)}}) \otimes y\\ z = {\rm{MSCA(}}z'{\rm{)}} \otimes x + (1 - {\rm{MSCA(}}z'{\rm{)}}) \otimes y \end{array}$$
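A minimal PyTorch sketch of the two-stage iterative AFF of Eq. (13) follows. The MSCA branch layout and its channel-reduction ratio are our assumptions for illustration, and whether the two stages share MSCA weights is likewise an assumption rather than something stated in the paper.

```python
import torch
import torch.nn as nn

class MSCA(nn.Module):
    """Multi-scale channel attention: a global (pooled) branch plus a
    point-wise local branch, combined and squashed to (0, 1)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        mid = max(channels // reduction, 1)
        self.local = nn.Sequential(nn.Conv2d(channels, mid, 1), nn.ReLU(inplace=True),
                                   nn.Conv2d(mid, channels, 1))
        self.glob = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(channels, mid, 1), nn.ReLU(inplace=True),
                                  nn.Conv2d(mid, channels, 1))

    def forward(self, x):
        return torch.sigmoid(self.local(x) + self.glob(x))

class IterativeAFF(nn.Module):
    """Two-stage iterative attentional feature fusion, Eq. (13)."""
    def __init__(self, channels):
        super().__init__()
        self.msca1, self.msca2 = MSCA(channels), MSCA(channels)

    def forward(self, x, y):
        w1 = self.msca1(x + y)
        z_prime = w1 * x + (1.0 - w1) * y     # first fusion stage
        w2 = self.msca2(z_prime)
        return w2 * x + (1.0 - w2) * y        # second (refined) fusion stage
```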

Fig. 5. Attentional feature fusion module. After feature fusion, it has the same shape as the input feature. $\otimes$ denotes element-wise multiplication.

3.4 Attention-guided filtering module

We introduce the concept of the guided filter [43], enabling the fusion of background details and salient targets in infrared intensity and polarization images. Guided filtering transfers structural information from the guide image to the target image, enhancing visual information in different domains and modalities. The attention-guided filtering module AGF proposed in this paper is composed of two sub-modules, attention kernel learning (AKL) and guided image filtering (GIF). AKL can generate corresponding filtering kernels based on the cross-attention of infrared intensity and polarization images. By using the infrared intensity feature as guidance, we can effectively reduce noise interference in the polarization image while preserving salient target information. This approach leverages the distinct features of both images to enhance the overall quality of the output.

As shown in Fig. 6, we take features $F_{S0}$ and $F_P$ of infrared intensity and polarization images as inputs of AKL. AKL first employs dual-branch STL to generate filtering kernels $W_{S0}$ and $W_P$ separately for infrared intensity and polarization features. Then, we further introduce the cross-attention fusion unit to integrate global interactions between infrared intensity and polarization information. Multi-head cross-attention (MCA) is performed in the fusion unit to achieve inter-domain contextual interaction. Therefore, the operation of the cross-attention fusion unit can be defined as follows:

$$\begin{array}{l} \{ {F_{Q1}},{F_{K1}},{F_{V1}}\} = \{ {F_P}W_Q^1,{F_P}W_K^1,{F_P}W_V^1\} ,\\ \{ {F_{Q2}},{F_{K2}},{F_{V2}}\} = \{ {F_{S0}}W_Q^2,{F_{S0}}W_K^2,{F_{S0}}W_V^2\} ,\\ {{A'}_P} = {\rm{MCA(LN}}({F_{Q1}},{F_{K2}},{F_{V2}})) + {F_P},\\ {{A'}_{S0}} = {\rm{MCA}}({\rm{LN}}({F_{Q2}},{F_{K1}},{F_{V1}})) + {F_{S0}},\\ {A_P} = {\rm{MLP}}({\rm{LN}}({{A'}_P})) + {{A'}_P},\\ {A_{S0}} = {\rm{MLP}}({\rm{LN}}({{A'}_{S0}})) + {{A'}_{S0}}. \end{array}$$

Fig. 6. The network architecture of the proposed attentional kernel learning module.

As shown in Eq. (14), we perform an attention-weighting operation on the polarization feature $F_{Q1}$ and the features $F_{K2}$ and $F_{V2}$ from the infrared intensity image. The same operation is applied to features $F_{Q2}$, $F_{K1}$, and $F_{V1}$. $A_{P}$ and $A_{S0}$ are the outputs of the cross-attention interaction. These attention maps adaptively combine the filtering kernels $W_P$ and $W_{S0}$ from infrared polarization and intensity features. The final guided filter kernel $W_f$ can be expressed as

$${W_f} = {A_P} \otimes {W_P} + {A_{S0}} \otimes {W_{S0}}$$

After that, in the guided image filtering module GIF, we concatenate and convolve $F_P$ and $F_{S0}$ and apply the kernel $W_f$ for filtering to obtain the fused feature $F_f$. The filtering process over pixels $\{(u,v) \mid 0 \le u < H, 0 \le v < W\}$ can be formulated as follows:

$$\left\{ \begin{array}{l} F' = Conv(\left[ {{F_P},{F_{S0}}} \right])\\ {F_f}(u,v) = \sum\nolimits_{x ={-} r}^r {\sum\nolimits_{y ={-} r}^r {W_f^{(u,v)}} (x,y)} \cdot F'(u - x,v - y) \end{array} \right.{\rm{}}$$
where $\left [ { \cdot, \cdot } \right ]$ denotes the concatenation operation and $r = \left \lfloor {k/2} \right \rfloor$, with $k$ the size of the filter kernel. Using the up-sampling STL, the feature $F_f$ is reconstructed to obtain the final fused image $I_f$.
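To make Eqs. (15) and (16) concrete, the sketch below mixes the two predicted kernels with the attention maps and applies the resulting spatially varying filter via `unfold`. The tensor shapes (attention maps of size $B\times1\times H\times W$, kernels of size $B\times k^2\times H\times W$), the softmax normalization of $W_f$, and the use of correlation instead of a strictly flipped convolution are our assumptions for this illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedFilterFusion(nn.Module):
    """GIF step: combine kernels with attention maps (Eq. 15) and apply the
    per-pixel filter to F' = Conv([F_P, F_S0]) (Eq. 16)."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.k = k
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, f_p, f_s0, a_p, a_s0, w_p, w_s0):
        w_f = a_p * w_p + a_s0 * w_s0                     # Eq. (15), kernel mixing
        w_f = torch.softmax(w_f, dim=1)                   # normalize kernel weights (our choice)
        f_prime = self.fuse(torch.cat([f_p, f_s0], 1))    # F' = Conv([F_P, F_S0])
        b, c, h, w = f_prime.shape
        # Gather k x k neighbourhoods and weight them with the per-pixel kernel, Eq. (16)
        patches = F.unfold(f_prime, self.k, padding=self.k // 2)   # (B, C*k*k, H*W)
        patches = patches.view(b, c, self.k * self.k, h, w)
        return (patches * w_f.unsqueeze(1)).sum(dim=2)             # fused feature F_f
```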

3.5 Loss function

Referring to [22], we utilize the multi-scale weighted structural similarity loss (MWSSIM) to train the network. $Los{s_{MWSSIM}}$ can be formulated as follows:

$$\begin{array}{l} Los{s_{MWSSIM}}(S0,\rho ,PD,{I_f}) = 1 - \frac{1}{{\left| w \right|}}\sum\nolimits_w ( {\alpha _w}Los{s_{SSIM}}(S0,{I_f};w)\\ + {\beta _w}Los{s_{SSIM}}(\rho ,{I_f};w) + {\gamma _w}Los{s_{SSIM}}(PD,{I_f};w)) \end{array}$$
where $\alpha _w$, $\beta _w$, and $\gamma _w$ denote the weight coefficients:
$$\left\{ \begin{array}{l} {\alpha _w} = \frac{{g({\sigma ^2}({w_{S0}}))}}{{g({\sigma ^2}({w_{S0}})) + g({\sigma ^2}({w_\rho })) + g({\sigma ^2}({w_{PD}}))}}\\ {\beta _w} = \frac{{g({\sigma ^2}({w_\rho }))}}{{g({\sigma ^2}({w_{S0}})) + g({\sigma ^2}({w_\rho })) + g({\sigma ^2}({w_{PD}}))}}\\ {\gamma _w} = \frac{{g({\sigma ^2}({w_{PD}}))}}{{g({\sigma ^2}({w_{S0}})) + g({\sigma ^2}({w_\rho })) + g({\sigma ^2}({w_{PD}}))}} \end{array} \right.$$

$g(\theta ) = \max (\theta,0.0001)$ is a correction function to increase the robustness of the solution. Additionally, $Los{s_{SSIM}}(x,y;w)$ represents the local structure similarity loss between $x$ and $y$ with window $w$ and can be formulated as

$$Los{s_{SSIM}}(x,y;w) = \frac{{(2{{\bar w}_x}{{\bar w}_y} + {C_1})(2\sigma ({w_x}{w_y}) + {C_2})}}{{(\bar w_x^2 + \bar w_y^2 + {C_1})({\sigma ^2}({w_x}) + {\sigma ^2}({w_y}) + {C_2})}}$$

Here, ${\bar w_x}$ denotes the mean of image $x$ within the sliding window of size $w$ ($w \in \{ 3,5,7,9,11\}$), $\sigma ({w_x}{w_y})$ is the covariance of ${w_x}$ and ${w_y}$, ${\sigma ^2}( \cdot )$ indicates the corresponding variance, and $C_1$ and $C_2$ are constants set to 0.0001 and 0.0009, respectively. To adapt to the attention-guided filtering network proposed in this paper, we apply the filtering operation from Ref. [43], denoted $G( \cdot )$, to $\rho$ and PD in the loss function. In Eq. (20), $I$, $p$, and $q$ represent the guidance image, the filtering input, and the output, respectively, while $i$ and $j$ denote pixel indexes. ${\mu _k}$ and $\sigma _k^2$ are the mean and variance of $I$ within the window ${\omega _k}$ centered at pixel $k$, and $\epsilon$ is a regularization parameter.

$$\left\{ \begin{array}{l} {q_i} = {G_i}(I,p) = \sum\nolimits_j {{W_{ij}}} (I){p_j}\\ {W_{ij}}(I) = \frac{1}{{{{\left| \omega \right|}^2}}}\sum\limits_{k:(i,j) \in {\omega _k}} {\left( {1 + \frac{{({I_i} - {\mu _k})({I_j} - {\mu _k})}}{{\sigma _k^2 + \epsilon}}} \right)} \end{array} \right.$$
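The filtering operation $G(\cdot)$ of Eq. (20) is the guided filter of Ref. [43]; a minimal box-filter implementation is sketched below, where the window radius and the regularization $\epsilon$ are illustrative values rather than the paper's settings.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def guided_filter(I, p, radius=8, eps=1e-3):
    """Guided image filter q = G(I, p) (Eq. 20 / Ref. [43]) for 2-D float arrays."""
    size = 2 * radius + 1
    box = lambda x: uniform_filter(x, size=size, mode='reflect')   # local window mean
    mean_I, mean_p = box(I), box(p)
    var_I = box(I * I) - mean_I ** 2
    cov_Ip = box(I * p) - mean_I * mean_p
    a = cov_Ip / (var_I + eps)          # per-window linear coefficients
    b = mean_p - a * mean_I
    return box(a) * I + box(b)          # q_i = mean(a) * I_i + mean(b)
```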

Therefore, the total training loss ${L_{Total}}$ combines the MWSSIM-based intensity loss ${L_{Int}}$ and gradient loss ${L_{G}}$ with the joint gradient loss ${L_{P}}$ over $\rho$ and PD.

$${L_{Int}} = Los{s_{MWSSIM}}(S0,G(S0,\rho ),G(S0,PD),{I_f})$$
$${L_{G}} = Los{s_{MWSSIM}}(\left| {\nabla S0} \right|,\left| {\nabla G(S0,\rho )} \right|,\left| {\nabla G(S0,PD)} \right|,\left| {\nabla {I_f}} \right|)$$
$${L_{P}} = \frac{1}{{HW}}\left\| {\left| {\nabla {I_f}} \right| - \max (\left| {\nabla G(S0,\rho )} \right|,\left| {\nabla G(S0,PD)} \right|)} \right\|$$
$${L_{Total}} = {\lambda _1}{L_{Int}} + {\lambda _2}{L_{G}} + {\lambda _3}{L_{P}}$$
where $\nabla$ indicates the Sobel gradient operator. $\lambda_1 = 1$, $\lambda_2 = 0.5$, and $\lambda_3 = 0.1$ are the hyper-parameters controlling the trade-off of each sub-loss term.
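The assembly of Eqs. (21)–(24) can be sketched as follows. Here `mwssim_loss` stands in for an implementation of Eq. (17) and is assumed rather than provided, the inputs are single-channel tensors of shape $B\times1\times H\times W$, and `g_rho` and `g_pd` denote the guided-filtered terms $G(S0,\rho)$ and $G(S0,PD)$.

```python
import torch
import torch.nn.functional as F

SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
SOBEL_Y = SOBEL_X.transpose(2, 3)

def sobel_grad(x):
    """Sobel gradient magnitude |grad x| for a (B, 1, H, W) tensor."""
    gx = F.conv2d(x, SOBEL_X.to(x.device, x.dtype), padding=1)
    gy = F.conv2d(x, SOBEL_Y.to(x.device, x.dtype), padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-12)

def total_loss(s0, g_rho, g_pd, i_f, mwssim_loss, lam=(1.0, 0.5, 0.1)):
    """Total loss of Eq. (24) from the intensity, gradient, and joint terms."""
    l_int = mwssim_loss(s0, g_rho, g_pd, i_f)                              # Eq. (21)
    l_grad = mwssim_loss(sobel_grad(s0), sobel_grad(g_rho),
                         sobel_grad(g_pd), sobel_grad(i_f))                # Eq. (22)
    l_p = torch.mean(torch.abs(sobel_grad(i_f) -
                               torch.max(sobel_grad(g_rho), sobel_grad(g_pd))))  # Eq. (23)
    return lam[0] * l_int + lam[1] * l_grad + lam[2] * l_p                 # Eq. (24)
```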

4. Experiments

4.1 Implementation details

We utilize the LDDRS dataset [44], composed of long-wave infrared polarization images, for image fusion. The dataset consists of 2113 groups of $512\times 640$ images captured under various traffic conditions during day and night. For the dataset split, we randomly select 1690 pairs of infrared intensity and polarization images for training, reserve 211 images for validation, and use the remainder for testing. We set the number of iterations and the learning rate to $7k$ and $10^{-4}$, respectively. We quantitatively evaluate our method using eight metrics: the information entropy metric EN, the standard deviation metric SD, the blind image integrity notator using DCT statistics BLIINDS-II [45], the blind/referenceless image spatial quality evaluator BRISQUE [46], the human perceptual quality-based metric NIMA [47], the contrast-enhancement-based image quality metric CEIQ [48], the spatial-spectral entropy-based quality metric SSEQ [49], and the perception-based image quality evaluator PIQE [50]. These metrics measure the quality of fused images in different ways, including naturalness preservation (BLIINDS-II, BRISQUE), human perception (NIMA), contrast distortion (CEIQ, SSEQ), and perception of significant spatial regions (PIQE). Lower scores of BLIINDS-II, BRISQUE, SSEQ, and PIQE indicate better quality.

4.2 Ablation analysis

We conduct an ablation study to validate our approach. The baseline is the model without PSFF and AGF: it takes the original DoLP as input and, instead of AGF, simply concatenates the infrared intensity and polarization features for fusion. Figure 7 shows that the infrared intensity image S0 contains more detailed background information but has low contrast.

Fig. 7. Fusion results of the ablation study.

DoLP exhibits prominent object features, but its background is easily disturbed by noise. Therefore, the direct fusion of S0 and DoLP by the baseline introduces considerable noise into the fusion result, and the target appears blurred. After improving the input of the infrared polarization features with PSFF, salient target features are extracted by combining $\rho$ with PD, and the noise level of the fused image is significantly reduced. Subsequently, AGF is applied to fuse the infrared intensity and polarization features, further reducing background noise interference and enhancing the clarity of target edges. The quantitative evaluation results in Table 1 indicate that PSFF and AGF improve the fusion performance of infrared polarization images: compared with the baseline, the fusion results improve by approximately 41%, 48%, and 33% in the BRISQUE, BLIINDS-II, and SSEQ metrics, respectively. In Fig. 8, visualization of the attention maps $A_{S0}$ and $A_P$ of AKL in the guided filtering reveals that the filtering weight for the infrared intensity feature is concentrated in the background, whereas the weight distribution of the polarization features is mainly focused on salient targets and edge contours in the scene. This validates the effectiveness of the attention-guided filtering fusion method proposed in this paper.

Fig. 8. Visualization of the learned attention maps for kernel combination.


Table 1. Quantitative evaluation results with PSFF and AGF on validation data.

4.3 Comparison with the state-of-the-art

We compare our approach with other state-of-the-art methods: NSST [51], MDLatLRR [6], RFN-Nest [11], DIDFusion [16], U2Fusion [19], PFNet [17], SeAFusion [18], SwinFusion [21], and TIPFNet [22]. Figure 9 shows the qualitative comparison results of infrared polarization image fusion under noisy conditions. Neither the traditional fusion methods (NSST, MDLatLRR) nor the fusion methods based on CNN (RFN-Nest, DIDFusion, U2Fusion, PFNet, SeAFusion) and Transformer (SwinFusion, TIPFNet) can effectively suppress the noisy background interference. These methods mainly focus on enhancing the fusion networks to extract more useful information from both S0 and DoLP images but fail to analyze the difference in the distribution of backgrounds and targets in infrared polarization images. In particular, when DoLP is severely disturbed by noise, it is difficult for existing methods to distinguish the contributions of S0 and DoLP to the fusion result, so the fused image contains considerable noise. Our method refines the input for infrared polarization images based on the salient prior and builds a fusion network based on attention-guided filtering. The results in Fig. 9 demonstrate that our method can effectively suppress background noise while preserving the edge contour information of targets. Table 2 shows the quantitative comparison with the state-of-the-art; our approach achieves the best results in seven out of eight metrics. To further demonstrate that our fusion method can enhance image quality and improve the contrast of the target scene, we use YOLOv4 [52] to perform object detection on the fused infrared polarization images. As shown in Fig. 10, due to the lack of salient features, S0 yields low detection accuracy, while DoLP lacks details and is disturbed by background noise, making effective target detection difficult; in the first row, detection on DoLP even fails to find any valid target. In contrast, our approach achieves effective target detection, and the detection results show higher confidence than those of other methods.

Fig. 9. Qualitative comparison with state-of-the-art.

Fig. 10. Object detection results on the fused infrared polarization images.


Table 2. Quantitative evaluation results of different methods on testing images (the best and second-best results are in bold and underlined, respectively).

5. Conclusion

This paper investigates infrared polarization image fusion under noisy backgrounds. We improve the DoLP-based fusion input and propose an input framework based on polarization salient features. Through multi-scale fusion of the infrared polarization feature $\rho$ and the polarization distance PD, background interference is suppressed and the saliency of the target is enhanced. In addition, we construct a fusion network based on attention-guided filtering, which utilizes a cross-attention mechanism to generate filter kernels and uses the infrared intensity image as the guide image for fusion filtering. This method preserves the edge contour information of polarization-salient targets and further suppresses background interference. We validate the effectiveness of the proposed approach through quantitative and qualitative experimental results. Compared with existing image fusion methods, our method performs better in noisy scenes, preserving the salient objects of infrared polarization while suppressing noise interference. An additional object detection experiment shows that our fusion method can help improve the performance of advanced computer vision tasks.

Funding

National Natural Science Foundation of China (61771180, 62201191); Major Science and Technology Projects in Anhui Province (202203a05020023).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are available in Ref. [44].

References

1. N. Li, Y. Zhao, R. Wu, and Q. Pan, “Polarization-guided road detection network for lwir division-of-focal-plane camera,” Opt. Lett. 46(22), 5679–5682 (2021). [CrossRef]  

2. X. Li, F. Liu, P. Han, S. Zhang, and X. Shao, “Near-infrared monocular 3d computational polarization imaging of surfaces exhibiting nonuniform reflectance,” Opt. Express 29(10), 15616–15630 (2021). [CrossRef]  

3. K. Usmani, G. Krishnan, T. O’Connor, and B. Javidi, “Deep learning polarimetric three-dimensional integral imaging object recognition in adverse environmental conditions,” Opt. Express 29(8), 12215–12228 (2021). [CrossRef]  

4. J. Chen, X. Li, L. Luo, X. Mei, and J. Ma, “Infrared and visible image fusion based on target-enhanced multiscale transform decomposition,” Inf. Sci. 508, 64–78 (2020). [CrossRef]  

5. J. Ma, Z. Zhou, B. Wang, and H. Zong, “Infrared and visible image fusion based on visual saliency map and weighted least square optimization,” Infrared Phys. Technol. 82, 8–17 (2017). [CrossRef]  

6. H. Li, X.-J. Wu, and J. Kittler, “Mdlatlrr: A novel decomposition method for infrared and visible image fusion,” IEEE Trans. on Image Process. 29, 4733–4746 (2020). [CrossRef]  

7. X. Liu and L. Wang, “Infrared polarization and intensity image fusion method based on multi-decomposition latlrr,” Infrared Phys. Technol. 123, 104129 (2022). [CrossRef]  

8. Q. Zhang, Y. Liu, R. S. Blum, J. Han, and D. Tao, “Sparse representation based multi-sensor image fusion for multi-focus and multi-modality images: A review,” Inf. Fusion 40, 57–75 (2018). [CrossRef]  

9. H. Li and X.-J. Wu, “Densefuse: A fusion approach to infrared and visible images,” IEEE Trans. on Image Process. 28(5), 2614–2623 (2019). [CrossRef]  

10. H. Xu, H. Zhang, and J. Ma, “Classification saliency-based rule for visible and infrared image fusion,” IEEE Trans. Comput. Imaging 7, 824–836 (2021). [CrossRef]  

11. H. Li, X.-J. Wu, and J. Kittler, “Rfn-nest: An end-to-end residual fusion network for infrared and visible images,” Inf. Fusion 73, 72–86 (2021). [CrossRef]  

12. J. Ma, H. Xu, J. Jiang, X. Mei, and X.-P. Zhang, “Ddcgan: A dual-discriminator conditional generative adversarial network for multi-resolution image fusion,” IEEE Trans. on Image Process. 29, 4980–4995 (2020). [CrossRef]  

13. H. Xu, J. Ma, and X.-P. Zhang, “Mef-gan: Multi-exposure image fusion via generative adversarial networks,” IEEE Trans. on Image Process. 29, 7203–7216 (2020). [CrossRef]  

14. J. Li, H. Huo, C. Li, R. Wang, and Q. Feng, “Attentionfgan: Infrared and visible image fusion using attention-based generative adversarial networks,” IEEE Trans. Multimedia 23, 1383–1396 (2021). [CrossRef]  

15. J. Liu, J. Duan, Y. Hao, G. Chen, and H. Zhang, “Semantic-guided polarization image fusion method based on a dual-discriminator gan,” Opt. Express 30(24), 43601–43621 (2022). [CrossRef]  

16. P. Li, “Didfuse: deep image decomposition for infrared and visible image fusion,” in Proceedings of International Joint Conferences on Artificial Intelligence(IJCAI), (2021), p. 976.

17. J. Zhang, J. Shao, J. Chen, D. Yang, B. Liang, and R. Liang, “Pfnet: an unsupervised deep network for polarization image fusion,” Opt. Lett. 45(6), 1507–1510 (2020). [CrossRef]  

18. L. Tang, J. Yuan, and J. Ma, “Image fusion in the loop of high-level vision tasks: A semantic-aware real-time infrared and visible image fusion network,” Inf. Fusion 82, 28–42 (2022). [CrossRef]  

19. H. Xu, J. Ma, J. Jiang, X. Guo, and H. Ling, “U2fusion: A unified unsupervised image fusion network,” IEEE Trans. Pattern Anal. Mach. Intell. 44(1), 502–518 (2022). [CrossRef]  

20. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, vol. 30 (2017).

21. J. Ma, L. Tang, F. Fan, J. Huang, X. Mei, and Y. Ma, “Swinfusion: Cross-domain long-range learning for general image fusion via swin transformer,” IEEE/CAA J. Autom. Sinica 9(7), 1200–1217 (2022). [CrossRef]  

22. K. Li, M. Qi, S. Zhuang, Y. Yang, and J. Gao, “Tipfnet: a transformer-based infrared polarization image fusion network,” Opt. Lett. 47(16), 4255–4258 (2022). [CrossRef]  

23. X. Li, H. Li, Y. Lin, J. Guo, J. Yang, H. Yue, K. Li, C. Li, Z. Cheng, H. Hu, and T. Liu, “Learning-based denoising for polarimetric images,” Opt. Express 28(11), 16309–16321 (2020). [CrossRef]  

24. N. Hagen and Y. Otani, “Stokes polarimeter performance: general noise model and analysis,” Appl. Opt. 57(15), 4283–4296 (2018). [CrossRef]  

25. M. J. How and N. J. Marshall, “Polarization distance: a framework for modelling object detection by polarization vision systems,” in Proceedings of the Royal Society B: Biological Sciences, vol. 281 (2014), p. 20131632.

26. K. Li, “Noise-aware infrared polarization image fusion based on salient prior with attention-guided filtering network,” GitHub (2023), https://github.com/lkyahpu/NIPFNet.

27. Z. Fu, X. Wang, J. Xu, N. Zhou, and Y. Zhao, “Infrared and visible images fusion based on rpca and nsct,” Infrared Phys. Technol. 77, 114–123 (2016). [CrossRef]  

28. J. Ma, C. Chen, C. Li, and J. Huang, “Infrared and visible image fusion via gradient transfer and total variation minimization,” Inf. Fusion 31, 100–109 (2016). [CrossRef]  

29. J. Zhao, G. Cui, X. Gong, Y. Zang, S. Tao, and D. Wang, “Fusion of visible and infrared images using global entropy and gradient constrained regularization,” Infrared Phys. Technol. 81, 201–209 (2017). [CrossRef]  

30. S. Mo, J. Duan, W. Zhang, X. Wang, J. Liu, and X. Jiang, “Multi-angle orthogonal differential polarization characteristics and application in polarization image fusion,” Appl. Opt. 61(32), 9737–9748 (2022). [CrossRef]  

31. Y. Liu, J. Jin, Q. Wang, Y. Shen, and X. Dong, “Region level based multi-focus image fusion using quaternion wavelet and normalized cut,” Signal Processing 97, 9–30 (2014). [CrossRef]  

32. W. Kong, Y. Lei, and H. Zhao, “Adaptive fusion method of visible light and infrared images based on non-subsampled shearlet transform and fast non-negative matrix factorization,” Infrared Phys. Technol. 67, 161–172 (2014). [CrossRef]  

33. H. Li, X.-J. Wu, and J. Kittler, “Infrared and visible image fusion using a deep learning framework,” in Proceedings of International Conference on Pattern Recognition (ICPR), (2018), pp. 2705–2710.

34. K. R. Prabhakar, V. S. Srikar, and R. V. Babu, “Deepfuse: A deep unsupervised approach for exposure fusion with extreme exposure image pairs,” in Proceedings of IEEE International Conference on Computer Vision (ICCV), (2017), pp. 4724–4732.

35. J. Ma, W. Yu, P. Liang, C. Li, and J. Jiang, “Fusiongan: A generative adversarial network for infrared and visible image fusion,” Inf. Fusion 48, 11–26 (2019). [CrossRef]  

36. L. Qu, S. Liu, M. Wang, and Z. Song, “Transmef: A transformer-based multi-exposure image fusion framework using self-supervised multi-task learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, (2022), pp. 2126–2134.

37. X. Su, J. Li, and Z. Hua, “Transformer-based regression network for pansharpening remote sensing images,” IEEE Trans. Geosci. Remote Sensing 60, 1–23 (2022). [CrossRef]  

38. H. S. Clouse, H. Krim, and O. Mendoza-Schrock, “A scaled, performance driven evaluation of the layered-sensing framework utilizing polarimetric infrared imagery,” in Evolutionary and Bio-Inspired Computation: Theory and Applications V, vol. 8059 (SPIE, 2011), pp. 75–84.

39. B. Zhong, X. Wang, D. Wang, T. Yang, X. Gan, Z. Qi, and J. Gao, “Target–background contrast enhancement based on a multi-channel polarization distance model,” Bioinspir. Biomim. 16(4), 046009 (2021). [CrossRef]  

40. Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF international conference on computer vision, (2021), pp. 10012–10022.

41. H. Cao, Y. Wang, J. Chen, D. Jiang, X. Zhang, Q. Tian, and M. Wang, “Swin-unet: Unet-like pure transformer for medical image segmentation,” in Proceedings of European Conference on Computer Vision (ECCV) Workshops, (2023), pp. 205–218.

42. Y. Dai, F. Gieseke, S. Oehmcke, Y. Wu, and K. Barnard, “Attentional feature fusion,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, (2021), pp. 3560–3569.

43. K. He, J. Sun, and X. Tang, “Guided image filtering,” IEEE Trans. Pattern Anal. Mach. Intell. 35(6), 1397–1409 (2013). [CrossRef]  

44. N. Li, Y. Zhao, Q. Pan, S. G. Kong, and J. C.-W. Chan, “Full-time monocular road detection using zero-distribution prior of angle of polarization,” in Proceedings of European Conference on Computer Vision (ECCV), (2020), pp. 457–473.

45. M. A. Saad, A. C. Bovik, and C. Charrier, “Blind image quality assessment: A natural scene statistics approach in the dct domain,” IEEE Trans. on Image Process. 21(8), 3339–3352 (2012). [CrossRef]  

46. A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-reference image quality assessment in the spatial domain,” IEEE Trans. on Image Process. 21(12), 4695–4708 (2012). [CrossRef]  

47. H. Talebi and P. Milanfar, “Nima: Neural image assessment,” IEEE Trans. on Image Process. 27(8), 3998–4011 (2018). [CrossRef]  

48. Y. Fang, K. Ma, Z. Wang, W. Lin, Z. Fang, and G. Zhai, “No-reference quality assessment of contrast-distorted images based on natural scene statistics,” IEEE Signal Processing Letters 22(7), 838–842 (2014). [CrossRef]  

49. L. Liu, B. Liu, H. Huang, and A. C. Bovik, “No-reference image quality assessment based on spatial and spectral entropies,” Signal Processing: Image Commun. 29(8), 856–863 (2014). [CrossRef]  

50. N Venkatanath, D. Praneeth, M. C. Bh, S. S. Channappayya, and S. S. Medasani, “Blind image quality evaluation using perception based features,” in National Conference on Communications (NCC), (2015), pp. 1–6.

51. M. Yin, W. Liu, X. Zhao, Y. Yin, and Y. Guo, “A novel image fusion algorithm based on nonsubsampled shearlet transform,” Optik 125(10), 2274–2282 (2014). [CrossRef]  

52. A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “Yolov4: Optimal speed and accuracy of object detection,” arXiv, arXiv:2004.10934 (2020). [CrossRef]  

Supplementary Material (1)

Code 1: NIPFNet

