
ASF-Transformer: neutralizing the impact of atmospheric turbulence on optical imaging through alternating learning in the spatial and frequency domains

Open Access

Abstract

Atmospheric turbulence, a pervasive and complex physical phenomenon, challenges optical imaging across various applications. This paper presents the Alternating Spatial-Frequency (ASF)-Transformer, a learning-based method for neutralizing the impact of atmospheric turbulence on optical imaging. Drawing inspiration from split-step propagation and correlated imaging principles, we propose the Alternating Learning in Spatial and Frequency domains (LASF) mechanism. This mechanism utilizes two specially designed transformer blocks that alternate between the spatial and Fourier domains. Assisted by the proposed patch FFT loss, our model can enhance the recovery of intricate textures without the need for generative adversarial networks (GANs). Evaluated across diverse test mediums, our model demonstrated state-of-the-art performance in comparison to recent methods. The ASF-Transformer diverges from mainstream GAN-based solutions, offering a new strategy to combat image degradation introduced by atmospheric turbulence. Additionally, this work provides insights into neural network architecture by integrating principles from optical theory, paving the way for innovative neural network designs in the future.

© 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Turbulence, a complex phenomenon in classical physics, arises from the interaction between inertial and viscous forces within a flow field [1,2]. As the atmosphere is present everywhere and its boundary conditions are complex, atmospheric turbulence unavoidably affects optical imaging. Turbulence triggers anisotropy and inhomogeneity in airflow states, leading to fluctuations in the refractive index [3,4]. As the refractive index of the propagation medium is non-uniform, the direction and phase of light undergo continuous changes throughout the light propagation process. When light enters imaging equipment, the distorted wavefront induces tilts and blurring on the image plane [5]. As a result, this phenomenon negatively impacts a variety of imaging applications, such as remote sensing [6], astronomical observation [7], laser communication [8], and long-distance photography [9]. Furthermore, solving the corresponding partial differential equations is extremely challenging, owing to the complexity of atmospheric boundary conditions and the multifaceted factors that contribute to turbulence image degradation. Hence, this problem has long been a topic of research interest, receiving widespread and sustained attention.

Traditional approaches to counteracting the negative impacts of atmospheric turbulence have often utilized techniques such as lucky imaging [10], ghost imaging [11], or hardware-based adaptive optics [12]. However, these approaches encounter limitations when applied to dynamic situations and real-time applications. There are also methods based on deconvolution [13,14], but the spatial variation of the Point Spread Function (PSF) has limited their performance. Recently, deep learning-based methods have shown promise in image restoration tasks [15–17] and computational imaging [18–20] without using generative adversarial schemes. Nevertheless, mainstream deep learning methods for turbulence removal [21–23] rely on Generative Adversarial Networks (GANs). While GANs can generate fine image details, their lack of physical constraints makes it difficult to guarantee reliable results. Conversely, using Convolutional Neural Networks (CNNs) to combat atmospheric turbulence tends to produce overly smoothed images with lost detail, an issue that can be attributed to the limited receptive field of CNNs. Fortunately, our previous research has shown that each point in the Fourier domain possesses global perceptibility, and a complementary relationship exists between learning in the spatial domain and the Fourier domain [24]. Furthermore, drawing upon the diffraction theory of light and the split-step wave propagation model [25], turbulence alternately affects the spatial wavefront phase and the Fourier characteristics of the light field. Thus, to harmonize data-driven and physics-driven methodologies, it is crucial to approach the network design from an optical perspective.

In this paper, we introduce a deep-learning approach inspired by optical theory to mitigate the effects of atmospheric turbulence on optical imaging. Drawing on the split-step propagation model and correlation principle, we propose the LASF mechanism for learning alterations in both spatial and frequency domains. The LASF mechanism reflects the process of light field propagation in turbulence, and can adaptively offset the negative impact of turbulence by learning a large amount of data. The ASF-Transformer, presented in this paper, is designed based on the LASF mechanism. Furthermore, the introduction of Patch FFT loss assists in the training of the ASF-Transformer, further enhancing the quality of the reconstructed images. The capability of the ASF-Transformer is evaluated across various testing mediums. Experimental results reveal a visual quality improvement, effectively alleviating the image quality degradation caused by atmospheric turbulence, whether on algorithm-simulated data or physical-turbulence data. Through a quantitative comparative analysis with publicly available methods released in recent years, our method emerged as state-of-the-art. Our design process, guided by principles from optical theory, not only introduces a new paradigm but also gives valuable insights that can steer the structural design of future neural networks.

2. Concept and principle

Atmospheric turbulence imaging is a classic problem that has been researched for several decades. Drawing upon the research conducted by D. L. Fried [26], the modulation transfer function (MTF) of optical imaging through atmospheric turbulence can be expressed as:

$$\text{MTF}(f)=\tau_{0}(f) \exp \left\{-\frac{1}{2}\left[D(\lambda R f)-\left\langle(\mathbf{a} \cdot \lambda R \mathbf{f})^{2}\right\rangle\right]\right\},$$
where $\mathbf {a}$ represents an isotropic random vector. $\left \langle \cdot \right \rangle$ represents the time average. $\tau _{0}(f)$ is the $\text {MTF}$ of a diffraction-limited lens. $D(\lambda Rf)$ represents the refractive index structure function. $\lambda$, $R$, and $f$ correspond to wavelength, focal length, and frequency, respectively. Under long-exposure conditions, long-time averaging causes $\mathbf {a}$, which possesses isotropic randomness, to approach zero, so the term $\left \langle (\mathbf {a} \cdot \lambda R \mathbf {f})^{2}\right \rangle$ within Eq. (1) can be disregarded. This in turn leads to a decrease in $\text {MTF}(f)$, resulting in image blur, but effectively eliminates random pixel offset distortions. Conversely, during short exposures, the contribution of $\mathbf {a}$ cannot be neglected. Short exposures not only enhance $\text {MTF}(f)$, leading to a sharper image, but also make the distortion caused by random pixel offsets more noticeable. In realistic scenarios, both distortion and blurring coexist and can be represented as [22]:
$$\widetilde{\mathbf{I}}=\mathcal{B}(\mathrm{MTF}, x, y,t) \otimes(\mathcal{T}(\eta(p, T, \lambda), \theta, \ldots, x, y,t) \circ \mathbf{I})+\mathbf{N},$$
where $\widetilde {\mathbf {I}}$ represents the turbulence-degraded image, $\mathbf {I}$ stands for the ideal sharp image. $\mathcal {B}$ signifies the blur kernel that varies spatially. $\otimes$ denotes the convolution operation. $\mathcal {T}$ describes the displacement vector map of pixel positions, accounting for distortion or tilt. $\circ$ represents the pixel position resampling. The term $\mathbf {N}$ corresponds to the noise introduced by the sensor. $x$ and $y$ represent spatial coordinates. $t$ stands for time. The symbol $\eta (p, T, \lambda )$ stands for the refractive index, influenced by pressure $p$, temperature $T$, and wavelength $\lambda$. $\theta$ represents the angle of arrival. Turbulence introduces blurring and distortion to optical imaging, and our method aims to address both of these challenges.
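For intuition, the following toy sketch applies a simplified, spatially invariant version of Eq. (2) to a grayscale image: a smooth random displacement field stands in for $\mathcal {T}$, a single Gaussian kernel stands in for the spatially varying blur $\mathcal {B}$, and additive Gaussian noise stands in for $\mathbf {N}$. All function names and parameter values here are illustrative and are not the simulator used to build the datasets in Section 4.

```python
# Toy, spatially invariant stand-in for Eq. (2): tilt resampling, blur, and sensor noise.
# The kernel, displacement field, and parameter values are illustrative only.
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def degrade(image, blur_sigma=2.0, tilt_std=1.5, noise_std=0.01, seed=0):
    rng = np.random.default_rng(seed)
    h, w = image.shape
    # T(.) o I : resample pixels with a smooth random displacement field (distortion/tilt)
    dy = gaussian_filter(rng.normal(0.0, tilt_std, (h, w)), sigma=8)
    dx = gaussian_filter(rng.normal(0.0, tilt_std, (h, w)), sigma=8)
    yy, xx = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    tilted = map_coordinates(image, [yy + dy, xx + dx], order=1, mode="reflect")
    # B (x) (.) : blur with a single Gaussian kernel instead of a spatially varying PSF
    blurred = gaussian_filter(tilted, sigma=blur_sigma)
    # + N : additive sensor noise
    return blurred + rng.normal(0.0, noise_std, (h, w))
```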

Understanding the principles of turbulence-induced image degradation is essential for developing effective mitigation techniques. The degradation of images through atmospheric turbulence can be modelled using a split-step propagation approach. As shown in Fig. 1, the clear image $\mathbf {I}$ propagates through the atmosphere in discrete steps, involving the application of the Fresnel diffraction and the Kolmogorov phase screen [27]. The phase screen represents the random phase distortions caused by atmospheric turbulence. By applying the Fourier transform $\mathcal {F}(\cdot )$ to the phase screen, the phase-modulated wavefront can be obtained. The wavefront at a distance $z+\Delta z$ can then be approximated using the inverse Fourier transform of the product of the exponential phase modulation factor, $\exp (iA\Delta z)$, and the Fourier transform of the product of the incident wavefront, $E(z, x, y)$, and the phase screen, $\exp (i\phi (x, y))$. The resulting wavefront provides an approximation of the wavefront perturbation at the new distance [25]:

$$E(z+\Delta z, x, y) \approx \mathcal{F}^{{-}1}\{\exp (i A \Delta z) \cdot \mathcal{F}[\exp (i \phi(x, y)) \times E(z, x, y)]\}.$$

The propagation distance $d$ of the light is divided into multiple steps $\Delta z$, and the final wavefront state $E(z+d, x, y)$ is obtained. Atmospheric turbulence causes random perturbations in the phase and amplitude of light waves, leading to a reduction in image quality. Correlated imaging offers a more refined understanding of the specific effects of turbulence on the optical field [28]. In long-distance optical imaging, light propagation under turbulent conditions can be considered as exhibiting partial coherence. The cross-correlation function is often used to describe the interference effect between two light waves and to predict the shape and intensity of the interference pattern. The cross-correlation function between the light field at two points $P_1$ and $P_2$ in turbulence can be defined as:

$$\Gamma\left(P_{1}, P_{2} ; \tau\right) = \left\langle E\left(P_{1}, t\right) E^{*}\left(P_{2}, t + \tau\right)\right\rangle,$$
where $P_1$ and $P_2$ are the points in the field, $t$ stands for time, and $\tau$ is the time delay between the two points. $E$ represents the complex amplitude of the field, and $\langle \cdot \rangle$ denotes the time-averaging operation. The cross-correlation function in optics provides foundational inspiration for the design of self-attention neural networks. This function quantifies the correlation between different points in a light field, which mirrors the self-attention mechanism’s approach to estimating interdependencies between different input points in neural networks. Therefore, studying the mutual coherence in optics can provide valuable insights into designing innovative neural networks capable of more effectively capturing long-distance dependencies.
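To make the split-step picture of Eq. (3) concrete, the following minimal sketch propagates a complex field through a sequence of Fresnel steps and random phase screens. The Fresnel transfer function plays the role of $\exp (iA\Delta z)$, and the phase screen is only low-pass-filtered Gaussian noise, a crude stand-in for a Kolmogorov screen; all function names and parameter values are illustrative.

```python
# Minimal split-step sketch of Eq. (3): alternate phase-screen modulation and Fresnel
# propagation. The phase screen here is low-pass-filtered Gaussian noise, a crude
# stand-in for a Kolmogorov screen; parameter values are illustrative.
import numpy as np

def split_step_propagate(E, wavelength, dx, dz, n_steps, screen_strength=2.0, seed=0):
    rng = np.random.default_rng(seed)
    n = E.shape[0]
    fx = np.fft.fftfreq(n, d=dx)
    FX, FY = np.meshgrid(fx, fx)
    H = np.exp(-1j * np.pi * wavelength * dz * (FX**2 + FY**2))  # Fresnel kernel for one step dz
    lowpass = np.exp(-(FX**2 + FY**2) * (20 * dx) ** 2)          # sets the screen correlation length
    for _ in range(n_steps):
        noise = np.fft.fft2(rng.standard_normal((n, n)))
        phi = screen_strength * np.real(np.fft.ifft2(noise * lowpass))
        # Eq. (3): E(z+dz) ~ F^{-1}{ exp(iA dz) . F[ exp(i phi) . E(z) ] }
        E = np.fft.ifft2(H * np.fft.fft2(E * np.exp(1j * phi)))
    return E

# Example: a plane wave propagated over d = 10 * 100 m through ten phase screens
# field = split_step_propagate(np.ones((256, 256), dtype=complex),
#                              wavelength=633e-9, dx=5e-3, dz=100.0, n_steps=10)
```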

3. Methods

To mitigate the degradation caused by both light diffraction and phase distortions due to turbulence, we introduce a learning-based mechanism, named Alternating Learning in Spatial and Frequency domains (LASF), which performs restoration iteratively by alternating between the spatial and frequency domains. The LASF mechanism reflects the process of light field propagation in turbulence. After learning from a large amount of data, it can adaptively offset the negative impact caused by turbulence. As shown in Fig. 2, the LASF mechanism comprises two essential components: the Spatial-Aware Transformer Block (SATB) and the Frequency-Aware Transformer Block (FATB). This approach mimics the split-step wave propagation process, enabling the feature maps to employ self-attention mechanisms alternately within the spatial and Fourier domains. The self-attention mechanism, renowned for its robustness in sequence processing [29], effectively captures long-range dependencies in images. This helps the network comprehend how light sources interact and how the overall image is composed, leading to superior restoration results.
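The following structural sketch shows only the alternating arrangement of the LASF mechanism; the SATB and FATB placeholders correspond to the blocks detailed later in this section, and the class names and depth are illustrative rather than the released implementation.

```python
# Structural sketch of the LASF alternation only. satb_cls and fatb_cls stand for the
# SATB and FATB blocks described later in this section; names and depth are illustrative.
import torch.nn as nn

class LASFChain(nn.Module):
    def __init__(self, satb_cls, fatb_cls, channels, n_pairs=2):
        super().__init__()
        blocks = []
        for _ in range(n_pairs):
            blocks.append(satb_cls(channels))   # self-attention in the spatial domain
            blocks.append(fatb_cls(channels))   # self-attention in the Fourier domain
        self.body = nn.Sequential(*blocks)

    def forward(self, x):
        # x: feature map of shape (B, C, H, W); the two block types are applied alternately
        return self.body(x)
```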

Fig. 1. Split-step propagation model.

Fig. 2. Architecture of the ASF-Transformer. Red blocks denote spatial self-attention, and cyan blocks signify frequency self-attention with correlation. These two blocks are connected alternately, establishing the LASF mechanism.

The proposed ASF-Transformer architecture, as shown in Fig. 2, consists of two identically structured multi-scale modules and a refining module. Currently, many image restoration models, such as Restormer [16], FFTformer [17], and MIRNetV2 [30], adopt multi-scale structural designs. Integrating multi-scale architectures into image restoration networks allows for simultaneous capture of diverse features and promotes feature fusion across scales, thus improving restoration performance. The multi-scale structure progressively reduces the feature map size while increasing the channel count. Ultimately, feature maps from different scales are fused together. In the ASF-Transformer, each multi-scale module is composed of three parallel branches, wherein the second and third branches receive inputs from the first branch, which have been downsampled by factors of 2 and 4, respectively, and have expanded the channel dimensions by the same factors. Within each branch, the feature map size is upsampled by a factor of 2 midway and at the end, which is followed by a reduction in channel dimension by the same factor. These upsampled and compressed feature maps are subsequently fused with the corresponding feature maps from the preceding branch. Ultimately, the feature maps from all the branches are merged into the first branch. The refining block integrates the output feature maps from the multi-scale modules and conducts further restoration, ultimately producing a clear image.

The spatial variability of the PSF poses significant obstacles for blur kernel estimation, which is notably adverse to conventional iterative optimization approaches. Our method bypasses the need for blur kernel estimation, thereby sidestepping this problem. To better address the spatially varying blur, we introduce two operations referred to as patch division $P(\cdot )$ and patch merging $P^{-1}(\cdot )$, as shown in Fig. 3. These operations transform the feature map into a more manageable size and format for processing. Specifically, $P(\cdot )$ divides the feature map into patches for localized self-attention and filtering, enabling us to handle the spatially varying turbulence degradation, while $P^{-1}(\cdot )$ merges the patches back together, recreating the complete feature map. This allows feature maps to be reconstructed efficiently from smaller patches while still accounting for the local window context.
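A minimal sketch of patch division $P(\cdot )$ and patch merging $P^{-1}(\cdot )$ is given below, assuming non-overlapping square patches and feature sizes divisible by the patch size; the function names are illustrative.

```python
# Minimal sketch of patch division P(.) and patch merging P^{-1}(.), assuming
# non-overlapping square patches and H, W divisible by the patch size p.
import torch

def patch_divide(x, p):
    # (B, C, H, W) -> (B * H/p * W/p, C, p, p)
    b, c, h, w = x.shape
    x = x.reshape(b, c, h // p, p, w // p, p)
    x = x.permute(0, 2, 4, 1, 3, 5)            # (B, H/p, W/p, C, p, p)
    return x.reshape(-1, c, p, p)

def patch_merge(x, p, h, w):
    # inverse of patch_divide: (B * H/p * W/p, C, p, p) -> (B, C, H, W)
    c = x.shape[1]
    x = x.reshape(-1, h // p, w // p, c, p, p)
    x = x.permute(0, 3, 1, 4, 2, 5)            # (B, C, H/p, p, W/p, p)
    return x.reshape(-1, c, h, w)

# Round trip: patch_merge(patch_divide(feat, 8), 8, feat.shape[-2], feat.shape[-1])
# reproduces feat exactly.
```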

Fig. 3. Patch division and patch merging.

To realize self-attention in the frequency domain, the FATB block utilizes Fourier transformations and cross-correlation [28] to shift the dot product attention computation in the self-attention mechanism from spatial domain matrix multiplication to frequency domain element-wise multiplication [17]. More specifically, if the input feature is denoted as $X$, the query features $F_q$, key features $F_k$, and value features $F_v$ are initially obtained through linear transformations. Subsequently, $F_q$ and $F_k$ undergo a Fourier transform, followed by element-wise multiplication in the frequency domain:

$$A = P^{{-}1}(\mathcal{F}^{{-}1}(\mathcal{F}(P(F_q))\odot \mathcal{F}(P(F_k)))),$$
where $\mathcal {F}(\cdot )$ represents the Fourier transform, $\mathcal {F}^{-1}(\cdot )$ represents the inverse transform, $P(\cdot )$ and $P^{-1}(\cdot )$ respectively represent the operations of patch division and merging, and $\odot$ represents element-wise multiplication. The size of the patch is set to 8. $A$ is the attention map obtained through frequency domain operations. Finally, $F_v$ is weighted and fused with $A$, and the output is obtained through a residual connection:
$$X_{f}^{att} = \text{Conv}(L(A)F_v)+X ,$$
where $X_{f}^{att}$ represents the frequency domain self-attention (Freq-SA) feature map, $\text {Conv}$ represents the convolution operation, $L(\cdot )$ represents the layer normalization layer. To further bolster Fourier domain restoration capabilities, we introduce a learnable block Fourier filter, which we refer to as the Frequency Gating Mechanism (FGM). This mechanism adaptively decides what low-frequency and high-frequency information should be preserved. The implementation of FGM is as follows:
$$X_f = \mathcal{F}(P(\text{Conv}(L(X_{f}^{att})))),$$
$$X_{f}^{out} = G_{gl}(P^{{-}1}(\mathcal{F}^{{-}1}(W\odot X_f))) + X_{f}^{att},$$
where $X_f$ represents the frequency feature map, $W$ is the learned quantization matrix and $G_{gl}$ represents the GEGLU activation function [31]. Consequently, by learning the $W$ matrix to perform frequency domain weighting, the frequency information that is beneficial for the restoration of sharp images can be adaptively preserved.
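Putting Eqs. (5)–(8) together, a condensed sketch of the FATB might look as follows, reusing the patch division and merging helpers sketched above. The normalization layers, $1\times 1$ convolutions, and exact GEGLU placement are simplifying assumptions; the released code should be consulted for the precise implementation.

```python
# Condensed FATB sketch following Eqs. (5)-(8), reusing patch_divide / patch_merge from
# above. GroupNorm stands in for L(.), and the 1x1 convolutions and GEGLU placement are
# simplifying assumptions.
import torch
import torch.nn as nn

class FATBSketch(nn.Module):
    def __init__(self, channels, patch=8):
        super().__init__()
        self.patch = patch
        self.norm_in = nn.GroupNorm(1, channels)
        self.qkv = nn.Conv2d(channels, channels * 3, 1)            # produces F_q, F_k, F_v
        self.norm_a = nn.GroupNorm(1, channels)
        self.proj = nn.Conv2d(channels, channels, 1)
        self.norm_att = nn.GroupNorm(1, channels)
        self.pre_fgm = nn.Conv2d(channels, channels, 1)
        self.w = nn.Parameter(torch.ones(channels, patch, patch))  # learnable filter W of the FGM
        self.gate = nn.Conv2d(channels, channels * 2, 1)           # the two GEGLU halves

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.qkv(self.norm_in(x)).chunk(3, dim=1)
        # Eq. (5): patch-wise correlation of query and key in the frequency domain
        fq = torch.fft.fft2(patch_divide(q, self.patch))
        fk = torch.fft.fft2(patch_divide(k, self.patch))
        a = patch_merge(torch.fft.ifft2(fq * fk).real, self.patch, h, w)
        # Eq. (6): weight the value features with the attention map, then a residual
        x_att = self.proj(self.norm_a(a) * v) + x
        # Eqs. (7)-(8): frequency gating with W, followed by a GEGLU gate and a residual
        xf = torch.fft.fft2(patch_divide(self.pre_fgm(self.norm_att(x_att)), self.patch))
        xf = patch_merge(torch.fft.ifft2(self.w * xf).real, self.patch, h, w)
        val, gate = self.gate(xf).chunk(2, dim=1)
        return val * torch.nn.functional.gelu(gate) + x_att
```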

Another key component is the SATB block, which captures image information in the spatial domain using self-attention [16,32]. Specifically, given a layer-normalized feature map $L(X)$, it first generates the query ($Q^{HW\times C}$), key ($K^{HW\times C}$), and value ($V^{HW\times C}$) via depthwise separable convolutions that expand the channels by three times. The matrix $K^{HW\times C}$ is reshaped to $\hat {K}^{C\times HW }$, so that the dot product $\hat {K}Q$ produces a transposed attention map of size $\mathbb {R}^{C\times C}$, thereby avoiding the much larger computational cost of regular $\mathbb {R}^{HW\times HW}$ attention. The spatial self-attention (Spat-SA) is defined as follows:

$$X_{s}^{att} = \text{Conv}(V\cdot \text{Softmax}(\hat{K}Q/\alpha)) + X,$$
where $\alpha$ is a scaling parameter. The Softmax function normalizes the attention scores into a probability distribution. For the spatial self-attention feature map $X_{s}^{att}$, the information flow is selectively transmitted further via a spatial gating mechanism (SGM), which can be represented as:
$$X_{s}^{out} = \text{Conv}(G_{ge}(\text{Conv}(L(X_{s}^{att})))\odot \text{Conv}(L(X_{s}^{att}))) + X_{s}^{att} ,$$
where $G_{ge}$ is the GELU activation function [31], $L$ denotes layer normalization, and $\odot$ represents element-wise multiplication. Residual connections [33] are incorporated within the gating units of both FATB and SATB, facilitating smoother gradient propagation and accelerating network convergence.
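A condensed sketch of the SATB following Eqs. (9)–(10) is given below: channel-wise (transposed) self-attention followed by the spatial gating mechanism. The depthwise-separable projections and normalization choices are simplifying assumptions rather than the released implementation.

```python
# Condensed SATB sketch following Eqs. (9)-(10): transposed (channel-wise) self-attention
# plus the spatial gating mechanism. Projection and normalization choices are assumptions.
import torch
import torch.nn as nn

class SATBSketch(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.norm_in = nn.GroupNorm(1, channels)
        self.qkv = nn.Sequential(                               # depthwise-separable Q, K, V projection
            nn.Conv2d(channels, channels * 3, 1),
            nn.Conv2d(channels * 3, channels * 3, 3, padding=1, groups=channels * 3))
        self.proj = nn.Conv2d(channels, channels, 1)
        self.alpha = nn.Parameter(torch.ones(1))                # scaling parameter in Eq. (9)
        self.norm_att = nn.GroupNorm(1, channels)
        self.gate_in = nn.Conv2d(channels, channels * 2, 1)
        self.gate_out = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.qkv(self.norm_in(x)).chunk(3, dim=1)
        q, k, v = q.flatten(2), k.flatten(2), v.flatten(2)      # (B, C, HW)
        # Eq. (9): C x C transposed attention instead of HW x HW attention
        attn = torch.softmax(k @ q.transpose(1, 2) / self.alpha, dim=-1)
        x_att = self.proj((attn.transpose(1, 2) @ v).reshape(b, c, h, w)) + x
        # Eq. (10): spatial gating mechanism with a GELU-gated branch and a residual
        val, gate = self.gate_in(self.norm_att(x_att)).chunk(2, dim=1)
        return self.gate_out(val * torch.nn.functional.gelu(gate)) + x_att
```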

To fuse features across different scales, the feature maps are first upsampled to a common resolution. Subsequently, they are fused using the coordinate attention mechanism [34]. This mechanism decomposes channel-wise attention into two parallel one-dimensional feature encodings, effectively integrating spatial coordinate information. The implementation of the entire upsampling and fusion block is defined as follows:

$$X_{fuse}=X_c^{T}\operatorname{Conv}(G_{sig}(L(\operatorname{Conv}(\operatorname{P_h}(X_c) \odot\operatorname{P_w}(X_c))))) ,$$
where $\operatorname {P_h}(X)$ denotes average pooling in the feature map’s height direction, and $\operatorname {P_w}(X)$ denotes average pooling in the feature map’s width direction. $\text {Conv}$ represents the convolution, $L$ is the normalization function, $G_{sig}$ signifies the Sigmoid activation function, $(\cdot )^{T}$ represents the transpose of a matrix, and $X_c$ represents the input feature matrix and can be written as:
$$X_c=\left[\begin{array}{cc} \operatorname{Conv}(X_{1}) \\ \operatorname{Conv}(\text{Up}\left(X_{2}\right)) \end{array}\right],$$
where $\text {Up}(\cdot )$ is the upsampling block that doubles the height and width of the image while halving the channels. Conversely, the downsampling block does the exact opposite. $X_{1}$ and $X_{2}$ represent feature maps.
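A loose sketch of this upsample-and-fuse block, following Eqs. (11)–(12), is given below: features from two scales are brought to a common resolution, stacked, and re-weighted with a coordinate-attention-style map built from height- and width-wise average pooling. The channel sizes and layer choices are assumptions.

```python
# Loose sketch of the upsample-and-fuse block, Eqs. (11)-(12): stack features from two
# scales at a common resolution, then re-weight them with a coordinate-attention-style
# map built from height- and width-wise average pooling. Channel sizes are assumptions.
import torch
import torch.nn as nn

class FuseSketch(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.proj1 = nn.Conv2d(channels, channels, 1)
        self.up2 = nn.Sequential(                       # Up(.): double H, W and halve channels
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(2 * channels, channels, 1))
        self.mix = nn.Conv2d(2 * channels, 2 * channels, 1)
        self.norm = nn.GroupNorm(1, 2 * channels)
        self.out = nn.Conv2d(2 * channels, 2 * channels, 1)

    def forward(self, x1, x2):
        # Eq. (12): X_c stacks the projected fine-scale features and the upsampled coarse ones
        xc = torch.cat([self.proj1(x1), self.up2(x2)], dim=1)
        # Eq. (11): attention from the product of height- and width-wise average pooling
        ph = xc.mean(dim=3, keepdim=True)               # P_h: (B, 2C, H, 1)
        pw = xc.mean(dim=2, keepdim=True)               # P_w: (B, 2C, 1, W)
        attn = torch.sigmoid(self.norm(self.mix(ph * pw)))
        return self.out(attn) * xc
```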

The patch Fourier transform loss $\mathcal {L}_{patchfft}$ is introduced to supervise the frequency content of the image and can be expressed as follows:

$$\mathcal{L}_{patchfft}=\frac{1}{h \times w}\sum_{u=1}^{h}\sum_{v=1}^{w}|\text{Re}(\Delta F(u, v))+\text{Im}(\Delta F(u, v))|,$$
$$\Delta F(u,v)=\mathcal{F}(P(\widetilde{\mathbf{I}}(x,y)))-\mathcal{F}(P(\mathbf{I}(x,y))),$$
where $u$ and $v$ denote the spatial frequency coordinates, and $h$ and $w$ represent the height and width of the image, respectively. $\text {Re}(\cdot )$ refers to the real part, and $\text {Im}(\cdot )$ refers to the imaginary part. This newly proposed loss function ensures that the network pays attention to the frequency components of the image during the learning process, thereby leading to sharper outputs. Perceptual loss [35] and pixel loss are also included in the objective function of our network:
$$\mathcal{L}=\lambda_{1}\mathcal{L}_{pixel}+\lambda_{2}\mathcal{L}_{perceptual}+\lambda_{3}\mathcal{L}_{patchfft},$$
where $\mathcal {L}_{pixel}$ represents the mean squared error loss at the pixel level, $\mathcal {L}_{perceptual}$ denotes the perceptual loss. The pixel loss offers a straightforward way to measure the difference between two images on a pixel-by-pixel basis. The perceptual loss, constituted by the first 7 convolutional layers of the pre-trained VGG19 [36] network, enables the model to learn the feature difference between turbulence degradation and real scenes, thereby restoring a clearer image.
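A minimal sketch of the patch Fourier transform loss of Eqs. (13)–(14) and the total objective of Eq. (15) is shown below, reusing the patch division helper sketched above with the patch size of 16 and the weights reported in the next section; the perceptual term is passed in as a callable rather than implemented.

```python
# Minimal sketch of the patch FFT loss, Eqs. (13)-(14), and the total objective, Eq. (15),
# reusing patch_divide from above. The perceptual term is supplied as a callable; the
# patch size (16) and weights (1, 0.01, 0.1) follow the values reported in Section 4.
import torch

def patch_fft_loss(pred, target, patch=16):
    d_f = torch.fft.fft2(patch_divide(pred, patch)) - torch.fft.fft2(patch_divide(target, patch))
    # |Re(dF) + Im(dF)|, averaged over all frequency coordinates, patches, and channels
    return (d_f.real + d_f.imag).abs().mean()

def total_loss(pred, target, perceptual_fn, lam1=1.0, lam2=0.01, lam3=0.1):
    pixel = torch.nn.functional.mse_loss(pred, target)
    return lam1 * pixel + lam2 * perceptual_fn(pred, target) + lam3 * patch_fft_loss(pred, target)
```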

4. Experiments and results

The ASF-Transformer is implemented based on the PyTorch framework and Python 3.9. The hardware configuration includes a GeForce RTX 3090 GPU and an Intel Xeon Platinum 8369B CPU. During training, a batch size of $4$ is used and images are cropped to a size of $160\times 160$. Inference can be conducted at the desired resolution as long as the GPU has sufficient memory. The training process consists of 400k iterations, which are divided into two cycles. During the first cycle, a fixed learning rate of $3 \times 10^{-4}$ is applied for the initial 92k iterations. In the second cycle, cosine annealing is used to adjust the learning rate from $3\times 10^{-4}$ to $1\times 10^{-6}$ over 308k iterations. The AdamW optimizer is utilized, with $\beta _1=0.9$ and $\beta _2=0.999$. The patch size for the Patch FFT loss is set to 16. The values of $\lambda _{1}$, $\lambda _{2}$, and $\lambda _{3}$ are set to 1, 0.01, and 0.1, respectively. The code is built on the BasicSR framework [37], ensuring that the image quality assessment code follows widely recognized standard implementations.
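A plain-PyTorch approximation of this training schedule (AdamW with a fixed-rate first cycle followed by cosine annealing to $1\times 10^{-6}$) is sketched below; the exact BasicSR scheduler options may differ, and the function name is illustrative.

```python
# Plain-PyTorch approximation of the training schedule above: AdamW, a fixed learning
# rate of 3e-4 for the first 92k iterations, then cosine annealing to 1e-6 over the
# remaining 308k. The exact BasicSR options may differ; names here are illustrative.
import torch

def build_optimizer(model, total_iters=400_000, fixed_iters=92_000):
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.999))
    sched = torch.optim.lr_scheduler.SequentialLR(
        opt,
        schedulers=[
            torch.optim.lr_scheduler.ConstantLR(opt, factor=1.0, total_iters=fixed_iters),
            torch.optim.lr_scheduler.CosineAnnealingLR(
                opt, T_max=total_iters - fixed_iters, eta_min=1e-6),
        ],
        milestones=[fixed_iters],
    )
    return opt, sched  # call sched.step() once per training iteration
```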

We utilized an open-source dataset [22] to train and test our model, which includes real photographic data, physical-turbulence data, and algorithm-simulated data. The real photographic data, which lacks clear reference images, included frames from documentaries and shots from Brisbane, during periods when the local temperature ranged from approximately 93$^{\circ }$F to 99$^{\circ }$F. This portion of the dataset consists of a total of 81 images. The physical-turbulence data was collected by using six gas burners to artificially create turbulence, subsequently capturing the corresponding degraded and clear images. The experimental equipment for capturing turbulence-degraded images, as depicted in Fig. 4, comprises a Canon EOS 5D Mark IV camera, a 27-inch display monitor, and six gas burners to create turbulence. The camera is stationed 1.8 meters away from the display, with the closest gas burner situated 1.5 meters in front of the display. The physical-turbulence dataset includes a total of 27,417 pairs of training images, 3,428 pairs of test images, and 3,428 pairs of validation images. The algorithm-simulated dataset was procured using a heuristics-based turbulence-simulation method [38]. As shown in Fig. 5, this simulation dataset was constructed by applying PSF blurring and pixel offset tilt to clear images. Specifically, based on specific property settings for the turbulence model and imaging systems, parameters including the atmospheric refractive index structure parameter, inner scale, outer scale, aperture diameter, focal length, and imaging distances were utilized to generate a spatially variant PSF. This allowed for the simulation of spatially variant blur effects. The next step was to leverage the White Sands Missile Range data [39] to compute angle-of-arrival variances for spherical and plane waves, simulating pixel offset. During the simulation process, the influence of different focal lengths and imaging distances was also taken into account, ensuring the generalization capability of the model. The algorithm-simulated data includes a total of 5784 pairs of training images, 723 pairs of testing images, and 723 pairs of validation images.

Fig. 4. The process of obtaining physical-turbulence data.

Fig. 5. The process of generating algorithm-simulated data.

The performance of the ASF-Transformer in countering the effects of atmospheric turbulence on optical imaging has been evaluated across a diverse set of test mediums, encompassing checkerboards, resolution targets, algorithm-simulated data, physical-turbulence data, and real photographic data. The restoration results are illustrated in Fig. 6. The degraded checkerboards and resolution targets reveal spatially varying turbulence degradation. However, this did not hinder the ASF-Transformer from rectifying the distortion and blur. With the assistance of the ASF-Transformer, a noticeable enhancement in image contrast was achieved. The successful mitigation of turbulence-induced blur was apparent. Although a slight tilt remained on the checkerboards and resolution targets, the distorted edges were significantly mitigated. In algorithm-simulated data, the stripes on the zebra's body were well recovered. In physical-turbulence data, the real details of the image were substantially reclaimed. In real photographic data, ASF-restored images reveal sharper edges and effective suppression of wavy artifacts. These findings indicate an improvement in the MTF, underscoring the ASF-Transformer's capability in enhancing image quality. To further substantiate the reliability of our method, we conducted a comparative analysis with several deep learning-based image restoration algorithms reported in the last two years. This includes approaches within the turbulence removal domain, such as the GAN-based TSR-WGAN [22] and EAF-WGAN [21], along with methods within the low-level vision domain, such as the CNN-based NAFNet [15], the spatial self-attention-based (SA-based) Restormer [16], and the frequency self-attention-based (FA-based) FFTformer [17]. As illustrated in Fig. 7, the ASF-Transformer, relying on the LASF mechanism, manages to recover the intricate texture of the elephant's leg, whereas results produced by other methods appear overly smooth. The image quality was assessed using Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM). The quantitative comparison results are presented in Table 1, from which it can be discerned that our method achieves the best performance on the existing dataset, even without the utilization of a generative adversarial scheme. A noteworthy aspect of this study is the ASF-Transformer's state-of-the-art restoration results in terms of quantitative metrics. Compared to the second-best performer, the ASF-Transformer improved the PSNR by 0.64dB and the SSIM by 0.0050 on the algorithm-simulated dataset, and it increased the PSNR by 0.52dB and the SSIM by 0.0050 on the physical-turbulence dataset. We also noticed that the evaluation scores for physical-turbulence data were lower than those for algorithm-simulated datasets across all methods. This discrepancy arises from the complexities in real captured images, where multiple factors contribute to image degradation, and the ground truths used for learning might also contain some intrinsic degradation. Despite these obstacles, the ASF-Transformer still achieved measurable enhancements, reinforcing its potential as a versatile solution to the multifaceted challenges of optical imaging in various conditions.

Fig. 6. Restoration results of turbulence-degraded images by ASF-Transformer.

Fig. 7. Visual comparison of restoration results across different methods.

Table 1. Comparison of methods on algorithm-simulated and physical-turbulence data. $\uparrow$ indicates higher is better. Results for methods with $^{\ast }$ are from literature, as there are no open source implementations [21] or high replication hardware requirements [22]. Bold and underlined values indicate the best and second-best performance, respectively.

5. Ablation study

In an experiment analyzing the effect of the Patch FFT loss on the ASF-Transformer's ability to counter turbulence degradation, two setups were compared: one without the Patch FFT loss and another utilizing it. As shown in Table 2, the inclusion of the Patch FFT loss outperforms the version without it across different datasets. For the algorithm-simulated turbulence dataset, using the Patch FFT loss brought about a significant improvement in PSNR (from 34.53dB to 35.16dB) and SSIM (from 0.9612 to 0.9634). When evaluated on physical-turbulence data, the ASF-Transformer trained with the Patch FFT loss achieved better results: its PSNR was 32.18dB and SSIM was 0.9617, compared to 31.97dB and 0.9597 without the Patch FFT loss. For the visualization results, please refer to Supplement 1. These results highlight the importance of the Patch FFT loss in improving the effectiveness of the ASF-Transformer against turbulence degradation.

Table 2. Performance comparison based on the presence of the Patch FFT loss. Values before $/$ indicate PSNR, while those after indicate SSIM. Bold values indicate the best performance.

To investigate the efficacy of the multi-scale structure in addressing turbulence degradation, an ablation study on the ASF-Transformer's branch configurations was conducted. Table 3 presents the performance metrics of the ASF-Transformer as its branches are progressively incorporated. For algorithm-simulated data, the model employing solely the first branch (Branch 1) yielded a PSNR of 34.00dB and an SSIM of 0.9525. Introducing the second branch (Branch 1, 2) led to noticeable improvements, with PSNR and SSIM values rising to 34.34dB and 0.9587, respectively. The ASF-Transformer performs best when all three branches are engaged (Branch 1, 2, 3), resulting in a PSNR of 35.16dB and an SSIM of 0.9634 for algorithm-simulated data. This trend is mirrored in the physical-turbulence data. The single-branch configuration results in a PSNR of 31.48dB and an SSIM of 0.9546. With the addition of the second branch, there is a moderate enhancement to 31.93dB in PSNR and 0.9591 in SSIM. The complete model, which combines all three branches, achieves the highest performance with a PSNR of 32.18dB and an SSIM of 0.9617. These results validate the efficacy of the multi-scale structure within the ASF-Transformer. The incorporation of additional branches leads to better performance metrics, highlighting the critical role of integrating multi-scale features in mitigating turbulence degradation.

Table 3. Performance of ASF-Transformer with different branch configurations. Values before $/$ indicate PSNR, while those after indicate SSIM. Bold and underlined values indicate the best and second-best performance, respectively.

To investigate the roles of SATB and FATB in the ASF-Transformer, we performed an ablation study varying block combinations, as displayed in Table 4. When the model incorporated only the SATB blocks, it recorded a PSNR of 34.93 dB and an SSIM of 0.9596 on the algorithm-simulated data, and 31.82 dB and 0.9584 on the physical-turbulence data. In contrast, a model employing only the FATB blocks delivered a slightly superior performance, with a PSNR of 34.98 dB and an SSIM of 0.9618 for the algorithm-simulated data, and 32.17 dB and 0.9614 for the physical-turbulence data. A configuration with FATB preceding SATB registered a PSNR of 35.01 dB and an SSIM of 0.9603 for the algorithm-simulated dataset, while the physical-turbulence data saw values of 32.06 dB and 0.9606, respectively. However, our proposed arrangement, starting with SATB followed by FATB, proved to be the most effective, resulting in the highest PSNR of 35.16 dB and SSIM of 0.9634 on the algorithm-simulated data and 32.18 dB and 0.9617 on the physical-turbulence dataset. This highlights the importance of the interplay between SATB and FATB in mitigating turbulence degradation; the combination of SATB followed by FATB provides the best performance.

Table 4. Performance of different block combinations in ASF-Transformer. Values before $/$ indicate PSNR, while those after indicate SSIM. Bold and underlined values indicate the best and second-best performance, respectively.

Our experiments also confirmed that turbulence-induced image degradation can be effectively addressed without resorting to GANs. To assess the effect of a discriminator in our model, we conducted an ablation study comparing performance metrics with and without it. The discriminator employed in our experiments adopts the U-Net discriminator with spectral normalization, as utilized in Real-ESRGAN [40]. The experimental results are presented in Table 5. The model with a discriminator achieved a PSNR of 34.91dB and an SSIM of 0.9623 on the algorithm-simulated data. For the physical-turbulence data, the PSNR was measured at 32.01dB and the SSIM at 0.9603. Surprisingly, our model showed superior performance without a discriminator: on the algorithm-simulated dataset, it achieved a PSNR of 35.16dB and an SSIM of 0.9634, and on the physical-turbulence dataset, a PSNR of 32.18dB and an SSIM of 0.9617. This observation demonstrates that our ASF-Transformer can effectively improve image evaluation metrics in the presence of atmospheric turbulence without requiring adversarial training.

Table 5. Performance comparison based on the presence of a discriminator in the model. Values before $/$ indicate PSNR, while those after indicate SSIM. Bold values indicate the best performance.

6. Conclusion

This study has contributed to the ongoing efforts to address the complex challenges of atmospheric turbulence in optical imaging. In conclusion, we have introduced the ASF-Transformer model and a novel LASF mechanism. This mechanism reflects the process of light field propagation in turbulence and can offset the negative impact of turbulence by learning from a large amount of data. The Patch FFT loss has been utilized to enhance the performance of the model. The ASF-Transformer possesses the ability to recover intricate textures, eliminating the need for generative adversarial schemes. The ASF-Transformer outperformed existing turbulence removal methods in various scenarios, offering a new direction away from mainstream GAN-based solutions. Additionally, the integration of optical theory principles into the neural network design adds a valuable dimension to both the optics and artificial intelligence domains, opening doors for future exploration and refinement in neural network design.

Funding

Shanghai Artificial Intelligence Library, National R&D Program of China (2022ZD0160100); National Natural Science Foundation of China (62106183 and 62376222).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are available in Ref. [22,41].

Supplemental document

See Supplement 1 for supporting content.

References

1. Y. Yang, W. Chen, R. Verzicco, and D. Lohse, “Multiple states and transport properties of double-diffusive convection turbulence,” Proc. Natl. Acad. Sci. 117(26), 14676–14681 (2020). [CrossRef]  

2. I. Stiperski and M. Calaf, “Generalizing Monin-Obukhov similarity theory (1954) for complex atmospheric turbulence,” Phys. Rev. Lett. 130(12), 124001 (2023). [CrossRef]  

3. D. Lohse and K.-Q. Xia, “Small-scale properties of turbulent Rayleigh-Bénard convection,” Annu. Rev. Fluid Mech. 42(1), 335–364 (2010). [CrossRef]  

4. H.-D. Xi and K.-Q. Xia, “Flow mode transitions in turbulent thermal convection,” Phys. Fluids 20(5), 1 (2008). [CrossRef]  

5. K. Wang, M. Zhang, J. Tang, L. Wang, L. Hu, X. Wu, W. Li, J. Di, G. Liu, and J. Zhao, “Deep learning wavefront sensing and aberration correction in atmospheric turbulence,” PhotoniX 2(1), 8–11 (2021). [CrossRef]  

6. M. Xiang, A. Pan, Y. Zhao, X. Fan, H. Zhao, C. Li, and B. Yao, “Coherent synthetic aperture imaging for visible remote sensing via reflective Fourier ptychography,” Opt. Lett. 46(1), 29–32 (2021). [CrossRef]  

7. B. Ma, Z. Shang, Y. Hu, K. Hu, Y. Wang, X. Yang, M. C. Ashley, P. Hickson, and P. Jiang, “Night-time measurements of astronomical seeing at Dome A in Antarctica,” Nature 583(7818), 771–774 (2020). [CrossRef]  

8. Y. Ren, G. Xie, H. Huang, et al., “Adaptive-optics-based simultaneous pre-and post-turbulence compensation of multiple orbital-angular-momentum beams in a bidirectional free-space optical link,” Optica 1(6), 376–382 (2014). [CrossRef]  

9. K. Mei and V. M. Patel, “LTT-GAN: Looking through turbulence by inverting GANS,” IEEE J. Sel. Top. Signal Process. 17(3), 587–598 (2023). [CrossRef]  

10. N. M. Law, C. D. Mackay, and J. E. Baldwin, “Lucky imaging: high angular resolution imaging in the visible from the ground,” Astron. & Astrophys. 446(2), 739–745 (2006). [CrossRef]  

11. D. Shi, C. Fan, P. Zhang, H. Shen, J. Zhang, C. Qiao, and Y. Wang, “Two-wavelength ghost imaging through atmospheric turbulence,” Opt. Express 21(2), 2050–2064 (2013). [CrossRef]  

12. D. Shi, C. Fan, P. Zhang, J. Zhang, H. Shen, C. Qiao, and Y. Wang, “Adaptive optical ghost imaging through atmospheric turbulence,” Opt. Express 20(27), 27992–27998 (2012). [CrossRef]  

13. Y. Xie, W. Zhang, D. Tao, W. Hu, Y. Qu, and H. Wang, “Removing turbulence effect via hybrid total variation and deformation-guided kernel regression,” IEEE Trans. on Image Process. 25(10), 4943–4958 (2016). [CrossRef]  

14. A. J. Webb, M. C. Roggemann, and M. R. Whiteley, “Atmospheric turbulence characterization through multiframe blind deconvolution,” Appl. Opt. 60(17), 5031–5036 (2021). [CrossRef]  

15. L. Chen, X. Chu, X. Zhang, and J. Sun, “Simple baselines for image restoration,” in European Conference on Computer Vision (Springer, 2022), pp. 17–33.

16. S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, and M.-H. Yang, “Restormer: Efficient transformer for high-resolution image restoration,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, (2022), pp. 5728–5739.

17. L. Kong, J. Dong, J. Ge, M. Li, and J. Pan, “Efficient frequency domain-based transformers for high-quality image deblurring,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2023), pp. 5886–5895.

18. Z.-M. Li, S.-B. Wu, J. Gao, H. Zhou, Z.-Q. Yan, R.-J. Ren, S.-Y. Yin, and X.-M. Jin, “Fast correlated-photon imaging enhanced by deep learning,” Optica 8(3), 323–328 (2021). [CrossRef]  

19. G. Barbastathis, A. Ozcan, and G. Situ, “On the use of deep learning for computational imaging,” Optica 6(8), 921–943 (2019). [CrossRef]  

20. K. Yanny, K. Monakhova, R. W. Shuai, and L. Waller, “Deep learning for fast spatially varying deconvolution,” Optica 9(1), 96–99 (2022). [CrossRef]  

21. X. Liu, G. Li, Z. Zhao, Q. Cao, Z. Zhang, S. Yan, J. Xie, and M. Tang, “EAF-WGAN: Enhanced alignment fusion-Wasserstein generative adversarial network for turbulent image restoration,” IEEE Trans. Circuits Syst. Video Technol. 33(10), 5605–5616 (2023). [CrossRef]  

22. D. Jin, Y. Chen, Y. Lu, J. Chen, P. Wang, Z. Liu, S. Guo, and X. Bai, “Neutralizing the impact of atmospheric turbulence on complex scene imaging via deep learning,” Nat. Mach. Intell. 3(10), 876–884 (2021). [CrossRef]  

23. S. N. Rai and C. V. Jawahar, “Removing atmospheric turbulence via deep adversarial learning,” IEEE Trans. on Image Process. 31, 2633–2646 (2022). [CrossRef]  

24. Z. Zhang, H. Li, G. Lv, H. Zhou, H. Feng, Z. Xu, Q. Li, T. Jiang, and Y. Chen, “Deep learning-based image reconstruction for photonic integrated interferometric imaging,” Opt. Express 30(23), 41359–41373 (2022). [CrossRef]  

25. J. Liu, P. Wang, X. Zhang, Y. He, X. Zhou, H. Ye, Y. Li, S. Xu, S. Chen, and D. Fan, “Deep learning based atmospheric turbulence compensation for orbital angular momentum beam distortion and communication,” Opt. Express 27(12), 16671–16688 (2019). [CrossRef]  

26. D. L. Fried, “Optical resolution through a randomly inhomogeneous medium for very long and very short exposures,” J. Opt. Soc. Am. 56(10), 1372–1379 (1966). [CrossRef]  

27. R. Lane, A. Glindemann, and J. Dainty, “Simulation of a Kolmogorov phase screen,” Waves in Random Media 2(3), 209–224 (1992). [CrossRef]  

28. P. Zhang, W. Gong, X. Shen, and S. Han, “Correlated imaging through atmospheric turbulence,” Phys. Rev. A 82(3), 033817 (2010). [CrossRef]  

29. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems 30 (2017).

30. S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, M.-H. Yang, and L. Shao, “Learning enriched features for fast image restoration and enhancement,” IEEE Trans. Pattern Anal. Mach. Intell. 45(2), 1934–1948 (2022). [CrossRef]  

31. H. Fang, J.-U. Lee, N. S. Moosavi, and I. Gurevych, “Transformers with learnable activation functions,” arXiv, arXiv:2208.14111 (2022). [CrossRef]  

32. Y. Cai, J. Lin, Z. Lin, H. Wang, Y. Zhang, H. Pfister, R. Timofte, and L. Van Gool, “MST++: Multi-stage spectral-wise transformer for efficient spectral reconstruction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2022), pp. 745–755.

33. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2016), pp. 770–778.

34. Q. Hou, D. Zhou, and J. Feng, “Coordinate attention for efficient mobile network design,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, (2021), pp. 13713–13722.

35. J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, (Springer, 2016), pp. 694–711.

36. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv, arXiv:1409.1556 (2014). [CrossRef]  

37. X. Wang, L. Xie, K. Yu, K. C. Chan, C. C. Loy, and C. Dong, “BasicSR: Open source image and video restoration toolbox,” Github, 2022, https://github.com/XPixelGroup/BasicSR.

38. B. Xue, Y. Liu, L. Cui, X. Bai, X. Cao, and F. Zhou, “Video stabilization in atmosphere turbulent conditions based on the Laplacian-Riesz pyramid,” Opt. Express 24(24), 28092–28103 (2016). [CrossRef]  

39. E. Repasi and R. Weiss, “Analysis of image distortions by atmospheric turbulence and computer simulation of turbulence effects,” in Infrared Imaging Systems: Design, Analysis, Modeling, and Testing XIX, vol. 6941 (SPIE, 2008), pp. 256–268.

40. X. Wang, L. Xie, C. Dong, and Y. Shan, “Real-ESRGAN: Training real-world blind super-resolution with pure synthetic data,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2021), pp. 1905–1914.

41. Z. Zhang, B. Zhao, Y. Chen, Z. Wang, D. Wang, J. Sun, J. Zhang, Z. Xu, and X. Li, “Alternating spatial-frequency transformer,” Github, 2023, https://github.com/naturezhanghn/ASFTransformer.

Supplementary Material (1)

Supplement 1: More visual results.
