ATN-Res2Unet: an advanced deep learning network for the elimination of saturation artifacts in endoscopy optical coherence tomography

Yongfu Zhao; Ruiming Kong; Fei Ma; Sumin Qi; Cuixia Dai; Jing Meng

doi:10.1364/OE.517587

1. Introduction

Optical coherence tomography (OCT) stands out as a compelling biomedical imaging modality due to its non-invasive nature and high-resolution cross-sectional imaging capability for biological tissues [1–4]. OCT is widely applied in medical diagnostics and treatment across disciplines such as ophthalmology [5], dermatology [6], cardiology [7], and neurology [8]. Swept-source optical coherence tomography (SS-OCT) has gained prominence for its high-sensitivity and high-speed tissue imaging, which is particularly beneficial for imaging internal organs such as the gastrointestinal tract and airways [9,10]. Despite significant advancements in SS-OCT technology, speckle noise [11], motion artifacts [12], and saturation artifacts continue to hinder the imaging process. Saturation is a common issue in endoscopy SS-OCT systems because the incident angle of the light is difficult to control. It arises when SS-OCT images luminal tissues containing highly reflective structures (e.g., lubricated visceral mucosal surfaces, ducts in endoscopic probes, metallic stent struts), where the strong reflected signals exceed the detector’s maximum receiving level. Although lowering the laser energy can mitigate saturation artifacts, it also reduces image contrast and makes detailed tissue information harder to discern. Saturation artifacts therefore degrade image quality significantly: information is lost along the streaking artifacts throughout the imaging depth. Such artifacts not only diminish the accuracy of clinical diagnosis but also complicate subsequent OCT image processing and analysis.

Several hardware-based innovations have been proposed to address saturation artifacts in Fourier domain OCT (FD-OCT) [13,14]. For instance, Wu et al. developed a two-level spectral domain OCT (SD-OCT) system utilizing a dual-line charge-coupled device (CCD) to suppress artifacts [13]. The unsaturated signals detected in the second line are used to compensate for the saturated signals detected in the first line. Li et al. introduced a dual-channel detection method to correct saturation effects in SS-OCT [14]. In that study, the dual-channel data are not collected from two different angles; instead, the same interference A-line is split into two parts by a broadband power divider and then digitized by a 12-bit two-channel analog-to-digital converter (ADC). While hardware-based approaches suppress saturation artifacts effectively in experiments, they increase system complexity and cost. They also have limitations in some application scenarios. Specifically, the method in [13] cannot remove strong saturation artifacts, and it requires a dual-line CCD camera for detection, which is unavailable in SS-OCT systems that use a balanced detector; this approach is therefore not suitable for SS-OCT. The dual-channel method in [14] performs well in endoscopy SS-OCT; however, the compensation may introduce a large error because the ratio of the signals on the two channels is non-linear across the spectral domain and is only roughly estimated [15].

Compared with the expensive physical resources required by hardware-based correction approaches, software-based methods can result in reduced hardware costs. Initial attempts, such as linear interpolation, addressed streaking artifacts using adjacent A-line information [16,17]. However, this method is applicable only to sparse saturated signals, risking the loss of real sample structure information. To address this, a parabolic pseudo-spectral reconstruction method was proposed for SD-OCT [18], preserving the structure information of densely saturated A-lines. However, if the spectrum becomes significantly saturated, the parabolic curve may not suitably describe the saturated region, causing an increase in the ensuing error. Recently, a dictionary-based sparse representation approach was employed for the correction of saturation artifacts as an in-painting problem [15]. In this study, a universal dictionary was initially trained using patches extracted from carefully selected clear sample images. Following the designation of saturated A-lines as zeros, the saturated patches in spectral-domain optical coherence tomography (SD-OCT) images underwent in-painting through the application of the trained dictionary utilizing the orthogonal matching pursuit (OMP) method. This approach effectively preserves the majority of structural information within the samples and is adaptable to both sparse and densely distributed saturated A-lines. However, the in-painting of saturated signals involves a linear combination of trained basis vectors, inevitably resulting in a certain degree of image blurring in the restored images. Furthermore, in instances where the streaking band of saturation artifacts is relatively wide, the necessity for larger-sized training patches arises, consequently leading to more pronounced blurring in the reconstructed results.

In an effort to overcome the limitations associated with current hardware and software methods, this study introduces a deep learning (DL) [19] model named ATN-Res2Unet, incorporating multi-attention mechanisms, multi-scale perception, and notch filtering [20]. ATN-Res2Unet aims to eliminate saturation artifacts in OCT images, representing a pioneering application of DL to address saturation artifact issues in OCT systems. The multi-scale perception and attention mechanism enhance the detection and removal of varying-width saturation artifacts in OCT images. Notch filtering is employed to generate ground truth from weak-artifact images and is integrated into the DL model to further enhance its performance. In addition to the ATN-Res2Unet architecture itself, the contributions of this study are summarized as follows: (1) To address the challenging problem of acquiring data pairs in endoscopy OCT for network optimization, a dataset-construction method that integrates real artifacts is proposed, as shown in Fig. 1(a); (2) Two kinds of OCT images with saturation artifacts coming from two different endoscopy probes are used to verify the proposed method, demonstrating the excellent generality of ATN-Res2Unet; (3) To the best of our knowledge, this is the first time that a deep learning method has been employed to remove saturation artifacts in endoscopy OCT images; experiments on synthetic and real artifact images of mouse bile ducts and colon lumen acquired on an SS-OCT platform demonstrate the feasibility and superiority of ATN-Res2Unet for this task. In summary, leveraging the capabilities of deep learning methods allows for adaptive processing of OCT images, effectively suppressing undesired artifacts while preserving essential structural information in the signals. This approach presents a promising avenue to enhance endoscopy OCT imaging, thereby improving its accuracy and applicability in clinical diagnosis.

Fig. 1. Flowchart of eliminating saturation artifacts in OCT images based on DL. (a) Construction of training dataset. (b) Model optimization and artifacts elimination.


2. Materials and methods

The comprehensive workflow of this study is depicted in Fig. 1, segmented into three distinct phases: construction of the training dataset, model optimization, and artifacts elimination. In the initial phase, the training data pairs are formed by incorporating image patches featuring both strong and weak artifacts. Following data augmentation, these pairs are input into the proposed ATN-Res2Unet during the second phase. Subsequently, the saturation artifacts present in OCT images are eradicated through the utilization of the optimized network model. Detailed explanations of the critical components depicted in this figure are elaborated upon in the subsequent subsections. To enable readers in the optics field to better understand the proposed method, Supplement 1 Table S1 was provided to explain the common jargon used in deep learning.

2.1 Dataset construction

The saturation artifacts in OCT images do not conform to zero-mean noise characteristics, rendering the noise2noise DL model [21] unsuitable for this context. The zero-mean noise in the OCT acquisition data has been subtracted during the image reconstruction process. The saturation signals occur when the power of the reflected light occasionally exceeds the input range of the used detector or ADC, and they appear as streaking patterns originating from highly reflective areas. A detailed description of the characteristics of saturation artifacts can be found in [14] and [15]. Consequently, we must employ a supervised DL strategy reliant on ground truth that is free from saturation artifacts. However, obtaining B-scan images with and without saturation artifacts from the same cross-section in endoscopy OCT is nearly impossible owing to the instability of the externally driving catheter associated with B-scan positions and the inevitable generation of artifacts [22]. Even with reduced light intensity, artifacts persist albeit in a weakened state, leading to a simultaneous degradation in image quality.

In this study, we propose a strategy for constructing data pairs for network training, as illustrated in Fig. 1(a). This process can be divided into three steps: (1) selection of B-scans containing strong artifacts and B-scans containing weak artifacts from the acquired OCT images; (2) extraction of multi-size streaking artifacts from the strong-artifact data and random insertion into images with weak artifacts to generate the source data; (3) application of a notch filter to the B-scans with weak artifacts to obtain clean images as ground truth. The flowchart in Fig. 1(a) shows this construction, which aims to overcome the difficulty of acquiring paired data from the same cross section. In this process, the real-artifact data are extracted from strong-artifact B-scans and merged into the weak-artifact B-scans to generate the source data, ensuring that the source image contains artifacts consistent with those produced by in vivo imaging. The notch-filtered weak-artifact images are used as the target data (i.e., ground truth) because weak artifacts can be easily removed by a notch filter, thus providing high-quality label data. As a result, the constructed dataset comprises paired data with and without saturation artifacts for the same cross sections and satisfies the optimization requirements of the supervised DL models developed in this study. Subsequently, to augment the training data, various transformations, including mirroring, stretching, scaling, translation, and flipping, were applied to the constructed data pairs. Finally, the augmented training dataset was fed into the DL network to obtain the optimized model. Data augmentation is a popular and empirically validated strategy in the field of deep learning [23–25]. To demonstrate the advantages of this technique in the elimination of saturation artifacts in OCT images, we conducted related experiments and provided results in Supplement 1 Figure S1 and Table S2.
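The three construction steps can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `make_training_pair`, the `(start, width)` description of a streak, and the choice to paste streaks by overwriting columns (rather than blending them) are hypothetical and are not taken from the paper; `clean_fn` stands in for the notch filter that produces the ground truth.

```python
import numpy as np

def make_training_pair(weak, strong, artifact_cols, rng, clean_fn):
    """Build one (source, target) pair from a weak- and a strong-artifact B-scan.

    weak, strong  : 2D arrays (depth x A-lines) of the same shape
    artifact_cols : list of (start, width) streaks to cut from `strong` (assumed format)
    clean_fn      : callable that removes weak artifacts (e.g. a notch filter)
    """
    source = weak.copy()
    for start, width in artifact_cols:
        # Step 2a: cut a band of adjacent saturated A-lines from the strong-artifact image.
        streak = strong[:, start:start + width]
        # Step 2b: paste it at a random lateral position (overwriting is an assumption).
        dest = rng.integers(0, weak.shape[1] - width + 1)
        source[:, dest:dest + width] = streak
    # Step 3: the cleaned weak-artifact image serves as the ground truth.
    target = clean_fn(weak)
    return source, target

# Toy demo: an all-zero "weak" image and an all-one "strong" image.
rng = np.random.default_rng(0)
weak = np.zeros((8, 16))
strong = np.ones((8, 16))
source, target = make_training_pair(
    weak, strong, [(2, 3)], rng, clean_fn=lambda img: img.copy()
)
```

In the toy demo the source image differs from the target only in the three pasted columns, which is the property the paired dataset relies on.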

For a clearer illustration of the notch filtering process on weak-artifact images, a sample B-scan is used to demonstrate the steps, with intermediate results presented in Fig. 2. First, the weak-artifact image is transformed into the Fourier domain, and a notch filter is applied to eliminate the vertical streaking artifacts. The difference map between the original image and the filtered one shows that only vertical artifacts are removed. Based on the characteristics of the streak patterns of saturation artifacts and an analysis of a large number of images, we determined the optimal range of the notch filter: in the spectral image, it consists of two symmetric rectangular areas to the left and right of the center, each offset from the center by three pixels; each has a width of 3/8 of the image width (excluding the central area) and a height of 9 pixels. To illustrate how the range of the notch filter was determined, experiments were conducted and the results are listed in Supplement 1 Figure S2 and Table S3.
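The filtering steps described above can be sketched with NumPy's FFT routines. This is a minimal sketch, not the authors' implementation: the function names and the exact pixel indexing of the two rectangles (for example, whether the three-pixel offset includes the third pixel) are assumptions made for illustration.

```python
import numpy as np

def notch_mask(h, w, offset=3, band_height=9):
    """Binary mask: 0 inside two rectangles left/right of the spectrum center, 1 elsewhere."""
    mask = np.ones((h, w))
    cy, cx = h // 2, w // 2            # zero-frequency position after fftshift
    half = band_height // 2
    width = (3 * w) // 8               # each rectangle spans 3/8 of the image width
    rows = slice(cy - half, cy + half + 1)
    mask[rows, cx + offset: cx + offset + width] = 0.0          # right rectangle
    mask[rows, cx - offset - width + 1: cx - offset + 1] = 0.0  # mirrored left rectangle
    return mask

def notch_filter(img, mask):
    """Transform, center, mask, un-center, and transform back."""
    spectrum = np.fft.fftshift(np.fft.fft2(img))
    filtered = spectrum * mask
    return np.real(np.fft.ifft2(np.fft.ifftshift(filtered)))

# Toy image: structure that varies only with depth, plus one vertical streak.
h = w = 64
y = np.arange(h)[:, None]
img = np.cos(2 * np.pi * 10 * y / h) * np.ones((1, w))
img[:, 20] += 1.0
out = notch_filter(img, notch_mask(h, w))
diff = img - out                        # difference map: the removed components
```

A vertical streak is constant along depth, so its energy lies on the horizontal axis of the centered spectrum, which is why the notch is a thin horizontal band. In the toy image the depth-varying structure passes through unchanged while the lateral variation caused by the streak is reduced.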

Fig. 2. Illustration of notch filtering on weak-artifact images.


2.2 Network architecture

U-Net has proven to exhibit strong performance in medical image segmentation and denoising owing to its U-shaped framework encompassing both encoder and decoder components [26–30]. Leveraging U-Net as the foundation, we have introduced a novel network, named ATN-Res2Unet. The structure of ATN-Res2Unet, illustrated in Fig. 3, essentially represents a residual multi-scale network [31] that integrates multiple attention mechanisms from CBAM [32] and incorporates notch filtering. The figure, depicted using a sample image containing 512 $\times$ 512 pixels, provides explanations for each element (3D volume boxes) in the model at the bottom.

Fig. 3. Structure of ATN-Res2Unet.


The encoder is comprised of four segments: the initial shallow segment incorporates a notch filter module and max pooling, while the subsequent three segments involve 3 $\times$ 3 convolution, Res2Net, and down-sampling. In the decoder path, the architecture mirrors that of the encoder in a symmetrical manner. Skip connections, augmented with convolutional block attention modules (CBAM), establish connections between corresponding parts of the encoder and decoder to mitigate information loss resulting from max-pooling. The distinguishing features of ATN-Res2Unet stem from the incorporation of CBAM, Res2Net, and the notch filter block, capitalizing on their individual strengths in feature recognition and noise suppression.

Res2Net, presented in Fig. 4(a), enhances network performance with a large depth while incurring smaller computation and memory overheads, making it well-suited for real-world applications with resource constraints. The structure of Res2Net comprises multiple sub-networks, each learning features at a different scale. The processed feature maps are then fused through channel concatenation. Diverging from conventional residual connections, Res2Net introduces a cross-stage connection spanning different sub-networks, enabling effective information transmission and improved handling of large-scale image recognition tasks. Res2Net can be mathematically described as Eq. (1):

$$F(y) = F(x) + \sum (\mathbf{W}_{i} \otimes G_{i}(x)), \tag{1}$$

where $x$ represents the input feature map, $F(x)$ represents the identity map that retains the input features in the residual link, $G_{i}(x)$ represents the feature transformation operation of the i-th branch, $\mathbf {W}_{i}$ represents the weight of the i-th branch, and $F(y)$ represents the final output of the block. The summation ($\sum$) involves the weighted combination of the feature maps from all branches. The notch filter block is devised to aid in suppressing artifacts in the frequency domain, as illustrated in Fig. 4(b). Within this block, a 3 $\times$ 3 convolutional block is initially applied, followed by batch normalization (BN) and a ReLU layer. Subsequently, a Res2Net is employed to capture multi-scale features. The notch filtering process is subsequently applied to each feature map derived from Res2Net to mitigate artifacts. Finally, the output of this process is combined with the original data through a skip connection to preserve signal fidelity. The notch filter is mathematically described by Eq. (2)–(6):

$$S(u, v) = \mathcal{F}\{f(x, y)\}, \tag{2}$$

$$S_{\text{shift}}(u, v) = \mathcal{F}_{\text{shift}}\{S(u, v)\}, \tag{3}$$

$$H(u, v) = \begin{cases} 0, & \text{optimal notch regions} \\ 1, & \text{otherwise} \end{cases}, \tag{4}$$

$$G(u, v) = H(u, v) \cdot S_{\text{shift}}(u, v), \tag{5}$$

$$g(x, y) = \mathcal{F}^{-1}\{\mathcal{F}^{-1}_{\text{shift}}\{G(u, v)\}\}, \tag{6}$$

where $f(x, y)$ represents the intensity of the original image at the spatial coordinates $(x, y)$, $S(u, v)$ is the frequency spectrum after 2D FFT on $f(x, y)$, and $S_{\text {shift}}(u, v)$ is the centered frequency spectrum. $H(u, v)$ denotes the notch filter, designed to eliminate specific frequency components by setting the filter value to 0 within the defined optimal notch regions (otherwise 1). $G(u, v)$ is the product of the notch filter and the shifted frequency spectrum, representing the spectrum after filtering. $g(x, y)$ is the reconstructed spatial domain image after the inverse 2D FFT on $G(u, v)$, and $\mathcal {F}^{-1}_{\text {shift}}$ denotes the inverse centering operation. In the Notch Filter block, the activation function increases the nonlinearity of the entire network, causing the network to have more powerful representation and abstraction capabilities. The BN layer pulls the data distribution back to the normal distribution with a mean of 0 and a variance of 1 so that the value of the input activation function can produce a more obvious gradient during backpropagation. They both promote network convergence and prevent the problem of vanishing gradients.

Fig. 4. Illustration of Res2Net, Notch Filter Block, and CBAM Attention. (a) The structure of the Res2Net. (b) The structure of Notch Filter Block. (c) Flowchart of CBAM Attention.


CBAM integrates a channel attention module (CAM) and a spatial attention module (SAM) sequentially, as depicted in Fig. 4(c). It demonstrates robust position awareness, wide applicability, and relatively modest computational cost, and it has been validated to achieve state-of-the-art performance across various image recognition tasks. In CAM, two pooling branches, MaxPool and AvgPool, are first applied to the input feature maps. Subsequently, a shared multi-layer perceptron (MLP) is employed to learn the channel weights. The mathematical models for CAM are defined by Eqs. (7) and (8):

$$F' = M_C(F) \otimes F, \tag{7}$$

$$\begin{aligned} M_C(F) & = \sigma (\text{MLP}(\text{AvgPool}(F)) + \text{MLP}(\text{MaxPool}(F))) \\ & = \sigma (W_1(W_0(F_{\text{avg}}^c)) + W_1(W_0(F_{\text{max}}^c))), \end{aligned} \tag{8}$$

where $F \in \mathbb {R}^{C \times H \times W}$ and $F' \in \mathbb {R}^{C \times H \times W}$ represent the input and weighted feature maps, $M_C(F)$ is the channel attention function, $\sigma$ represents the Sigmoid activation function, and $W_0$ and $W_1$ are the network parameters of the MLP.
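Equations (7) and (8) can be written out as a small NumPy sketch. The weights here are random placeholders rather than trained parameters, the channel reduction to two hidden units is an arbitrary choice for the demo, and the ReLU between $W_0$ and $W_1$ follows the original CBAM formulation; it is not written out in Eq. (8), so treat it as an assumption about this network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(F, W0, W1):
    """Channel attention for F of shape (C, H, W).

    W0 has shape (C_hidden, C) and W1 has shape (C, C_hidden); both are shared
    between the average-pooled and max-pooled branches.
    """
    f_avg = F.mean(axis=(1, 2))      # F_avg^c: one value per channel
    f_max = F.max(axis=(1, 2))       # F_max^c: one value per channel

    def mlp(v):
        # Shared two-layer perceptron; the ReLU is assumed from the CBAM paper.
        return W1 @ np.maximum(W0 @ v, 0.0)

    m_c = sigmoid(mlp(f_avg) + mlp(f_max))   # M_C(F), shape (C,)
    refined = m_c[:, None, None] * F         # F' = M_C(F) (x) F, broadcast over H and W
    return refined, m_c

# Demo with placeholder weights.
rng = np.random.default_rng(0)
F = rng.standard_normal((8, 4, 4))
W0 = rng.standard_normal((2, 8))
W1 = rng.standard_normal((8, 2))
refined, m_c = channel_attention(F, W0, W1)
```

Each channel of the input is scaled by a single learned weight between 0 and 1, so the output keeps the shape of the input.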

In SAM, both maximum and average pooling operations are conducted on each pixel-position vector along the channel direction to generate two spatial attention maps. These maps are subsequently merged into a single map, which weights each feature map through element-wise multiplication. The computational process is described by Eqs. (9) and (10):

$$F^{\prime\prime} = M_S(F') \otimes F', \tag{9}$$

$$\begin{aligned} M_S(F') & = \sigma (f^{7\times 7}(\text{AvgPool}(F'); \text{MaxPool}(F'))) \\ & = \sigma (f^{7\times 7}([F_{\text{avg}}^{s}; F_{\text{max}}^{s}])), \end{aligned} \tag{10}$$

where $M_S$ represents the spatial attention function, $f^{7 \times 7}$ is a convolution with a $7\times 7$ kernel, and $F''$ is the refined feature map.
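Equations (9) and (10) can be sketched in the same way. The convolution below is a deliberately naive loop so that the example stays self-contained; the helper names, the zero padding, and the all-zero placeholder kernel are assumptions made for illustration, not details from the paper.

```python
import numpy as np

def conv2d_same(x, kernel):
    """Naive zero-padded 'same' convolution: x is (C_in, H, W), kernel is (C_in, k, k)."""
    _, h, w = x.shape
    k = kernel.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(xp[:, i:i + k, j:j + k] * kernel)
    return out

def spatial_attention(Fp, kernel):
    """Spatial attention for F' of shape (C, H, W) with a (2, 7, 7) kernel."""
    f_avg = Fp.mean(axis=0)                   # F_avg^s: average over channels
    f_max = Fp.max(axis=0)                    # F_max^s: maximum over channels
    stacked = np.stack([f_avg, f_max])        # [F_avg^s; F_max^s], shape (2, H, W)
    m_s = 1.0 / (1.0 + np.exp(-conv2d_same(stacked, kernel)))   # M_S(F')
    return m_s[None, :, :] * Fp, m_s          # F'' = M_S(F') (x) F'

# Demo: with an all-zero kernel the attention map is 0.5 everywhere.
rng = np.random.default_rng(0)
Fp = rng.standard_normal((8, 10, 10))
refined, m_s = spatial_attention(Fp, np.zeros((2, 7, 7)))
```

Unlike the channel weights, the spatial map has one weight per pixel position and is shared by all channels.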

In theory, channel attention aims to model the correlation between different channels, automatically obtain the importance of each feature channel through network learning, and finally assign different weight coefficients to each channel to enhance important features and suppress unimportant features. Spatial attention aims to improve feature representation in key regions. It generates a weight mask for each location in the feature map and weights the output by determining the dot product between them to enhance specific target areas of interest while attenuating irrelevant background areas. To demonstrate the effectiveness of CBAM, we conducted related experiments, and the results before and after CBAM are provided in Supplement 1 Figure S3.

2.3 Model optimization and Implementation

The proposed DL model was optimized and executed on a PC featuring an Intel Core i9-12900KF CPU, NVIDIA GeForce RTX 3090Ti SUPER GPU, and 32 GB RAM. All programs were implemented in Python 3.8.5 using TensorFlow 2.4, and the system operated on the Windows 11 platform. During the optimization process, adaptive moment estimation (Adam) was chosen as the optimizer. The initial learning rate for all network models was set to 0.001, with a batch size of 4, and the training process spanned 50 epochs. Because the task is a regression problem, the mean squared error (MSE) served as the loss function. MSE is the average squared difference between the predicted results and the ground truth. Notably, MSE offers advantages such as ease of differentiation, suitability for gradient-based optimization algorithms, and a strong penalty for large deviations. This choice contributes to robust model training, mitigates overfitting, and enhances prediction accuracy.
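The MSE loss and its gradient are simple enough to write out directly. The sketch below uses NumPy rather than TensorFlow so that it runs on its own; the function names and the dictionary layout are illustrative, while the hyperparameter values are the ones quoted in the text.

```python
import numpy as np

# Training settings quoted from the text (the dictionary itself is illustrative).
TRAINING_CONFIG = {"optimizer": "adam", "learning_rate": 1e-3, "batch_size": 4, "epochs": 50}

def mse(pred, target):
    """Mean squared error between a predicted image and its ground truth."""
    return float(np.mean((pred - target) ** 2))

def mse_grad(pred, target):
    """Gradient of the MSE with respect to each predicted pixel."""
    return 2.0 * (pred - target) / pred.size

target = np.zeros((2, 2))
pred = np.array([[0.2, 0.4], [0.6, 0.8]])
loss = mse(pred, target)        # (0.04 + 0.16 + 0.36 + 0.64) / 4, approximately 0.3
grad = mse_grad(pred, target)   # linear in the error, hence easy to differentiate
```

The pixel with an error of 0.8 contributes 0.64 to the sum while the pixel with an error of 0.2 contributes 0.04: a fourfold larger error is penalized sixteen times as much, which is the strong penalty for large deviations mentioned above.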

2.4 Imaging system set up

This study utilized an endoscopy SS-OCT system developed by our research group for data acquisition. The SS-OCT system employs a swept source (SL131090, Thorlabs, USA) centered at 1310 nm with a bandwidth of 100 nm (FWHM) and a sweep frequency of 100 kHz to illuminate tissues. The emitted beam from the swept source is divided into two paths using a coupler (90:10) and directed into two circulators (CIR-1310-50-APC, Thorlabs, USA). One path is directed to the reference arm, while the other enters the sample arm. In the reference arm, the light passes through a collimator, lens, and mirror to ensure suitable light propagation. In the sample arm, the OCT beam is guided into a custom-built motor drive unit to enable three-dimensional scanning. A ball-lens-based endoscopy probe was specifically designed for optical scanning. The design process involves the fusion of a single-mode fiber with a non-core fiber (NCF) (MM-125-FA FUD3582, Nufern, USA) using a fusion splicer (FSM-100P, Fujikura). A detailed illustration of this SS-OCT system is presented in Fig. 5.

Fig. 5. Schematic of endoscopy SS-OCT system.


2.5 Imaging protocol

To assess the effectiveness of our proposed ATN-Res2Unet in eliminating saturation artifacts, we conducted an ex vivo imaging experiment on the bile duct. In this experiment, the bile duct was meticulously dissected and placed flat on a surface following a midline incision. To minimize potential interference from additional optical elements, we removed the fiber optic slip ring and connected the probe directly to the OCT system. Subsequently, the imaging probe was mounted on a motorized translation stage, emitting a vertically downward beam focused on the inner surface of the flattened bile duct. Scanning was performed at a constant speed. Following the ex vivo experiment, in vivo experiments were performed on the colon lumen of mice to demonstrate the efficacy of our method. For the in vivo experiment, 8-week-old BALB/c mice were chosen for imaging. Initially, the mice were placed in a closed plexiglass chamber, and a mixture of 5% isoflurane in oxygen was introduced to induce general anesthesia, ensuring the mice remained immobilized throughout the procedure. Continuous administration of anesthesia through a plastic pipe was maintained until the mice were fully anesthetized. To prevent interference from fecal matter and gas in the rectum during imaging, the rectum was meticulously flushed with a saline solution using a rectal irrigator. During the experiments, a total of 1000 cross-sectional OCT images were acquired through a 3 cm longitudinal scan. All animal experiments were conducted in accordance with the protocol approved by the Animal Research Committee of the Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences.

3. Experimental results

In this study, three imaging experiments were conducted using two distinct imaging probes integrated into the endoscopic SS-OCT system introduced in Section 2.4. These probes are referred to as probe-1 and probe-2 in this work. Probe-1 was used to image the ex vivo bile duct and the in vivo colon lumen of mice, while probe-2 was used only to acquire data from the in vivo colon lumen of mice. Because uniformity is difficult to ensure when manufacturing small endoscopic probes, the saturation artifacts in images from the two probes differ notably. In images obtained from probe-2, saturation artifacts are relatively weak and manifest as continuous, thin, and regular lines. In the data acquired from probe-1, the artifacts are more pronounced, characterized by wider streaking bands (i.e., more adjacent saturated A-lines), and they exhibit a discontinuous and irregular pattern. This enables a comprehensive evaluation of the performance of our proposed method in addressing saturation artifacts with varying features. The data acquisition process is detailed in Section 2.5, and the experimental results are presented in the subsequent subsections.

3.1 Ex vivo imaging using probe-1

We initiated ex vivo imaging experiments on the bile duct of mice using probe-1 to ascertain the feasibility and effectiveness of the proposed ATN-Res2Unet. Denoising experiments were conducted on both synthetic and real-artifact data. To assess our proposed method, we conducted comparative experiments involving several techniques, including notch filtering, dictionary learning, and various DL networks.

The results for synthetic-artifact data are presented in Fig. 6. Figure 6(A) depicts a synthetic-artifact image with pronounced saturation artifacts and wide streaking patterns. The denoised results of the B-scan using traditional and DL methods are shown in Fig. 6(C)–(H). The notch filter, designed to eliminate noise or interference signals at specific frequencies, effectively removed most saturated A-lines but left certain artifacts visible, as indicated by the red arrow in Fig. 6(C). Notably, when artifacts in the image manifest as high-frequency continuous signals, the notch filter’s performance is suboptimal. In the case of dictionary learning (Fig. 6(D)), the outcome appears smooth and artifact-free, but significant loss of details is observed, attributed to the wide artifact bands in the image necessitating a larger patch size in dictionary learning, resulting in substantial image blurring. Figure 6(E)–(H) display the denoising results based on U-Net, Res2Unet, Res2Unet with an attention mechanism (AT-Res2Unet), and our proposed ATN-Res2Unet. Overall, DL methods outperformed traditional approaches significantly. While U-Net left residual artifact traces in the image, other DL frameworks, including Res2Unet, AT-Res2Unet, and ATN-Res2Unet, effectively eliminated saturated artifacts while retaining the primary structural information of the image.

Fig. 6. Artifact-eliminating experiments on synthetic-artifact data from ex vivo OCT imaging on bile duct by probe-1. (A) B-scan with synthetic artifacts; (B) Ground truth; (C) Notch filter result; (D) Dictionary learning result; (E)–(H) Results from DL models of U-Net, Res2Unet, AT-Res2Unet, and the proposed ATN-Res2Unet.


The denoising results for real-artifact data obtained through the endoscopic SS-OCT system with probe-1 are presented in Fig. 7. Figure 7(A) displays an original OCT B-scan from the bile duct, while Fig. 7(B) shows the region of interest (ROI) containing saturation artifacts. Figures 7(C)–(H) show the denoising outcomes of the ROI using different methods. Figure 7(I) includes a bar graph comparing the contrast-to-noise ratio (CNR) and signal-to-noise ratio (SNR) for each denoising method. Areas indicated by the red arrows in these figures mark unsatisfactory denoising results.

Fig. 7. Artifact-eliminating experiments on real-artifact data from ex vivo OCT imaging on bile duct by probe-1. (A) Original noisy OCT B-scan; (B) ROI in A containing artifacts; (C) Notch filtering result; (D) Dictionary learning result; (E)–(H) Results of DL models: U-Net, Res2Unet, AT-Res2Unet, and the proposed ATN-Res2Unet, respectively; (I) Bar graph displaying CNR and SNR metrics for different methods.


In Fig. 7(C), notch filtering effectively eliminates artifacts with weak amplitudes but struggles with strong noise exhibiting horizontal structures. The denoising effect of dictionary learning in Fig. 7(D) mirrors its performance on the synthetic-artifact image, resulting in a blurred image. Deep learning methods, represented by U-Net in Fig. 7(E), demonstrated superior performance, with only slight residues of artifacts. Notably, real artifacts resemble the signals more closely than synthetic artifacts do, suggesting a discrepancy between the training data and real-artifact images. Nevertheless, our network model has demonstrated robust generalization capabilities, further attesting to the viability of our proposed paired-data construction approach.

Figures 6 and 7 provide visual comparisons of the denoising techniques on both synthetic and real data. In this experiment, the notch filter has a width of 3/8 of the image width, a height of 9 pixels, and a 3-pixel central offset on each side; these values were determined by the criterion introduced in Supplement 1 Fig. S2. To reduce the training cost, we train the network on small ROIs; at prediction time, the network is applied to small ROIs for synthetic data and directly to the entire image for in vivo data. Experiments demonstrate the excellent performance of this strategy. To evaluate other training and prediction strategies based on either a small ROI or the entire image, related experiments were conducted, and the results are listed in Supplement 1 Figure S4 and Table S4.

Subsequently, we conducted a quantitative analysis of these methods. Using 36 sets of synthetic test data, we computed the structural similarity (SSIM), peak signal-to-noise ratio (PSNR), mean squared error (MSE), CNR, and SNR; the specific values are given in Table 1. The wide artifacts cause the dictionary learning results to be exceedingly blurred, leading to high CNR and SNR. Compared with the traditional notch filter, the five indicators for DL methods improved by 5.2–7.6%, 25.6–45.1%, 23.2–29.0%, 35.2–76.9%, and 35.2–71.6%, respectively. This highlights the precision of our model in reconstructing images. In the absence of a ground truth for the real-artifact data, our evaluation relies solely on the CNR and SNR metrics, the values of which are displayed in the bar graphs in Fig. 7(I). Taken together, these data show that deep learning methods distinctly outperform both notch filtering and dictionary learning. While dictionary learning excels in terms of CNR and SNR, it does so at the cost of compromising image structural information. Among all deep learning models, our proposed ATN-Res2Unet performs best across all evaluation metrics. This quantitative analysis further corroborates the superior performance of our ATN-Res2Unet model.
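For readers less familiar with these metrics, PSNR and a relative-improvement percentage can be computed as in the sketch below. The PSNR formula is the standard one; the function names, the unit data range, and the numbers in the demo are made up for illustration and are not values from Table 1. How the reported percentages treat MSE, where lower is better, is not stated in the text, so `relative_change` only covers the higher-is-better case.

```python
import numpy as np

def psnr(pred, target, data_range=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, data_range]."""
    err = np.mean((pred - target) ** 2)
    return float(10.0 * np.log10(data_range ** 2 / err))

def relative_change(new, baseline):
    """Percentage change of a higher-is-better metric relative to a baseline."""
    return 100.0 * (new - baseline) / baseline

# Demo with made-up data: a uniform error of 0.1 gives an MSE of 0.01, i.e. 20 dB.
target = np.zeros((4, 4))
pred = np.full((4, 4), 0.1)
score = psnr(pred, target)
gain = relative_change(26.0, 20.0)   # hypothetical 26 dB versus a 20 dB baseline
```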

Table 1. Quantitative comparisons among different methods for ex vivo synthetic data based on probe-1.^a


3.2 In vivo imaging using probe-1

In Section 3.1, we demonstrated the significant advantages of deep learning methods over other techniques, relying solely on ex vivo data from the bile duct. However, this is not sufficient to comprehensively compare the capabilities of the various methods. We therefore conducted in vivo imaging of the mouse colon using the SS-OCT endoscopy system with probe-1. The in vivo images make the denoising task considerably more complex.

Figure 8 presents comparative results of real-artifact removal using notch filtering and deep learning methods. Due to the substantial width of artifacts in the dataset, the image denoised by dictionary learning appeared significantly blurred, analogous to the results in Fig. 7(D), and is not shown in this figure. Figure 8(A) displays an original B-scan image with strong artifacts radiating from the center to the entire imaging area along the axial direction, exhibiting characteristics similar to the ex vivo data, as both were acquired by the same probe. The denoising outcomes from different methods are illustrated in Fig. 8(B)–(F). To show the details more precisely, we selected four regions of interest (ROIs) named ROI-1, ROI-2, ROI-3, and ROI-4, and magnified them for display. Observing the images in the second row, the notch filter has unique benefits in eliminating stripe saturation artifacts immersed in dense signals, as shown in ROI-3; however, it is not satisfactory for other regions, such as ROI-4. For DL methods, the simple U-Net can suppress most artifacts in all regions but leads to dramatic signal loss, as illustrated by rectangular or elliptical boxes in the ROIs in Fig. 8(C). With the incorporation of the multi-scale Res2Block and CBAM attention module, the ability to remove artifacts is enhanced, and more structural information is preserved simultaneously (see ROI-1 and ROI-4 in Fig. 8(D) and 8(E)). However, the convolution kernel in the network is essentially a filter, which generally blurs the image during denoising, particularly with a large convolution template, as observed in the regions indicated by arrows in ROI-3 in Fig. 8(D) and 8(E). Moreover, some artifacts remain, as indicated by rectangular boxes in ROI-2. In contrast, when the notch filtering block is integrated into the DL model, image blurring is effectively reduced, and subtle saturation artifacts are successfully eliminated while detailed information is preserved, as shown in Fig. 8(F) and its ROIs. Subsequent quantitative analysis of CNR and SNR, listed in Table 2, also confirms the superior performance of our proposed ATN-Res2Unet over other methods.

Fig. 8. Artifact-eliminating experiments on real-artifact data from in vivo OCT imaging on mouse colon by probe-1. (A) Original noisy image; (B) Result from notch filter; (C)–(F) Results from DL models of U-Net, Res2UNet, AT-Res2Unet, and the proposed ATN-Res2Unet, respectively; (1)–(4) corresponding enlarged ROIs indicated by boxes in (A)–(F).

Table 2. Quantitative comparisons among different methods for in vivo real data based on probe-1.^a

3.3 In vivo imaging using probe-2

The fabrication of mini probes is difficult and complex in endoscopic SS-OCT, resulting in probes with different imaging characteristics even if produced using the same principles and process. Recognizing this fact, we conducted additional in vivo imaging experiments on the mouse colon using probe-2 to verify the generality of our proposed DL models. Figure 9(A) displays one B-scan acquired by probe-2, in which the saturation artifacts exhibit different features: narrower width and weaker signals compared with those from probe-1. Figures 9(B)–(G) illustrate the denoised results using notch filtering, dictionary learning, and DL models. To better observe the real structure of the colon, we performed a polar coordinate transformation on the original signal matrix to obtain polar coordinate images, as shown in Fig. 9(a)–(g). In this dataset, the notch filter is less effective than in the previous dataset from probe-1, with pronounced artifact lines still visible in the filtered image (Fig. 9(B)). In contrast, the dictionary learning method handles such narrow artifacts considerably better than in the imaging experiments based on probe-1 (Fig. 9(C)), owing to the small patch size used in dictionary learning (8 $\times$ 8), which is sufficient for the narrow artifact bands. However, the overall image resolution noticeably decreases. Due to the simplicity of the noise signals, all the deep learning models achieve satisfactory results on this dataset, demonstrating excellent performance in both artifact removal and image resolution preservation. Detailed information can be observed in the enlarged sub-figures of three selected ROIs. For this dataset, the denoising results of the synthesized artifact data are shown in Fig. 10. The denoising performance is consistent with the visual effect on the actual data. Specifically, as shown in Fig. 10(C), the magnified region highlighted in red clearly reveals that the notch filter method still retains noise. In the case of dictionary learning, Fig. 10(D) shows dramatic blurriness in the area indicated by the red arrow.
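The polar coordinate transformation mentioned above can be sketched as follows. This is an illustrative version only, not the authors' code: it assumes that the raw B-scan matrix stores depth along rows and A-line angle along columns, and it uses nearest-neighbor lookup, since the interpolation scheme is not described here. The function name `to_circular_view` is hypothetical.

```python
import numpy as np

def to_circular_view(bscan, out_size=None):
    """Map a (depth x angle) B-scan onto a circular cross-sectional image.

    Assumptions: rows index depth (radius), columns index A-lines covering
    one full rotation; nearest-neighbor sampling is used for simplicity.
    """
    n_depth, n_alines = bscan.shape
    size = out_size or 2 * n_depth
    center = (size - 1) / 2.0
    y, x = np.mgrid[0:size, 0:size]
    dx, dy = x - center, y - center
    # Scale pixel distance so that the image half-width spans the full depth.
    radius = np.hypot(dx, dy) * (n_depth / (size / 2.0))
    theta = np.mod(np.arctan2(dy, dx), 2 * np.pi)
    r_idx = np.floor(radius).astype(int)
    a_idx = np.floor(theta / (2 * np.pi) * n_alines).astype(int) % n_alines
    out = np.zeros((size, size), dtype=bscan.dtype)
    inside = r_idx < n_depth  # pixels outside the scan radius stay zero
    out[inside] = bscan[r_idx[inside], a_idx[inside]]
    return out

# Usage: a toy scan whose value equals its depth index.
toy_scan = np.tile(np.arange(8)[:, None], (1, 16))
circular = to_circular_view(toy_scan)
```

In this view, an axial streak in the raw matrix appears as a radial line, which matches the description of artifacts radiating from the center.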

Fig. 9. Artifact-eliminating experiments on real-artifact data from in vivo OCT imaging on mouse colon by probe-2. (A) Original noisy image; (B) Result from notch filter; (C) Result from dictionary learning; (D)–(G) Results from DL models of U-Net, Res2UNet, AT-Res2Unet, and proposed ATN-Res2Unet, respectively; (a)–(g) corresponding polar-coordinate images; (1)–(3) corresponding enlarged ROIs indicated by boxes in (A)–(G). (H) Bar graph displaying CNR and SNR metrics for different methods.

Fig. 10. Artifact-eliminating experiments on synthetic-artifact data from in vivo OCT imaging on mouse colon by probe-2. (A) Synthetic noisy image; (B) Ground truth; (C) Result from notch filter; (D) Result from dictionary learning; (E)–(H) Results from DL models of U-Net, Res2UNet, AT-Res2Unet, and proposed ATN-Res2Unet, respectively.

To evaluate the performance of the various methods quantitatively, we computed the same metrics as in the previous experiments; the values are shown in Fig. 9(H) and listed in Table 3. Similar conclusions can be drawn: the DL methods are superior to the other methods, and our proposed ATN-Res2Unet achieves the best result on all quantitative indicators. Compared with the notch filter and dictionary learning, the statistical indicators for our method (SSIM, PSNR, MSE, CNR, and SNR) improved by 6.3–19.5%, 31.9–63.5%, 9.3–70.3%, 56.0–59.1%, and 9.4–14.9%, respectively.

Table 3. Quantitative comparisons among different methods for in vivo imaging with probe-2.^a

3.4 Convergence analysis

In the preceding sections, we demonstrated the advantages of DL models in removing saturation artifacts through visual results and quantitative analysis on three datasets (one ex vivo and two in vivo). Here, we discuss the experiments conducted to evaluate the convergence of the DL networks. Their loss curves on the three datasets are depicted in Fig. 11. Overall, the loss curves of all DL models converge rapidly during training, with the loss value decreasing to below 0.02 within 10 epochs. This suggests that the DL networks can be trained quickly. In particular, the designed ATN-Res2Unet network achieves the lowest loss on the two in vivo datasets, demonstrating its superiority in in vivo imaging.

Fig. 11. Loss curves of DL networks during the training process.

3.5 Time efficiency analysis

In this section, we compare the time efficiency of the different methods, as shown in Table 4. The size of the entire image is 512 $\times$ 512 pixels in the time analysis. For deep learning methods, the training time is calculated based on small ROI (512 $\times$ 512 pixels) statistics, and the prediction time is based on the entire image. This ensures a fair comparison with the other methods, because the notch filter is applied to the entire image and the dictionary is trained on small patches. The proposed network model demonstrates excellent denoising capabilities owing to the incorporation of the Notch Filter block and CBAM module; however, both the training time and the prediction time increase compared with the other models. Nevertheless, prediction with the proposed method remains fast, achieving approximately 3.4 fps on the platform (Section 2.3) used in our experiments. As a result, the proposed method can achieve a fast image display on relatively low-end devices and is expected to achieve a real-time image display on high-performance computing platforms.
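The frame rate quoted above follows directly from the per-image prediction time in Table 4 (0.296 s per image gives roughly 3.4 fps). The sketch below shows that arithmetic together with a generic way of measuring the average time per call; the helper names are hypothetical, and the timing routine is a plain wall-clock measurement, not the authors' benchmarking procedure.

```python
import time

def frames_per_second(seconds_per_frame):
    """Convert a per-frame processing time into a frame rate."""
    return 1.0 / seconds_per_frame

def time_per_call(fn, arg, repeats=10):
    """Average wall-clock time of fn(arg) over several repeats."""
    fn(arg)  # warm-up call, excluded from the measurement
    start = time.perf_counter()
    for _ in range(repeats):
        fn(arg)
    return (time.perf_counter() - start) / repeats

# Usage: frame rate implied by the proposed model's testing time in Table 4.
proposed_fps = frames_per_second(0.296)
# Usage: time any callable, here a trivial stand-in for a model's predict step.
elapsed = time_per_call(lambda image: image, None)
```

For GPU inference, a measurement of this kind would additionally need device synchronization before reading the clock.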

Table 4. Analysis of time efficiency among different methods.

4. Discussion and conclusion

In this study, the ATN-Res2Unet network has been validated across three datasets, demonstrating its ability to accurately detect and rectify saturation artifacts in endoscopic OCT images while preserving essential structural information. Our method surpasses traditional techniques such as notch filtering and dictionary learning, as evidenced by improved quantitative metrics. This development is expected to enhance the clinical utility of endoscopic OCT in diagnosing luminal diseases, such as cardiovascular and intestinal diseases, by allowing clearer observation of minute tissue structures and thereby improving diagnostic accuracy and the specificity of treatment plans. To verify the performance of the proposed DL model on tissues with branch structures and more irregular cross sections, we conducted imaging experiments on bovine heart vessel branches and the cancerous colon of a mouse; the experimental results are presented in Supplement 1 Fig. S5 and Fig. S6, respectively. To verify the performance of the proposed DL model on a multi-layered structure, we conducted an artifact-removal experiment on the OCT whole-eye image of a mouse, and the results are provided in Supplement 1 Fig. S7. Despite these advancements, room for improvement remains. For example, optimizing hyperparameters, such as channel counts and kernel sizes, could further refine denoising capabilities. Moreover, the network's generalization ability in complex structures is currently limited by the dataset size; hence, expanding the training datasets is an important next step. Additionally, the network architecture could benefit from integrating more advanced self-attention mechanisms and a streamlined encoder–decoder to reduce resource demands.

During our experiments, a notch filter was applied to address axial saturation artifacts. While effective, it occasionally filters out legitimate signals that overlap with the noise frequencies. This is illustrated in Fig. 12, where small signals oriented in the vertical direction might be inadvertently diminished. However, such instances are infrequent in practical SS-OCT imaging. By incorporating the filter within the deep learning model, our method mitigates such flaws by combining the benefits of the filter with convolutional processing of the original image data.

Fig. 12. Illustration of small flaws in notch filtering.

Due to the challenges in directly obtaining paired endoscopic OCT images for network optimization, we manually constructed training data pairs by inserting artifacts into high-quality OCT images, and utilized a notch filter on them to generate labels. Although the optimized model has been shown to remove real artifacts effectively, manually constructed training data pairs cannot fully reflect the characteristics of real data pairs. In the future, we plan to explore self-supervised or semi-supervised learning methods to overcome this limitation. This transition will allow for the automatic extraction of useful features from unlabeled data, reducing reliance on manually annotated data.
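To make the idea of inserting artifacts into clean images concrete, the following is a purely hypothetical sketch. The artifact model used here (additive column-wise streaks with a Gaussian lateral profile, clipped at the saturation level of a [0, 1] image) is an assumption made for illustration and is not the construction procedure used for the dataset in this paper; the function name and all parameters are likewise invented.

```python
import numpy as np

def insert_streaks(clean, n_streaks=3, max_width=9, strength=0.8, seed=0):
    """Add synthetic axial streaks to a clean image normalized to [0, 1].

    Hypothetical artifact model: each streak is constant along the rows
    (depth) and has a Gaussian profile across the columns (A-lines).
    """
    rng = np.random.default_rng(seed)
    noisy = np.asarray(clean, dtype=float).copy()
    n_cols = noisy.shape[1]
    cols = np.arange(n_cols)
    for _ in range(n_streaks):
        center = rng.integers(0, n_cols)        # streak position (column)
        width = rng.integers(1, max_width + 1)  # lateral spread in pixels
        profile = strength * np.exp(-0.5 * ((cols - center) / width) ** 2)
        noisy += profile[None, :]               # same profile at every depth
    return np.clip(noisy, 0.0, 1.0)             # emulate detector saturation

# Usage: build one (noisy, clean) training pair from a blank test image.
clean_image = np.zeros((32, 64))
noisy_image = insert_streaks(clean_image)
```

A pair built this way shares the limitation noted above: the synthetic streaks only approximate the statistics of real saturation artifacts.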

In conclusion, this study opens a new avenue for harnessing deep learning to optimize OCT imaging technology. With continuous refinement and practical application, this method can significantly augment the clinical value of OCT.

Funding

National Natural Science Foundation of China (61675134, 62175156); Natural Science Foundation of Shandong Province (ZR2020MF105); Science and Technology Innovation Project of Shanghai Science and Technology Commission (19441905800, 22S31903000); Guangdong Provincial Key Laboratory of Biomedical Optical Technology (2020B121201010); Innovation Capacity Improvement Project for Technology-based Small/Medium-sized Enterprises of Shandong Province (2023TSGC0095).

Disclosures

The authors declare no conflicts of interest.

Data availability

The dataset constructed in this paper is publicly available at [33].

Supplemental document

See Supplement 1 for supporting content.

References

1. C. Lal, S. Alexandrov, S. Rani, et al., “Nanosensitive optical coherence tomography to assess wound healing within the cornea,” Biomed. Opt. Express 11(7), 3407–3422 (2020).

2. M. Göb, T. Pfeiffer, W. Draxinger, et al., “Continuous spectral zooming for in vivo live 4d-oct with mhz a-scan rates and long coherence,” Biomed. Opt. Express 13(2), 713–727 (2022).

3. D. Vasquez, F. Knorr, F. Hoffmann, et al., “Multimodal scanning microscope combining optical coherence tomography, raman spectroscopy and fluorescence lifetime microscopy for mesoscale label-free imaging of tissue,” Anal. Chem. 93(33), 11479–11487 (2021).

4. J. Qiu, J. Meng, Z. Liu, et al., “Fast simulation and design of the fiber probe with a fiber-based pupil filter for optical coherence tomography using the eigenmode expansion approach,” Opt. Express 29(2), 2172–2183 (2021).

5. D. WuDunn, H. L. Takusagawa, A. J. Sit, et al., “Oct angiography for the diagnosis of glaucoma: A report by the american academy of ophthalmology,” Ophthalmology 128(8), 1222–1235 (2021).

6. Y. Li, R. S. Murthy, Y. Zhu, et al., “1.7-micron optical coherence tomography angiography for characterization of skin lesions–a feasibility study,” IEEE Trans. Med. Imaging 40(9), 2507–2512 (2021).

7. S. Kimura, “Clinical significance of intracoronary optical coherence tomography examination in the fields of onco-cardiology/cardio-oncology,” Int. J. Cardiol. 335, 139–140 (2021).

8. M. Gende, V. Mallen, J. de Moura, et al., “Automatic segmentation of retinal layers in multiple neurodegenerative disorder scenarios,” IEEE J. Biomed. Health Inform. 27(11), 5483–5494 (2023).

9. J. Zhang, T. Nguyen, B. Potsaid, et al., “Multi-mhz mems-vcsel swept-source optical coherence tomography for endoscopic structural and angiographic imaging with miniaturized brushless motor probes,” Biomed. Opt. Express 12(4), 2384–2403 (2021).

10. L. Qi, K. Zheng, X. Li, et al., “Automatic three-dimensional segmentation of endoscopic airway oct images,” Biomed. Opt. Express 10(2), 642–656 (2019).

11. G. Ni, Y. Chen, R. Wu, et al., “Sm-net oct: a deep-learning-based speckle-modulating optical coherence tomography,” Opt. Express 29(16), 25511–25523 (2021).

12. J. Hossbach, L. Husvogt, M. F. Kraus, et al., “Deep oct angiography image generation for motion artifact suppression,” in Bildverarbeitung für die Medizin 2020: Algorithmen–Systeme–Anwendungen. Proceedings des Workshops vom 15. bis 17. März 2020 in Berlin, (Springer, 2020), pp. 248–253.

13. C.-T. Wu, M.-T. Tsai, and C.-K. Lee, “Two-level optical coherence tomography scheme for suppressing spectral saturation artifacts,” Sensors 14(8), 13548–13555 (2014).

14. X. Li, S. Liang, and J. Zhang, “Correction of saturation effects in endoscopic swept-source optical coherence tomography based on dual-channel detection,” J. Biomed. Opt. 23(03), 030502 (2018).

15. H. Liu, S. Cao, Y. Ling, et al., “Inpainting for saturation artifacts in optical coherence tomography using dictionary-based sparse representation,” IEEE Photonics J. 13, 1–10 (2021).

16. Y. Huang and J. U. Kang, “Real-time reference a-line subtraction and saturation artifact removal using graphics processing unit for high-frame-rate fourier-domain optical coherence tomography video imaging,” Opt. Eng. 51(7), 073203 (2012).

17. J.-h. Kim, J.-H. Han, and J. Jeong, “Adaptive optimization of reference intensity for optical coherence imaging using galvanometric mirror tilting method,” Opt. Commun. 351, 57–62 (2015).

18. C.-K. Lee, M.-T. Tsai, and C.-T. Wu, “A pseudo-spectrum reconstruction method for reducing saturation artifact in spectral-domain optical coherence tomography,” in Biophotonics: Photonic Solutions for Better Health Care IV, vol. 9129 (SPIE, 2014), pp. 52–57.

19. Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature 521(7553), 436–444 (2015).

20. I. Aizenberg and C. Butakoff, “A windowed gaussian notch filter for quasi-periodic noise removal,” Image and Vision Computing 26(10), 1347–1353 (2008).

21. J. Lehtinen, J. Munkberg, J. Hasselgren, et al., “Noise2noise: Learning image restoration without clean data,” arXiv, arXiv:1803.04189 (2018).

22. D. Huang, E. A. Swanson, C. P. Lin, et al., “Optical coherence tomography,” Science 254(5035), 1178–1181 (1991).

23. S. Yang, W. Xiao, M. Zhang, et al., “Image data augmentation for deep learning: A survey,” arXiv, arXiv:2204.08610 (2022).

24. M. Elgendi, M. U. Nasir, Q. Tang, et al., “The effectiveness of image augmentation in deep learning networks for detecting covid-19: A geometric transformation perspective,” Front. Med. 8, 629134 (2021).

25. B. Barile, A. Marzullo, C. Stamile, et al., “Data augmentation using generative adversarial neural networks on brain structural connectivity in multiple sclerosis,” Computer methods and programs in biomedicine 206, 106113 (2021).

26. O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, (Springer, 2015), pp. 234–241.

27. G. Chen, L. Li, J. Zhang, et al., “Rethinking the unpretentious u-net for medical ultrasound image segmentation,” Pattern Recognition p. 109728 (2023).

28. K. Sun, Y. Chen, Y. Chao, et al., “A retinal vessel segmentation method based improved u-net model,” Biomedical Signal Processing and Control 82, 104574 (2023).

29. C.-H. Chuang, K.-Y. Chang, C.-S. Huang, et al., “Ic-u-net: a u-net-based denoising autoencoder using mixtures of independent components for automatic eeg artifact removal,” NeuroImage 263, 119586 (2022).

30. J. Zhang, Y. Niu, Z. Shangguan, et al., “A novel denoising method for ct images based on u-net and multi-attention,” Comput. Biol. Med. 152, 106387 (2023).

31. S.-H. Gao, M.-M. Cheng, K. Zhao, et al., “Res2net: A new multi-scale backbone architecture,” IEEE Trans. Pattern Anal. Mach. Intell. 43(2), 652–662 (2019).

32. S. Woo, J. Park, J.-Y. Lee, et al., “Cbam: Convolutional block attention module,” in Proceedings of the European conference on computer vision (ECCV), (2018), pp. 3–19.

33. Y. Zhao, “SS-OCT-SA,” GitHub (2024), https://github.com/yongfuzhao/SS-OCT-SA.

Table 1
Method	SSIM	PSNR	MSE	CNR	SNR
Ground Truth	1.0	/	0.0	4.8568	9.7136
Synthetic Noisy Image	0.8850	20.88	0.0082	2.1848	4.3697
Notch Filter	0.9046	25.65	0.0027	3.2973	6.5946
Dictionary Learning	0.6889	26.01	0.0025	6.0726	12.1453
U-Net	0.9513	31.00	0.00080	4.0659	8.1318
Res2UNet	0.9637	34.63	0.00039	4.7264	9.4529
AT-Res2UNet	0.9708	34.80	0.00034	4.8594	9.7189
ATN-Res2UNet	0.9724	35.08	0.00032	4.8679	9.7226

Table 2
Metric	Original Noisy Image	Notch Filter	U-Net	Res2UNet	AT-Res2UNet	ATN-Res2UNet
CNR	0.9532	5.967	2.7416	4.4894	4.4267	6.1224
SNR	0.3453	6.1621	3.1672	5.3438	8.8353	9.0459

Table 3
Method	SSIM	PSNR	MSE	CNR	SNR
Ground Truth	1.0	/	0.0	3.3999	6.7999
Synthetic Noisy Image	0.7947	25.34	0.0029	0.5653	1.1306
Notch Filter	0.9319	34.70	0.00033	1.4384	2.8768
Dictionary Learning	0.8274	26.70	0.0021	3.4370	4.4385
U-Net	0.9758	38.16	0.00015	2.6426	4.8075
Res2UNet	0.9606	38.61	0.00013	2.5064	5.0129
AT-Res2UNet	0.9578	37.68	0.00017	2.5110	5.3692
ATN-Res2UNet	0.9820	42.19	0.00006	3.4415	6.8829

Table 4
Time	Notch Filter	Dictionary Learning	U-Net	Res2UNet	AT-Res2UNet	Proposed
Training time (s)	-	829 $\pm$ 3	5032 $\pm$ 3	13672 $\pm$ 3	14960 $\pm$ 5	30490 $\pm$ 5
Testing time (s)	0 $\pm$ 0.01	5074 $\pm$ 1	0.078 $\pm$ 0.01	0.255 $\pm$ 0.01	0.267 $\pm$ 0.01	0.296 $\pm$ 0.01

Optics Express