
Fourier single pixel imaging reconstruction method based on the U-net and attention mechanism at a low sampling rate


Abstract

Fourier single-pixel imaging (FSI) faces an inherent trade-off between imaging efficiency and imaging quality. Although deep learning approaches have eased this trade-off to some extent, the reconstruction quality at low sampling rates still falls short of practical requirements. To address this problem, inspired by the idea of super-resolution, this paper proposes a parallel fusion of the U-net and an attention mechanism to improve the quality of FSI reconstruction at a low sampling rate. A generative adversarial network structure is built to recover high-resolution target images from low-resolution FSI reconstructions acquired at low sampling rates. Compared with conventional FSI and other deep-learning-based FSI methods, the proposed method yields better quality and higher-resolution results at low sampling rates in both simulation and experiment. This approach is particularly relevant to high-speed Fourier single-pixel imaging applications.

© 2022 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

In recent years, research on single-pixel imaging has mainly focused on efficient, high-quality image reconstruction methods [1]. Single-pixel imaging based on spatial light modulation modulates the spatial distribution of light on the object and collects the resulting optical signal with a single-pixel detector. The image of the object is recovered by computation on the recorded spatial information. At present, single-pixel imaging based on spatial light modulation has been applied to visible-light imaging [2], infrared imaging [3], terahertz imaging [4-6], image encryption [7], object tracking [8], 3D imaging [9,10] and underwater imaging [11].

Single-pixel imaging can use random patterns as spatial light modulation patterns, illuminating or detecting the light field with them. The object image is then reconstructed through intensity modulation and correlation calculation between the random patterns and the corresponding single-pixel measurements. However, single-pixel imaging with random modulation patterns requires far more measurements than the number of image pixels to obtain high-quality results, which leads to long data acquisition times [12]. To reduce the acquisition time, single-pixel imaging based on compressed sensing was proposed [13,14]. This technique also uses random modulation patterns, and its core is the use of compressed-sensing algorithms for image reconstruction. Although this method reduces the data collection time, the reconstruction time increases greatly. In recent years, some compressed-sensing methods have shortened the reconstruction time to some extent, but it remains long compared with deterministic-dictionary methods (such as FSI), which use an inverse transform to reconstruct the target [15,16]. Furthermore, in single-pixel imaging systems without a deterministic dictionary, the spatial distribution of the random speckles generated by a rotating ground glass must be measured with an array detector and correlated with the single-pixel measurements to reconstruct the target image. The hardware complexity of such non-deterministic single-pixel imaging systems is therefore high, which hinders system miniaturization.

In addition to random patterns, single-pixel imaging can also use deterministic illumination patterns. The problem of long reconstruction times is well solved by single-pixel imaging based on deterministic patterns, such as Hadamard single-pixel imaging (HSI) [17], wavelet-transform single-pixel imaging [18] and FSI [19-21]. Although the computational complexity of deterministic dictionaries is low, it is difficult to reconstruct high-quality results at extremely low sampling rates (<10%), as with compressed sensing. Because of these limitations, many studies have sought to achieve single-pixel reconstruction at extremely low sampling rates with the help of deep learning.

FSI has been proved to be a basis-scan single-pixel imaging technique with high imaging quality and high imaging efficiency [22]. Although image reconstruction for FSI is only a simple two-dimensional inverse Fourier transform and thus very fast, a great number of measurements are needed to obtain high-quality target images. FSI based on the four-step phase-shifting algorithm [19] clearly reconstructs a 256×256 pixel image from 131072 (2× the number of image pixels) measurements. To reduce the number of measurements, three-step FSI [20] and two-step FSI [21] were proposed, which need 98304 (1.5×) and 65536 (1×) measurements, respectively, to reconstruct a 256×256 pixel image. However, two-step phase-shifting FSI is difficult to realize because of interference from ambient light, so three-step FSI is the more appropriate choice. The imaging time of Fourier single-pixel imaging comprises the data acquisition time and the image reconstruction time. To further shorten it, under-sampling is often adopted during acquisition. Due to the resulting loss of high-frequency information, the reconstruction always shows an obvious ringing effect. A strong contradiction therefore remains between the number of measurements and the imaging quality.

Recently, deep learning (DL) has been widely used to solve problems in optics [23-25]. To reduce the influence of the number of measurements on FSI imaging quality, DL-based FSI schemes have been proposed [26-29]. These methods improve the reconstructed image quality under FSI under-sampling conditions. However, some real-time imaging applications require high-resolution results, and a high-resolution image obtained with the same experimental device contains more noise than a low-resolution one, which hinders subsequent algorithmic reconstruction. No existing approach both improves the quality of the reconstruction and achieves super-resolution. In this paper, a super-resolution reconstruction method based on FSI (SR-FSI) at an extremely low sampling rate is proposed. The imaging system first acquires a low-resolution result, to which a super-resolution reconstruction algorithm is then applied. This method not only eliminates the ringing effect of under-sampled FSI, but also increases the resolution of the generated image. The network adopts a generative adversarial network (GAN) [30] structure based on the U-net [31] and an attention mechanism. To improve the reconstruction quality of FSI at a low sampling rate and obtain high resolution simultaneously, a parallel structure is adopted. Simulation and experimental results show that the proposed method achieves much higher reconstruction quality than conventional FSI and other deep learning methods for the same number of sampling points. The proposed SR-FSI method is therefore well suited to reconstructing high-resolution images of target scenes under under-sampling conditions in practical imaging.

In general, the contributions of this work are threefold:

  • (1) The idea of using super-resolution as an under-sampled FSI reconstruction method is proposed, obtaining high-quality, high-resolution results directly from low-resolution FSI results.
  • (2) The network adopts a generative adversarial structure that combines the U-net and an attention mechanism in parallel. It simultaneously achieves super-resolution and removes the ringing effect caused by under-sampling, providing guidance for the design of FSI reconstruction algorithms.
  • (3) Simulations and experiments prove that the proposed method effectively improves the reconstruction quality of FSI at low sampling rates, which is of great significance for the practical application of FSI.

2. Methods

2.1 Imaging scheme of FSI

In the FSI system, phase shifting is adopted for spatial frequency spectrum acquisition, and the image is reconstructed via the inverse Fourier transform (IFT). In this paper, a three-step phase-shifting method is used in the imaging system; it requires 1.5 times as many measurements as there are pixels in the image [20]. The spatial distribution of the pre-generated Fourier basis modulation patterns is denoted as:

$$P_\varphi(x, y; f_x, f_y) = a + b \cdot \cos(2\pi f_x x + 2\pi f_y y + \varphi),$$
where (x, y) are the 2D Cartesian coordinates of the illumination patterns, (fx, fy) is the spatial frequency, a is the average intensity that determines the average brightness of the illumination patterns, b is the amplitude of the modulation pattern and φ is the initial phase. The laser that illuminates the spatial light modulator has a uniform intensity, denoted E0. Therefore, the speckle distribution at the optical antenna is:
$$E(x, y) = P_\varphi(x, y; f_x, f_y)\, E_0.$$

The modulated laser illuminates the target. The laser light reflected from the target is then collected by the receiving antenna and focused onto the single-pixel detector. Denoting the reflectivity distribution of the target as R(x, y), the measurement of the single-pixel detector is:

$$D_\varphi(f_x, f_y) = D_n + k \iint R(x, y)\, P_\varphi(x, y; f_x, f_y)\, E_0 \, dx\, dy,$$
where k is the intensity modulation coefficient and Dn is the detector response caused by ambient light. In the FSI system, the spatial frequency spectrum of the target is obtained by the three-step phase-shifting approach. Each spatial frequency (fx, fy) corresponds to three speckle patterns with the same (fx, fy) but different phases; the phase shift between two adjacent patterns is a constant 2π/3. With the phase φ equal to 0, 2π/3 and 4π/3, the single-pixel measurements are denoted D0, D2π/3 and D4π/3, respectively. These three measurements are used to estimate the Fourier spectrum value of the target at the spatial frequency (fx, fy), which can be expressed as:
$$T(f_x, f_y) = \left[ 2D_0(f_x, f_y) - D_{2\pi/3}(f_x, f_y) - D_{4\pi/3}(f_x, f_y) \right] + \sqrt{3}\, j \left[ D_{2\pi/3}(f_x, f_y) - D_{4\pi/3}(f_x, f_y) \right].$$

Here, T(fx, fy) is the Fourier spatial frequency spectrum of the reconstructed image at (fx, fy) and j is the imaginary unit. Then, as Eq. (5) shows, the intensity image of the target scene $\tilde{R}(x,y)$ is reconstructed by applying the IFT to the spectrum, where n is noise.

$$\tilde{R}(x, y) = F^{-1}\left( T(f_x, f_y) + n \right).$$
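For concreteness, the forward model of Eqs. (1)-(5) can be simulated in a few lines of numpy; the function and variable names below are ours, not the paper's, and the pattern constants a = b = 0.5 are illustrative assumptions.

```python
import numpy as np

def fourier_pattern(n, fx, fy, phi, a=0.5, b=0.5):
    """Eq. (1): one Fourier basis pattern at integer frequency (fx, fy)."""
    y, x = np.mgrid[0:n, 0:n]
    return a + b * np.cos(2 * np.pi * (fx * x + fy * y) / n + phi)

def simulate_fsi(R, freqs, k=1.0, E0=1.0, Dn=0.0):
    """Three-step phase shifting (Eqs. (3)-(4)) over the frequency list `freqs`,
    followed by the inverse FFT of Eq. (5)."""
    n = R.shape[0]
    T = np.zeros((n, n), dtype=complex)      # under-sampled spectrum estimate
    for fx, fy in freqs:
        D = [Dn + np.sum(k * R * fourier_pattern(n, fx, fy, phi) * E0)
             for phi in (0.0, 2 * np.pi / 3, 4 * np.pi / 3)]
        t = (2 * D[0] - D[1] - D[2]) + np.sqrt(3) * 1j * (D[1] - D[2])  # Eq. (4)
        T[fy % n, fx % n] = t
        T[(-fy) % n, (-fx) % n] = np.conj(t)  # conjugate symmetry of real images
    return np.real(np.fft.ifft2(T))           # Eq. (5), up to a constant scale
```

The Dn and average-brightness terms cancel in Eq. (4), so wherever the spectrum is fully sampled the recovered image equals the target up to the constant factor 3kb.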

To reduce the number of measurements, under-sampling is often used. Since the spectral energy of a target scene concentrates in the low-frequency region, only selected low-frequency values of the spatial frequency spectrum are collected in the FSI system. A sampling mask determines which spectral positions are measured in the spatial frequency domain. The pixel values of the sampling mask are either 1 or 0: a value of 1 indicates that the spectrum at this position is measured, and 0 that it is not. The fewer pixels with value 1 the mask contains, the lower the sampling rate of the FSI system. In this work, circular sampling masks with sampling rates ranging from 0.5% to 5% are employed.
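A circular mask of this kind is easy to construct; in the sketch below, `rate` is simply the fraction of marked positions on the LR spectral grid, a simplification of the β defined later in Eq. (11), which normalizes by the HR measurement budget.

```python
import numpy as np

def circle_mask(n, rate):
    """Binary circular mask over the centred n x n spectrum; 1 = measured."""
    y, x = np.mgrid[0:n, 0:n]
    r2 = (x - n // 2) ** 2 + (y - n // 2) ** 2
    # radius chosen so the disc covers roughly `rate` of all spectral positions
    thresh = np.sort(r2.ravel())[max(int(rate * n * n) - 1, 0)]
    return (r2 <= thresh).astype(np.uint8)

mask = circle_mask(128, 0.03)   # ~3% of the 128x128 spectral positions marked
```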

Due to the lack of high-frequency information at low sampling rates, the reconstructed under-sampled target image $\tilde{R}(x,y)$ shows an obvious ringing effect. To reconstruct high-quality, high-resolution images at low sampling rates, a super-resolution Fourier single-pixel imaging (SR-FSI) network model is designed. The proposed SR-FSI model performs super-resolution reconstruction of low-resolution FSI results (e.g., resolution 128×128) to obtain reconstructions of higher resolution (e.g., resolution 256×256). At the same time, the network model effectively removes the influence of the ringing effect on the quality of the high-resolution reconstruction, realizing FSI reconstruction with both high resolution and high quality.

2.2 Network structure

The architecture of the proposed SR-FSI network consists of a generator network G and a discriminator network D. The purpose of G is to reconstruct high-quality, high-resolution images, while D is used to distinguish reconstructed images from real ones. The structure of G is shown in Fig. 1. To improve the reconstruction quality of FSI at a low sampling rate and obtain high resolution simultaneously, a parallel structure is adopted. The upper branch in Fig. 1 is designed to increase the resolution of the generated image; the lower branch improves its quality and removes the ringing effect. The input of the generator G is a low-resolution (LR) image of size 128×128, and its output is a high-resolution (HR) image of size 256×256.

Fig. 1. The proposed generator structure of SR-FSI.

The lower branch in Fig. 1 has a U-net structure consisting of an encoder and a decoder. The encoder compresses the image features, while the decoder recovers them. The encoder uses three convolutional layers to down-sample the feature maps; each convolutional layer is followed by a Dense block [32], shown in green in Fig. 1. The decoder recovers the size of the feature maps with deconvolutional layers, each also followed by a Dense block. A 1×1 convolutional layer adaptively controls the output of each Dense block. Encoder and decoder feature maps of the same size are concatenated through short skip connections to achieve a better reconstruction. In addition, to prevent the loss of image information as the network deepens, the features of the first convolution and the output of the U-net are fused through a long skip connection. The parameters of the convolutional and deconvolutional layers are given in Fig. 1, where n denotes the number of convolution kernels, k the kernel size and s the stride. Batch normalization (BN) layers [33] are used to accelerate convergence and to mitigate vanishing gradients and overfitting. A compact sketch of this branch is given below.
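In the sketch, the channel widths are our assumptions (Fig. 1's exact values are not reproduced here), the long skip connection is omitted for brevity, and dense_block is the helper sketched after Fig. 2 below.

```python
import tensorflow as tf
from tensorflow.keras import layers

def unet_branch(x):
    """Encoder/decoder of the lower branch; channel widths are assumptions."""
    skips, h = [], x
    for ch in (64, 128, 256):                       # encoder: strided conv + Dense block
        h = layers.Conv2D(ch, 3, strides=2, padding='same')(h)
        h = dense_block(h, ch)                      # sketched after Fig. 2 below
        skips.append(h)
    for ch, skip in zip((128, 64), skips[-2::-1]):  # decoder: deconv + Dense block
        h = layers.Conv2DTranspose(ch, 3, strides=2, padding='same')(h)
        h = dense_block(h, ch)
        h = layers.Concatenate()([h, skip])         # short skip connection
        h = layers.Conv2D(ch, 1, padding='same')(h) # 1x1 conv to control width
    h = layers.Conv2DTranspose(64, 3, strides=2, padding='same')(h)
    return dense_block(h, 64)
```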

The Dense block is applied in the network structure to facilitate the flow of information and reduce vanishing gradients in deeper networks. The structure of the Dense block is shown in Fig. 2. Its input is a feature map X0. A Dense block contains L layers, each a composite function HL(·), where L indexes the layer. HL(·) comprises batch normalization (BN), leaky rectified linear units (LReLU) and convolution (Conv). The Lth layer receives the feature maps of all preceding layers, X0, …, XL-1, as input. After the composite function is applied, the output of the Lth layer is denoted XL:

$$X_L = H_L([X_0, X_1, \ldots, X_{L-1}]).$$
[X0, X1, …, XL-1] denotes a concatenation layer that concatenates the feature maps produced by all preceding layers. In this paper, the Dense block contains two layers; a minimal sketch is given after Fig. 2 below.

Fig. 2. The structure of Dense block.
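A minimal tf.keras sketch of this two-layer Dense block; the growth rate and the 3×3 kernel size are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def dense_block(x, out_channels, growth=32, n_layers=2):
    """Two-layer Dense block of Fig. 2: each layer applies BN -> LReLU -> Conv
    (H_L in Eq. (6)) to the concatenation of all preceding feature maps."""
    feats = [x]
    for _ in range(n_layers):
        h = feats[0] if len(feats) == 1 else layers.Concatenate()(feats)
        h = layers.BatchNormalization()(h)
        h = layers.LeakyReLU()(h)
        h = layers.Conv2D(growth, 3, padding='same')(h)
        feats.append(h)
    out = layers.Concatenate()(feats)
    # 1x1 convolution, mentioned in the text, to control the block's output width
    return layers.Conv2D(out_channels, 1, padding='same')(out)
```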

As attention mechanisms have achieved good results in the imaging field [34-36], an attention mechanism is applied to the upper branch. The upper branch in Fig. 1 consists of an initial feature extraction sub-network and a feature transformation sub-network. The initial feature extraction sub-network extracts features from the LR input via a convolutional layer. The feature transformation sub-network contains 5 convolutional layers and 3 channel-wise and spatial attention residual (CSAR) blocks, designed to capture more informative features for the HR output. LR and HR images have similar content; their main difference lies in the high-frequency information. To obtain HR images, more high-frequency information must be extracted from the LR features. The LR features generated by a deep network carry different types of information across channels and spatial regions, which contribute differently to the recovery of high-frequency detail. Considering this, CSAR units are used to increase the sensitivity of the network to high-frequency features and make it focus on learning the more important ones, which benefits the recovery of the information needed for the HR images.

The structure of the CSAR unit is illustrated in Fig. 3. Inspired by the attention mechanism, and considering that the information within and across feature maps contributes differently to image SR, channel-wise and spatial attention are combined into the residual blocks to adaptively modulate feature representations in global and local ways and capture more important information. The role of channel attention (CA) is to select important features in a global manner and to suppress redundant ones. The input of CA is a feature map of size H×W×C. Each feature channel is processed by global average pooling over the spatial dimensions H×W, yielding a channel-wise summary statistic z of size 1×1×C. To distribute different attention to different types of feature maps, a sigmoid activation function is applied to the summary statistic. In the first convolutional layer, a scaling ratio r, a hyper-parameter set to 16, changes the number of channels (1×1×rC); the second convolution restores the original number of channels (1×1×C). Finally, the output over the channels (H×W×C) is recalibrated by a channel-wise product. A sketch of this branch is given after Fig. 3 below.

Fig. 3. The diagram of channel-wise and spatial attention residual (CSAR) block, where ${\oplus}$ represents element-wise add and ${\otimes}$ denotes element-wise product.
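A minimal tf.keras sketch of the CA branch, with the sigmoid gate placed after the second 1×1 convolution, as in standard channel attention, and the channel scaling by r = 16 taken literally from the text.

```python
import tensorflow as tf
from tensorflow.keras import layers

def channel_attention(x, channels, r=16):
    """CA branch: global average pooling, two 1x1 convolutions (the first
    scaling the channel count by r), a sigmoid gate, channel-wise product."""
    z = layers.GlobalAveragePooling2D()(x)                   # H x W x C -> C
    z = layers.Reshape((1, 1, channels))(z)                  # 1 x 1 x C statistic
    z = layers.Conv2D(r * channels, 1, activation='relu')(z)      # 1 x 1 x rC
    z = layers.Conv2D(channels, 1, activation='sigmoid')(z)       # 1 x 1 x C weights
    return layers.Multiply()([x, z])                         # recalibrated output
```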

The information contained in the input and feature maps also varies with spatial location. To recover high-frequency details in LR images, the network must be made discriminative for different local regions and give more attention to the regions that are more important and harder to reconstruct. On this basis, spatial attention (SA) is applied to enhance the representational power of the network. SA and CA share the same input features (H×W×C). The spatial attention mask (H×W×1), which focuses the network on the more important features, is generated by a two-layer convolutional neural network followed by a ReLU function and a sigmoid function, respectively. The first convolution produces attention maps (H×W×µC) and the second convolution reduces them to a single attention map (H×W×1), where the scaling ratio µ in the convolutional layer facilitates the dimensional change; in the experiment, µ is set to 2. Each spatial location of a feature map is multiplied element-wise by its corresponding spatial attention weight to obtain the attended feature map. In the SA unit the features are modulated locally, which works in conjunction with the global channel-wise modulation to improve the generative capability of the network. To improve information flow and achieve better imaging performance, the CSAR block combines the channel attention unit and the spatial attention unit within a residual block. By stacking multiple CSAR blocks, the network can focus on multi-level features in both the channel-wise and spatial senses, thus obtaining more multi-level important information. The outputs of the two branch networks are fused, the LR image features are up-scaled by sub-pixel convolution, and finally the HR image is obtained by one convolutional layer.
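Continuing the sketch, a plausible reading of the SA branch and of Fig. 3's block follows; µ = 2 is from the text, while the 1×1 kernel sizes, the concatenate-then-1×1 fusion of the two attention outputs and the residual add are our assumptions where the description leaves details open. The final sub-pixel up-scaling can be realized with tf.nn.depth_to_space.

```python
import tensorflow as tf
from tensorflow.keras import layers

def spatial_attention(x, channels, mu=2):
    """SA branch: two convolutions (ReLU then sigmoid) yield an H x W x 1 mask."""
    m = layers.Conv2D(mu * channels, 1, activation='relu')(x)   # H x W x muC
    m = layers.Conv2D(1, 1, activation='sigmoid')(m)            # H x W x 1 mask
    return layers.Multiply()([x, m])                            # element-wise product

def csar_block(x, channels):
    """CSAR block of Fig. 3: conv pair, parallel CA and SA, 1x1 fusion, residual add."""
    h = layers.Conv2D(channels, 3, padding='same', activation='relu')(x)
    h = layers.Conv2D(channels, 3, padding='same')(h)
    fused = layers.Concatenate()([channel_attention(h, channels),
                                  spatial_attention(h, channels)])
    fused = layers.Conv2D(channels, 1, padding='same')(fused)
    return layers.Add()([x, fused])                             # residual connection
```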

The discriminator D is shown in Fig. 4. Its inputs, whether original or generated images, have a size of 256×256. The role of the skip connections here is to make training easier. The feature map, which contains a large amount of information, is flattened into a feature vector and finally passed through a fully connected layer with a single node to output the result. The quality of the HR images generated by the generator G is continuously improved with the aid of the discriminator D. To assess the similarity between the original and reconstructed HR images, the Wasserstein distance [28] is calculated. A rough sketch of D is given after Fig. 4 below. The loss function of D is described as:

$$L_{D\_loss} = \frac{1}{N}\sum_{i=1}^{N} \left[ D(G(\tilde{R}_i)) - D(R_i) \right],$$
where N is the number of images in a batch, $\tilde{R}_i$ is an LR FSI image of size 128×128×1 acquired at a low sampling rate, $R_i$ is the corresponding original image of size 256×256×1, $G(\tilde{R}_i)$ is the 256×256×1 HR image reconstructed by G, and $D(G(\tilde{R}_i))$ is the discriminator's score that the reconstructed HR image is an original image.

Fig. 4. The network block diagram of discriminator.
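A rough sketch consistent with this description follows; the convolution depths are assumptions and the skip connections of Fig. 4 are omitted for brevity. The single-node output is left linear, as befits a Wasserstein critic.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_discriminator():
    """Strided convolution stack -> flatten -> single-node dense critic score."""
    inp = layers.Input((256, 256, 1))
    h = inp
    for ch in (32, 64, 128, 256):
        h = layers.Conv2D(ch, 3, strides=2, padding='same')(h)
        h = layers.LeakyReLU(0.2)(h)
    score = layers.Dense(1)(layers.Flatten()(h))   # no sigmoid: Wasserstein critic
    return tf.keras.Model(inp, score)
```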

The loss function of G contains an adversarial loss and a pixel loss. The adversarial loss is:

$$L_{adv\_loss} = \frac{1}{N}\sum_{i=1}^{N} D(G(\tilde{R}_i)).$$

The pixel loss adopts the normalized mean square error (NMSE), defined as:

$$L_{pixel\_loss} = \frac{\| G(\tilde{R}_i) - R_i \|_2^2}{\| R_i \|_2^2}.$$

Therefore, the total loss of G is presented as:

$$L_{G\_loss} = L_{adv\_loss} + \alpha L_{pixel\_loss}.$$

In the proposed network model, the hyper-parameter α is set to 15 in both simulation and experiment. RMSProp [37] is adopted to optimize the weights, with a learning rate of 0.0001. The proposed network model is implemented in Python 3.6 with TensorFlow 1.14 on a Tesla V100-SXM2 GPU.
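As a rough illustration of how Eqs. (7)-(10) drive the alternating updates, the sketch below uses the eager tf.keras API (TF 2.x) rather than the paper's TensorFlow 1.14; G and D are assumed to be built Keras models, and the generator's adversarial term carries the usual WGAN sign (the negative of Eq. (8)) so that G raises the critic score.

```python
import tensorflow as tf

g_opt = tf.keras.optimizers.RMSprop(1e-4)   # RMSProp with the stated learning rate
d_opt = tf.keras.optimizers.RMSprop(1e-4)
alpha = 15.0                                # weight of the pixel loss, Eq. (10)

def train_step(lr_batch, hr_batch):
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake = G(lr_batch, training=True)
        # Eq. (7): Wasserstein critic loss for D
        d_loss = tf.reduce_mean(D(fake, training=True) - D(hr_batch, training=True))
        # Eq. (8) (negated, standard WGAN convention) and Eq. (9): NMSE pixel loss
        adv_loss = -tf.reduce_mean(D(fake, training=True))
        pixel_loss = (tf.reduce_sum(tf.square(fake - hr_batch))
                      / tf.reduce_sum(tf.square(hr_batch)))
        g_loss = adv_loss + alpha * pixel_loss        # Eq. (10)
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, D.trainable_variables),
                              D.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, G.trainable_variables),
                              G.trainable_variables))
    return d_loss, g_loss
```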

3. Numerical simulation and experimental results

3.1 Datasets

The dataset used in simulation is a car dataset containing 16185 images of 196 classes of cars [38]. The images contain rich detail and complex backgrounds, satisfying the requirements of the proposed method. To enlarge the dataset, car-model images crawled from the web were added, bringing the final number of images to 19432. Every image is resized to a 256×256 pixel grayscale image. The dataset is randomly divided into training, validation and test sets of 13432, 3000 and 3000 images, respectively.

The proposed network needs HR images as labels and LR images as inputs. The 256×256 images of the dataset are down-sampled by bicubic interpolation to obtain 128×128 LR images, which are treated as target scenes. Following the circular sampling mask strategy and the chosen sampling rate, the illumination pattern set is calculated from Eq. (1). The sub-sampled spatial frequency spectrum of each target scene is obtained with the three-step phase-shifting method, and after the IFT the 128×128 LR reconstruction of conventional FSI is obtained; these LR reconstructions are the inputs of the proposed network model. The sampling rate β is defined as the ratio of the actual number of measurements to the number required by three-step phase-shifting FSI to fully reconstruct the HR image. Owing to the conjugate symmetry of the Fourier spectrum, fully sampling an N×N image takes 1.5×N×N measurements in total. The calculation is given in Eq. (11),

$$\beta = \frac{M}{1.5 \times W_{HR} \times H_{HR}},$$
where M is the actual number of measurements, and W_HR and H_HR are the width and height of the reconstructed HR image.
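As a worked example of Eq. (11), the measurement budget at β = 3% for a 256×256 HR image is:

```python
beta, W_HR, H_HR = 0.03, 256, 256
M = beta * 1.5 * W_HR * H_HR   # = 2949.12, i.e. about 2949 single-pixel
                               # measurements (~983 frequencies x 3 phase steps)
```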

As shown in Fig. 5, a 256×256 car target is down-sampled by bicubic interpolation to obtain a 128×128 target image. The spatial frequency spectrum of the target is then collected according to the sampling mask, and the under-sampled spectrum is obtained with the three-step phase-shifting method. The circular sampling mask is 128×128 in size and the sampling rate ranges from 0.5% to 5%. The sampling masks are binary images whose pixel values are either 1 or 0: a value of 1 means that the corresponding spectral value is acquired, and a value of 0 that it is not. The white area of the mask is the spectral acquisition region corresponding to the target image. Since most of the information of an image is concentrated at low frequencies, most of the spectral information obtained with the circular mask is low-frequency, and the acquired high-frequency information is inadequate. As a result, the image obtained by applying the IFT to the acquired under-sampled spectrum shows a significant ringing effect. This image is the result of the conventional FSI reconstruction; a short end-to-end sketch follows Fig. 5 below.

Fig. 5. Illustration of the data acquisition process of SR-FSI in our simulations.
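Putting the pieces together, the data path of Fig. 5 can be emulated with the circle_mask and simulate_fsi helpers sketched in Section 2.1; the OpenCV calls and the file name are illustrative assumptions.

```python
import cv2
import numpy as np

hr = cv2.resize(cv2.imread('car.jpg', cv2.IMREAD_GRAYSCALE), (256, 256),
                interpolation=cv2.INTER_CUBIC).astype(float) / 255.0   # HR label
target = cv2.resize(hr, (128, 128), interpolation=cv2.INTER_CUBIC)     # LR scene

mask = circle_mask(128, 0.03)                 # ~3% circular sampling mask
vs, us = np.nonzero(np.fft.ifftshift(mask))   # marked spectral positions (fy, fx)
lr_input = simulate_fsi(target, list(zip(us, vs)))   # conventional FSI result
# (symmetric positions get written twice by simulate_fsi, which is harmless here)
```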

In Fig. 5, it can be seen that the quality of the reconstructed image improves as the sampling rate increases. At a sampling rate of 0.5%, the whole picture looks blurry and its details are hard to make out. At 1%, the picture shows obvious ringing effects and artifacts. When the sampling rate is increased to 3%, the whole car can be clearly identified, although some details, such as the wheels, remain unclear. At 5%, the image looks better still, though it lacks detail and retains a slight ringing effect. To improve the FSI reconstruction at low sampling rates and obtain higher-resolution images, the 128×128 FSI reconstructions are used as inputs to the proposed network model for training, which finally outputs 256×256 high-resolution results.

3.2 Numerical simulations

In the simulation, four network models are trained separately, one per sampling rate: 0.5%, 1%, 3% and 5%. The four models share the same network structure and are trained with the same car dataset. The images reconstructed by the proposed method at the four sampling rates are shown in Fig. 6. In the first row, the 256×256 target image is down-sampled and transformed by three-step phase-shifting FSI to obtain the 128×128 LR images at the different sampling rates, which are the inputs of SR-FSI and DCAN [27]. The second row shows the 128×128 LR images enlarged to 256×256 by bicubic interpolation. The third row is the output of DCAN after bicubic interpolation, and the last row is the 256×256 output of SR-FSI.

Fig. 6. Examples of network input, bicubic interpolation, DCAN after bicubic interpolation and SR-FSI results under different sampling rate.

It can be observed that the LR images are significantly blurred and of relatively low resolution. When larger images are needed, interpolation is often applied, but this usually yields poor quality. At sampling rates below 1% it is difficult to see a clear outline of the car, and even at 3% and 5% significant ringing effects remain. The DCAN outputs after bicubic interpolation improve the image quality and suppress the ringing effect, but still lack much detail. After reconstruction by our network, the image becomes much smoother and the ringing effect is almost invisible. Although many details are missing in the reconstruction at a 0.5% sampling rate, the overall quality improves considerably over the input. The PSNR and SSIM values written below each image show that the quality indexes of the proposed method are also greatly improved. Compared with DCAN after bicubic interpolation and the traditional interpolation method, the proposed SR-FSI performs best.

To further compare the reconstruction performance of SR-FSI with that of DCAN at the same sampling rate, the same dataset is applied to both models. SR-FSI must not only improve the quality of the reconstruction but also increase its resolution, a harder task than the denoising handled by DCAN. First, 256×256 and 128×128 FSI under-sampled images are obtained by the three-step phase-shifting operation; these are the inputs of DCAN and SR-FSI, respectively. The outputs of both are reconstructed 256×256 images. In the simulation and subsequent experiments, the input and output sizes of DCAN and SR-FSI are as described here and are not repeated. In the simulation, the number of sampling measurements is kept the same as in our method. An image is randomly selected from the test data, and the reconstructions obtained with the different methods at the four sampling rates are shown in Fig. 7. The first row shows the 256×256 FSI under-sampled images; the second and third rows show the results of DCAN and SR-FSI, respectively. The sampling rate has a great influence on the reconstruction quality of all methods: as it increases, the reconstructed images improve, and at 5% both DCAN and SR-FSI reconstruct well.

Fig. 7. The numerical simulation results of different methods.

Moreover, the methods differ greatly in reconstruction performance at the same sampling rate. When the sampling rate is particularly low, such as β=0.5% and 1%, obvious blurring artifacts exist in the FSI reconstructed images. The DCAN method does remove the ringing effect of the FSI reconstruction results to some extent, but the whole image still looks blurry, while the images reconstructed by SR-FSI are much clearer. When the sampling rate is increased to 3% and 5%, the reconstruction results of SR-FSI are still the best. At 3%, the car wheels reconstructed by SR-FSI are clear, while the structure of those reconstructed by DCAN is hard to see. At 5%, the door handles are blurred in the DCAN result although its car wheels can be seen clearly.

To compare the reconstruction performance and generalization capability of the different methods quantitatively, the average PSNR and SSIM of the reconstructions over the 3000-image test dataset are calculated. As Fig. 8 shows, the average PSNRs and SSIMs of FSI, DCAN and the proposed SR-FSI are compared at the four sampling rates β (0.5%, 1%, 3% and 5%). The metrics of all methods improve as the sampling rate increases. SR-FSI achieves the highest metrics, followed by DCAN, with FSI the worst; at every sampling rate, SR-FSI outperforms DCAN. A sketch of this evaluation loop follows Fig. 8 below.

Fig. 8. The average PSNRs and SSIMs of reconstructed images with testing dataset.
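The averaging behind Fig. 8 can be reproduced with scikit-image's reference implementations of both metrics; test_pairs below is an assumed iterable of (label, reconstruction) pairs scaled to [0, 1].

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

psnrs, ssims = [], []
for hr, recon in test_pairs:          # 3000 (label, reconstruction) pairs
    psnrs.append(peak_signal_noise_ratio(hr, recon, data_range=1.0))
    ssims.append(structural_similarity(hr, recon, data_range=1.0))
print('mean PSNR:', np.mean(psnrs), 'mean SSIM:', np.mean(ssims))
```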

The reconstruction times of the proposed method, FSI, DCAN and the method of [15] at a 5% sampling rate are listed in Table 1, averaged over 100 images selected from the test dataset. The test platform is an Intel i5-9300H CPU with 16 GB RAM. The time of the IFT is denoted TI, the time of the reconstruction algorithm TS, and the total reconstruction time TT. The reconstruction time of FSI is just the IFT time. For DCAN and the proposed SR-FSI, image reconstruction comprises two steps, the IFT and the network reconstruction. The reconstruction time of the method of [15] contains only the method itself, without the IFT. As the results in Table 1 show, FSI is the fastest, and the method of [15] is slower than both DCAN and the proposed SR-FSI. DCAN is faster than SR-FSI, but only by approximately 11 ms, while SR-FSI performs better than DCAN in imaging quality.

Table 1. Average reconstruction time of the 100 images at sampling rate 5%

To further verify the generalization ability of the network, trained on the car dataset, to other natural scenes, a ship image and an animal image are selected for testing. The reconstructions of FSI and SR-FSI are shown in Fig. 9, with PSNR and SSIM calculated to compare the reconstructions quantitatively. The images reconstructed by FSI are the worst both visually and quantitatively; even at a 5% sampling rate the FSI reconstruction still shows a severe ringing effect. By contrast, the images reconstructed by SR-FSI at the different sampling rates show that SR-FSI effectively removes the ringing caused by under-sampling, and its PSNR and SSIM are also better than those of FSI. The results show that the SR-FSI network trained on cars can still be used for the reconstruction of other natural images; the proposed network thus has good generalization ability and can be effectively applied to real scenes.

Fig. 9. Reconstructed results of the ship image and bird image with FSI and SR-FSI methods.

From the above simulations and generalization tests, it can be seen that the SR-FSI reconstructions do not yet satisfy visual requirements when the sampling rate is below 1%. Considering both quality and time cost, a sampling rate of 3%-5% is therefore the more suitable choice in practical applications.

3.3 Experiment results

Two experiments were performed with the FSI system shown schematically in Fig. 10. The laser source is a pulsed laser (NPL52B/Thorlabs) with a central wavelength of 520 ± 10 nm and an average power of 12 mW. The laser passes through a beam expander (BE02-05-A/Thorlabs) onto a computer-controlled spatial light modulator (SLM). The SLM (TNSLM023-A/AOE Tech Co) modulates the spatial distribution of the laser with the Fourier basis patterns sent by the computer. In the transmitting stage, a beam splitter (BS) deflects the path of the modulated laser, which is then emitted by the optical antenna, a camera lens, and illuminates the target. In the receiving stage, the light reflected from the target is collected through the BS by a single-pixel detector (DET36A2/Thorlabs), which is connected to an 8-bit data acquisition card (PXI-5154/NI) with a 1 GHz bandwidth and a 2 GS/s sampling frequency. With this setup, the spatial frequency spectrum of the target is obtained by the three-step phase-shifting method, and the FSI reconstructions produced by the IFT are fed into the pre-trained network model to acquire the final reconstructed images.

Fig. 10. Schematic of FSI system.

To verify the effectiveness of the proposed method, a car model was chosen as the target in the first experiment. FSI reconstructions at 256×256 and 128×128 resolution are obtained and fed into the pre-trained DCAN and SR-FSI, respectively, for comparison. The experimental results at sampling rates of 0.5%, 1%, 3% and 5% are shown in Fig. 11. The reconstruction quality of all methods increases with the sampling rate. All FSI reconstructions show significant ringing effects, and the target is difficult to distinguish at sampling rates below 1%. DCAN improves the quality of the FSI results by removing artifacts to some extent, while the proposed SR-FSI achieves super-resolution while improving the quality of the generated images and removing the ringing effect. Although it is difficult to distinguish the SR-FSI and DCAN reconstructions visually, the quantitative indexes of the SR-FSI results are better than those of DCAN.

Fig. 11. Results of the first experiment.

To evaluate the practical generalization ability of the SR-FSI network, a skeleton model was chosen as the target in the second experiment; the results are shown in Fig. 12. The SR-FSI and DCAN models are still those previously trained on the car dataset. Both improve the quality of the FSI results, and the SR-FSI model removes the artifacts of the skeleton FSI results well. The SR-FSI reconstruction preserves more skeleton detail than the DCAN reconstruction and, from a visual perspective, outperforms it. The DCAN model also performs worse than SR-FSI in terms of PSNR. At sampling rates of 0.5% and 1%, although the SSIM of the SR-FSI reconstructions is lower than that of DCAN, their PSNR is higher and the results look much clearer and smoother. The skeleton experiment shows that the SR-FSI network model trained on car images can reconstruct not only cars but also other natural images, such as skeleton images.

Fig. 12. Results of the second experiment.

4. Conclusion

To sum up, in order to obtain high-quality reconstructions at low sampling rates, the SR-FSI method is proposed, which reconstructs higher-resolution images from low-resolution FSI results using the idea of super-resolution. In the SR-FSI model, the generator fuses the U-net and an attention mechanism to remove ringing effects and improve the quality of the generated high-resolution images. The network is trained on the car dataset, and the trained model achieves super-resolution, high-quality reconstruction of FSI results at low sampling rates.

In simulation and experiments, the performance of SR-FSI is compared with that of FSI and DCAN at different sampling rates, using a circular mask with sampling rates from 0.5% to 5%. Both simulation and experimental results show that the SR-FSI model removes the ringing effect of FSI and improves the quality of the results while achieving super-resolution. Moreover, SR-FSI is superior to DCAN in both visual and quantitative terms. Other kinds of target images are also tested in simulation and experiment to verify the generalization ability of the network, and the results show that the model trained on car images can be used for other natural images. The proposed SR-FSI model offers a promising approach for the reconstruction of FSI images, given its high-quality reconstruction at low sampling rates and good generalization. Although the method recovers high-quality, high-resolution images from low-resolution FSI results, the model is relatively large; future work should be devoted to the optimization of the network model parameters.

Funding

Open Foundation of Key Laboratory of Optical Field Manipulation of Zhejiang Province (ZJOFM-2020-008); Zhejiang Sci-Tech University (2021Q030); Natural Science Foundation of Zhejiang Province (LY20F010001); Natural Science Foundation of Zhejiang Province (LQ20F050010); National Natural Science Foundation of China (61801429).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. J. H. Shapiro, "Computational ghost imaging," Phys. Rev. A 78(6), 061802 (2008).

2. M. F. Duarte, M. A. Davenport, D. Takhar, J. N. Laska, T. Sun, K. F. Kelly, and R. G. Baraniuk, "Single-Pixel Imaging via Compressive Sampling," IEEE Signal Proc. Mag. 25(2), 83–91 (2008).

3. M. P. Edgar, G. M. Gibson, R. W. Bowman, B. Sun, N. Radwell, K. J. Mitchell, S. S. Welsh, and M. J. Padgett, "Simultaneous Real-Time Visible and Infrared Video with Single-Pixel Detectors," Sci. Rep. 5(1), 10669 (2015).

4. W. L. Chan, K. Charan, D. Takhar, K. F. Kelly, R. G. Baraniuk, and D. M. Mittleman, "A Single-Pixel Terahertz Imaging System Based on Compressed Sensing," Appl. Phys. Lett. 93(12), 121105 (2008).

5. C. M. Watts, D. Shrekenhamer, J. Montoya, G. Lipworth, J. Hunt, T. Sleasman, S. Krishna, D. R. Smith, and W. J. Padilla, "Terahertz compressive imaging with metamaterial spatial light modulators," Nat. Photonics 8(8), 605–609 (2014).

6. R. I. Stantchev, B. Sun, S. M. Hornett, P. A. Hobson, G. M. Gibson, M. J. Padgett, and E. Hendry, "Noninvasive, Near-Field Terahertz Imaging of Hidden Objects Using a Single-Pixel Detector," Sci. Adv. 2(6), e1600190 (2016).

7. J. Wu and S. Li, "Optical multiple-image compression-encryption via single-pixel Radon transform," Appl. Opt. 59(31), 9744 (2020).

8. J. Wu, L. Hu, and J. Wang, "Fast tracking and imaging of a moving object with single-pixel imaging," Opt. Express 29(26), 42589 (2021).

9. X. Yang, Y. Zhang, C. Yang, L. Xu, Q. Wang, and Y. Zhao, "Heterodyne 3D ghost imaging," Opt. Commun. 368, 1–6 (2016).

10. X. Yang, L. Xu, M. Jiang, L. Wu, Y. Liu, and Y. Zhang, "Phase-coded modulation 3D ghost imaging," Optik 220, 165184 (2020).

11. X. Yang, Z. Yu, L. Xu, J. Hu, L. Wu, C. Yang, W. Zhang, J. Zhang, and Y. Zhang, "Underwater ghost imaging based on generative adversarial networks with high imaging quality," Opt. Express 29(18), 28388–28405 (2021).

12. B. Sun, M. P. Edgar, R. Bowman, L. E. Vittert, S. Welsh, A. Bowman, and M. J. Padgett, "3-D Computational imaging with single-pixel detectors," Science 340(6134), 844–847 (2013).

13. O. Katz, Y. Bromberg, and Y. Silberberg, "Compressive ghost imaging," Appl. Phys. Lett. 95(13), 131110 (2009).

14. R. Horisaki, H. Matsui, R. Egami, and J. Tanida, "Single-pixel compressive diffractive imaging," Appl. Opt. 56(5), 1353–1357 (2017).

15. Z. Tang, T. Tang, X. Shi, J. Chen, and Y. Liu, "Fast and high-quality single-pixel imaging," Opt. Lett. 47(5), 1218–1221 (2022).

16. W. Meng, D. Shi, J. Huang, K. Yuan, Y. Wang, and C. Fan, "Sparse Fourier single-pixel imaging," Opt. Express 27(22), 31490–31503 (2019).

17. M. Sun, L. Meng, M. P. Edgar, M. J. Padgett, and N. Radwell, "A Russian Dolls ordering of the Hadamard basis for compressive single-pixel imaging," Sci. Rep. 7(1), 3464 (2017).

18. F. Rousset, N. Ducros, A. Farina, G. Valentini, C. D'Andrea, and F. Peyrin, "Adaptive Basis Scan by Wavelet Prediction for Single-Pixel Imaging," IEEE Trans. Comput. Imaging 3(1), 36–46 (2017).

19. Z. Zhang, X. Ma, and J. Zhong, "Single-pixel imaging by means of Fourier spectrum acquisition," Nat. Commun. 6(1), 6225 (2015).

20. Z. Zhang, X. Wang, G. Zheng, and J. Zhong, "Fast Fourier single-pixel imaging via binary illumination," Sci. Rep. 7(1), 12029 (2017).

21. H. Deng, X. Gao, M. Ma, P. Yao, Q. Guan, X. Zhong, and J. Zhang, "Fourier single-pixel imaging using fewer illumination patterns," Appl. Phys. Lett. 114(22), 221906 (2019).

22. Z. Zhang, X. Wang, G. Zheng, and J. Zhong, "Hadamard single-pixel imaging versus Fourier single-pixel imaging," Opt. Express 25(16), 19619 (2017).

23. Z. Wang, M. I. Dedo, K. Guo, K. Zhou, F. Shen, Y. Sun, S. Liu, and Z. Guo, "Efficient Recognition of the Propagated Orbital Angular Momentum Modes in Turbulences With the Convolutional Neural Network," IEEE Photonics J. 11(3), 1–14 (2019).

24. M. I. Dedo, Z. Wang, K. Guo, and Z. Guo, "OAM mode recognition based on joint scheme of combining the Gerchberg–Saxton (GS) algorithm and convolutional neural network (CNN)," Opt. Commun. 456, 124696 (2020).

25. X. Lai, Q. Li, Z. Chen, X. Shao, and J. Pu, "Reconstructing images of two adjacent objects passing through scattering medium via deep learning," Opt. Express 29(26), 43280–43291 (2021).

26. S. Rizvi, J. Cao, K. Zhang, and Q. Hao, "Improving Imaging Quality of Real-time Fourier Single-pixel Imaging via Deep Learning," Sensors 19(19), 4190 (2019).

27. S. Rizvi, J. Cao, K. Zhang, and Q. Hao, "Deringing and Denoising in Extremely under-Sampled Fourier Single Pixel Imaging," Opt. Express 28(5), 7360 (2020).

28. X. Yang, P. Jiang, M. Jiang, L. Xu, L. Wu, C. Yang, W. Zhang, J. Zhang, and Y. Zhang, "High Imaging Quality of Fourier Single Pixel Imaging Based on Generative Adversarial Networks at Low Sampling Rate," Opt. Lasers Eng. 140, 106533 (2021).

29. Y. Hu, Z. Cheng, X. Fan, Z. Liang, and X. Zhai, "Optimizing the Quality of Fourier Single-Pixel Imaging via Generative Adversarial Network," Optik 227, 166060 (2021).

30. Q. Yang, P. Yan, Y. Zhang, H. Yu, Y. Shi, X. Mou, M. K. Kalra, Y. Zhang, L. Sun, and G. Wang, "Low-Dose CT Image Denoising Using a Generative Adversarial Network With Wasserstein Distance and Perceptual Loss," IEEE Trans. Med. Imaging 37(6), 1348–1357 (2018).

31. O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional Networks for Biomedical Image Segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI) (2015), 234–241.

32. G. Huang, Z. Liu, L. Maaten, and K. Weinberger, "Densely Connected Convolutional Networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017), 2261.

33. S. Ioffe and C. Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," in Proceedings of The 32nd International Conference on Machine Learning (ICML) (2015), 448–456.

34. Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, "Image Super-Resolution Using Very Deep Residual Channel Attention Networks," in Proceedings of the European Conference on Computer Vision (ECCV) (2018), 294–310.

35. H. Xu and K. Saenko, "Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering," in European Conference on Computer Vision (2016), 451.

36. L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T. Chua, "SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017), 6298–6306.

37. Y. Dauphin, H. Vries, J. Chung, and Y. Bengio, "RMSProp and equilibrated adaptive learning rates for non-convex optimization," arXiv:1502.04390 (2015).

38. J. Krause, M. Stark, J. Deng, and L. Fei-Fei, "3D Object Representations for Fine-Grained Categorization," in International Conference on Computer Vision (ICCV) (2013), 554–561.
