
Spatial and temporal super-resolution for fluorescence microscopy by a recurrent neural network


Abstract

A novel spatial and temporal super-resolution (SR) framework based on a recurrent neural network (RNN) is demonstrated. In this work, we learn complex yet useful features from the temporal data by exploiting the structural characteristics of the RNN together with a skip connection. The supervision mechanism not only makes full use of the intermediate output of each recurrent layer to recover the final output, but also alleviates vanishing/exploding gradients during back-propagation. The proposed scheme achieves excellent reconstruction results, improving both the spatial and temporal resolution of fluorescence images on simulated and real tubulin datasets. It is also robust with respect to critical imaging parameters such as the full-width at half-maximum (FWHM) and molecular density. In the validation, the intensity profile is improved by more than 20% and the FWHM by 8%, while at least 40% of the running time is saved compared with the classic Deep-STORM method, a high-performance network that is widely used for comparison.

© 2021 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

Images captured by conventional optical microscopes suffer from Abbe's diffraction limit, resulting in a low spatial resolution of approximately half the optical wavelength. For this reason, any emitter smaller than the diffraction limit appears blurred and larger than it actually is, which greatly restricts the precise localization of fluorescent molecules. Multiple super-resolution (SR) methods have been designed to enable nanoscale observation of biological structures. Single-molecule localization microscopy (SMLM), a predominantly hardware-side solution that includes stochastic optical reconstruction microscopy (STORM) [1] and photo-activated localization microscopy (PALM) [2], has revolutionized biological imaging in recent decades, as have two other technologies, stimulated emission depletion (STED) [3] and structured illumination microscopy (SIM) [4,5]. SMLM relies on the time-modulated emission of small groups of fluorophores: a low-intensity laser randomly activates and quenches a few fluorophores at a time, and the individual emitters can then be localized with high precision by repeating this two-step process until all fluorophores have been sampled. By combining all the reconstructed emitter positions, super-resolved biological structures can be produced at a scale down to tens of nanometers.

Over the years, a series of methods has been proposed to improve the performance of localization microscopy [6,7]. However, regions containing a high density of emitters still pose a major challenge. On the one hand, SMLM typically produces an extensive number of raw diffraction-limited frames to be reconstructed. To improve the spatial resolution, each raw frame should contain as few fluorophores as possible, which leads to a time-consuming procedure and greatly limits practical use, especially for dynamic processes in live cells. On the other hand, to improve the temporal resolution, one alternative is to increase the emitter density in each raw frame and thereby decrease the total number of raw frames. This poses an algorithmic challenge in estimating the point spread function (PSF) and localizing each individual fluorophore. Improving the spatial and temporal resolution simultaneously therefore remains an obstacle that hinders further development of microscopic imaging.

Owing to the concept of artificial intelligence (AI) [8,9] and the rapid development of deep learning [10], image SR, a classic low-level computer vision task that addresses the problem of reconstructing a high-resolution (HR) image from its low-resolution (LR) counterparts, has become a practical alternative for addressing the above issues. In fluorescence microscopy, the LR image corresponds to a diffraction-limited image and the HR image is one in which the emitters are clearly located. Convolutional neural networks (CNNs) are parameterized network architectures capable of extracting abstract features and have proven useful for a wide range of visual tasks. Since Dong et al. first introduced SRCNN [11] for image SR, deep-learning-based methods have shown great competence in recent years owing to their superior reconstruction performance compared with conventional machine learning methods. To break the barrier that SMLM usually requires a large number of frames to reconstruct an SR image, several computational strategies that use CNNs to reconstruct and/or accelerate SMLM imaging in two dimensions (2D) have been suggested [12,13]. In [12], a neural network named ANNA-PALM was presented to reconstruct SR views from acquired localization images. In [13], CNNs for spectroscopic single-molecule localization microscopy (sSMLM) were demonstrated and shown to accelerate multicolor sSMLM imaging faster than existing approaches. Additionally, deep learning [14,15] has also been applied to localize samples of interest in three dimensions (3D), enabling 3D localization microscopy [16].

Recently, an end-to-end SR CNN model named Deep-STORM [17] was proposed, which significantly improved the spatial resolution of single-molecule microscopy. Deep-STORM is a typical encoder-decoder network in which the encoder captures context information in the image and the decoder symmetrically generates up-sampled features. The method is built on a conventional CNN architecture and is therefore affected by the information-propagation barrier and by long-term memory deficiency. These issues severely limit the depth of the network (only eight layers in [17]) and further handicap its ability to fully utilize and extract diverse features, given the widely accepted view that reconstruction performance can be enhanced by stacking more layers. Regarding long-term memory: 1) although the fluorescence is captured randomly by the camera, the number of frames is typically very large, so emitters appearing in one frame are very likely to appear in many other frames; this localization-based information is correlated. 2) Most importantly, high-frequency components extracted in shallow layers, which are important for SR, are lost as information propagates into deeper layers during training, and long-term memory is essential in this situation. Another method [18] improved Deep-STORM by adding a residual block and stacking more layers, but this only slightly relieved, rather than solved, these issues. Performance can be further improved by modifying the network architecture and devising effective information-propagation strategies.

Inspired by [19], which showed that optical wave physics can be trained as an analog RNN to learn complex features, we present a deep-learning-based SR algorithm, named the deep recurrent-supervised network (DRSN-STORM), to simultaneously improve the spatial and temporal resolution. DRSN-STORM differs from Deep-STORM in three respects: (1) DRSN-STORM captures more context information without increasing the network depth. Compared with Deep-STORM, it maintains a much smaller number of parameters (∼0.4M versus ∼1.3M), which helps avoid overfitting and leaves leeway for designing more complex structures with more layers. (2) By introducing a multi-layer RNN, each intermediate output produced at each recurrent depth contributes to reconstructing the final output. (3) A skip connection assists forward/backward information propagation, mitigating the information-propagation barrier and long-term memory deficiency.

2. Method

2.1 Architecture

The architecture is a recurrent-supervised network with a skip connection, inspired by previous work on deeply supervised nets [20] and by ResNet [21]. The proposed framework is illustrated in Fig. 1. In DRSN-STORM, the Feature Extracting Module (FEM) extracts shallow features from the LR input, and the Inference Module (IM) acts as the main SR component that extracts more abstract features. Unlike conventional CNNs, recurrently applying one identical convolution layer widens the receptive field while keeping the number of parameters small. In addition, every intermediate output of each recurrent layer contributes to the final output. In the Reconstruction Module (RM), the HR output is reconstructed by transforming the multi-channel tensors back into the original image space, and a skip connection transfers shallow features to the deeper architecture. All convolution layers of DRSN-STORM use 3${\times}$3 filters followed by ReLU, with 128 filters per layer, as indicated at the bottom of the network in Fig. 1.

Fig. 1. The architecture of our proposed network.

IM and RM are the main components that perform the SR task. To show the procedure in detail, we unfold these components, i.e., the parts enclosed by the purple dashed box in Fig. 1, into the planar structure illustrated in Fig. 2. $I{M_1}$ to $I{M_D}$ denote the recurrent layers, and Conv and Concat are abbreviations of Convolution and Concatenation, respectively. ReLU denotes rectified linear units for nonlinearity, and the arrow colors correspond to the operations in Fig. 1. During forward propagation, $I{M_1}$ takes the output of $FE{M_1}$ as input and produces a feature map, which is then concatenated with the input (black arrow and yellow block) to obtain the compensation from the shallow layer. Another convolutional layer, $R{M_1}$, takes the concatenated feature map as input; after this convolution and nonlinearity, the output is passed to the last convolutional layer ($R{M_2}$), which produces an intermediate output (grey block, Output-1). Following this procedure, each $I{M_d}$ takes the output of $I{M_{d - 1}}$ as input and finally yields an intermediate output Output-d. After $I{M_D}$ is finished, we obtain D intermediate outputs, and the final output of the network is obtained by averaging all of them.

Fig. 2. Unfolded version of IM and RM.

The network is fed an LR image x as input and outputs a target image $\hat{y}$ that predicts the emitter locations. The goal is to learn a model f that represents this mapping. Denote by ${f_{FEM}}$, ${f_{IM}}$, and ${f_{RM}}$ the subfunctions of the FEM, IM and RM, respectively. The target model f can then be composed as $f(x) = {f_{RM}}({f_{IM}}({f_{FEM}}(x)))$.

The FEM subfunction ${f_{FEM}}(x)$ takes x as input. The intermediate outputs of its two layers, denoted ${L_{ - 1}}$ and ${L_0}$, can be described as

$${L_{ - 1}} = \max (0,{W_{ - 1}} \ast x + {b_{ - 1}})$$
and
$${L_0} = \max (0,{W_0} \ast {L_{ - 1}} + {b_0})$$
where ${W_{ - 1}}$, ${W_0}$ and ${b_{ - 1}}$, ${b_0}$ are weight and bias matrices, $\max (0,\cdot )$ is a nonlinear activation function, namely ReLU, and ${\ast} $ is a convolution operator.

IM subfunction ${f_{IM}}$ takes the output of FEM, ${L_0}$, as the input and computes the intermediate output ${L_D}$. Let r denote the recurrent function: $r(L) = \max (0,W \ast L + b)$. The recurrence mapping can be formulated as

$${L_d} = r({L_{d - 1}}) = \max (0,W \ast {L_{d - 1}} + b)$$
where $d = 1,2, \cdots ,D$ denotes the depth of IM. ${f_{IM}}$ can then be formulated as
$${f_{IM}}(L) = (r \odot r \odot \cdots \odot r)(L) = {r^D}(L)$$
where ${\odot}$ denotes function composition and ${r^D}$ denotes the D-fold composition. The recurrence mapping used here is equivalent to stacking multiple convolutional layers with shared parameters. ${f_{IM}}$ improves the performance of the network by extracting more features at a very low cost in network depth; at the same time, the receptive field is widened so that global information can be utilized as much as possible.
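To illustrate Eq. (4), the following minimal sketch applies one shared 3${\times}$3 convolution D times, so the receptive field grows with D while the parameter count stays that of a single layer. The filter count and kernel size follow the text; the variable names and test input are our own illustrative assumptions, not the authors' code.

```python
# Minimal sketch of the recurrence r^D in Eq. (4): one convolution layer with
# shared weights (W, b) is applied D times.
import tensorflow as tf

r = tf.keras.layers.Conv2D(128, 3, padding="same", activation="relu")  # the single shared layer

def f_IM(L0, D=5):
    """Apply the same convolution D times: L_d = r(L_{d-1})."""
    L = L0
    for _ in range(D):
        L = r(L)
    return L

L0 = tf.random.normal((1, 104, 104, 128))   # e.g. a 26x26 patch up-sampled by 4, 128 channels
print(f_IM(L0).shape)                        # spatial size preserved by "same" padding
print(r.count_params())                      # ~147k parameters, independent of D
```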

The RM subfunction ${f_{RM}}$ takes the corresponding IM output ${L_d}$ together with ${L_{ - 1}}$ as input and computes the final prediction. A skip connection transfers the shallow features to the RM for better reconstruction performance during forward propagation:

$${L^{\prime}_d} = \max (0,{W_{D + 1}} \ast ({L_d} \oplus {L_{ - 1}}) + {b_{D + 1}})$$
and
$${\bar{L}_d} = \max (0,{W_{D + 2}} \ast {L^{\prime}_d} + {b_{D + 2}})$$
The RM subfunction can then be described as
$${f_{RM}}(L) = \hat{y} = \frac{1}{D}\sum\limits_{d = 1}^D {{{\bar{L}}_d}}$$
where ${L^{\prime}_d}$ and ${\bar{L}_d}$ denote the intermediate outputs of RM corresponding to the depth of the recurrent unit, ${W_{D + 1}}$, ${W_{D + 2}}$ and ${b_{D + 1}}$, ${b_{D + 2}}$ denote the weight and bias matrices, and ${\oplus}$ denotes a concatenation operation. The main advantage over Deep-STORM is that DRSN-STORM exploits the full potential of all intermediate outputs from each recurrent unit when forming the final reconstructed output, which benefits the reconstruction of fine details and structures.
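To make the full forward pass of Eqs. (1)-(7) concrete, a minimal tf.keras sketch is given below. The depth D, filter count and kernel size follow the text; the padding choice, the sharing of the RM layers across depths and all layer/variable names are assumptions made for illustration only, not the authors' released implementation.

```python
# Sketch of the DRSN-STORM forward pass: FEM (Eqs. 1-2), recurrent IM (Eqs. 3-4),
# RM with skip connection and per-depth outputs (Eqs. 5-6), averaged output (Eq. 7).
import tensorflow as tf
from tensorflow.keras import layers

def build_drsn_storm(input_shape=(None, None, 1), D=5, filters=128):
    x = layers.Input(shape=input_shape)                                      # up-sampled LR input

    # FEM: two 3x3 conv + ReLU layers
    L_m1 = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)   # L_{-1}
    L_0  = layers.Conv2D(filters, 3, padding="same", activation="relu")(L_m1)

    # IM: one conv layer applied recurrently D times (shared weights)
    recurrent_conv = layers.Conv2D(filters, 3, padding="same", activation="relu")
    # RM: reconstruction layers, assumed shared across depths (W_{D+1}, W_{D+2})
    rm1 = layers.Conv2D(filters, 3, padding="same", activation="relu")
    rm2 = layers.Conv2D(1, 3, padding="same", activation="relu")

    outputs, L_d = [], L_0
    for _ in range(D):
        L_d = recurrent_conv(L_d)                            # L_d = r(L_{d-1})
        skip = layers.Concatenate()([L_d, L_m1])             # skip connection (Eq. 5)
        outputs.append(rm2(rm1(skip)))                       # intermediate Output-d

    y_hat = layers.Average()(outputs)                        # average of all intermediate outputs
    return tf.keras.Model(x, y_hat)

model = build_drsn_storm()
model.summary()
```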

An RNN is a simple yet practical model with a small model capacity that characterizes the relationship between the current input and previous information. However, two severe obstacles still hinder its performance. One is the inherent defect of RNNs, namely vanishing/exploding gradients, which can be viewed as a backward information-propagation barrier. The other is the loss of high-frequency components, a forward information-propagation barrier, which plays an important role in single-image SR. It is therefore particularly useful to carry high-frequency features extracted in shallow layers forward to the final HR prediction. Hence, the skip connection used in the proposed model not only provides this high-frequency information but also alleviates vanishing/exploding gradients during back-propagation.

2.2 Training

The depth of our model was set to nine layers, including five recurrent layers. Twenty frames of 64${\times}$64 pixels with an 80 nm pixel size and randomly distributed emitters were generated using the ImageJ software [22] with the ThunderSTORM plugin [23]. Specifically, the FWHM and intensity were set from 80 nm to 400 nm and from 2000 to 2800 photons, respectively. The parameters were set according to [17,18] with slight modifications. The molecular density was set to 10 molecules/$\mu {m^2}$ for reconstructing the high-density datasets, and to 50, 100 or 200 molecules/$\mu {m^2}$ for the NMSE comparison. Each LR image was then up-sampled by the up-sampling factor, and patches of 26${\times}$26 pixels with a stride of twelve were cropped from each up-sampled image. The HR counterparts were generated by visualizing the ground-truth localizations at a magnification equal to the up-sampling factor. The final training set consisted of ∼8 K pairs of LR patches and ground-truth localizations for an up-sampling factor of four, and ∼33 K pairs for an up-sampling factor of eight. We used these datasets to train the networks and to evaluate both simulated and experimental images; 80% of the examples were used for training and the remainder for validation. Figure 3 presents one prediction result for an LR input, i.e., a test example generated with the ThunderSTORM plugin (64${\times}$64 pixels, 80 nm pixel size, FWHM from 80 nm to 360 nm, intensity from 2000 to 2500 photons), together with its corresponding output.
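A rough sketch of how the LR/HR training pairs described above could be assembled is given below. The patch size, stride and 80/20 split follow the text; the helper names and the nearest-neighbour up-sampling are our assumptions, since the exact interpolation is not specified here.

```python
# Hypothetical data-preparation helpers: up-sample each LR frame, then crop
# 26x26 patches with a stride of 12 pixels from both LR and HR images.
import numpy as np

def extract_patches(image, patch_size=26, stride=12):
    """Crop overlapping patches from a single 2D image."""
    patches = []
    h, w = image.shape
    for i in range(0, h - patch_size + 1, stride):
        for j in range(0, w - patch_size + 1, stride):
            patches.append(image[i:i + patch_size, j:j + patch_size])
    return np.stack(patches)

def upsample(image, factor=4):
    """Nearest-neighbour up-sampling (assumed; the paper's interpolation is unspecified)."""
    return np.kron(image, np.ones((factor, factor)))

def build_training_pairs(lr_frames, hr_frames, factor=4):
    """lr_frames / hr_frames: simulated ThunderSTORM frames and ground-truth localization images."""
    x, y = [], []
    for lr, hr in zip(lr_frames, hr_frames):
        x.append(extract_patches(upsample(lr, factor)))
        y.append(extract_patches(hr))
    x, y = np.concatenate(x), np.concatenate(y)
    split = int(0.8 * len(x))            # 80% training / 20% validation split
    return (x[:split], y[:split]), (x[split:], y[split:])
```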

Fig. 3. Prediction result of quantum dot data. (a) Low resolution image. (b) Ground-truth location of (a). (c) DRSN-STORM reconstruction result with the loss function using ${l_1}$ norm and the magnified view of the selected region. (d) DRSN-STORM reconstruction result with the loss function using ${l_2}$ norm and the magnified view of the selected region.

The loss function is key to finding the optimal parameters at the training stage. Unlike conventional CNNs, RNNs suffer from vanishing/exploding gradients when trained with standard gradient descent. Deep supervision combined with a skip connection lets gradients pass through fewer layers during back-propagation, which alleviates this barrier. Hence, the loss function is composed of two parts: an output data-fidelity term and a regularization term.

Given D supervised outputs and a training dataset $\{{{x^{(i)}},{y^{(i)}}} \}_{i = 1}^N$, the prediction is

$${\hat{y}^{(i)}} = \sum\limits_{d = 1}^D {{w_d} \cdot \hat{y}_d^{(i)}}$$
and the output loss function is given by
$$\sum\limits_{i = 1}^N {\frac{1}{{2N}}} \left\|{{y^{(i)}} \ast k - {{\hat{y}}^{(i)}} \ast k} \right\|_2^2$$
where $\hat{y}_d^{(i)}$ is the intermediate output of the d-th recursion, ${w_d}$ is its weight, k is a randomly generated Gaussian kernel of size 7${\times}$7 with a standard deviation of 1 pixel, and * denotes the convolution operation. The final loss function is then given by
$$ \mathcal{L} = \sum\limits_{i = 1}^N {\frac{1}{{2N}}\left\|{{y^{(i)}} \ast k - {{\hat{y}}^{(i)}} \ast k} \right\|_2^2 + {{\left\|{{{\hat{y}}^{(i)}}} \right\|}_2}}$$
Different from Deep-STORM and [18], which output predicted localizations, the proposed method follows traditional image processing, i.e., it outputs the reconstructed image without predicting specific localizations. Hence, we constrain the result with an ${l_2}$ norm. Figures 3(c) and 3(d) show the effectiveness of the ${l_2}$ norm, which yields a higher reconstruction accuracy and a lower error compared with the ${l_1}$ norm. The network was implemented in TensorFlow [24]. We trained it for 100 epochs, optimizing the regression objective by mini-batch gradient descent with back-propagation. Training, evaluation and testing were performed on a standard workstation with 64 GB of memory, two Intel Xeon Silver 4216 2.1 GHz CPUs, and two NVIDIA GeForce RTX 2080 Ti GPUs with 11 GB of video memory each.
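A minimal sketch of the loss in Eq. (10) is shown below, assuming NHWC tensors and a fixed, normalized 7${\times}$7 Gaussian kernel ($\sigma$ = 1 pixel). The batch averaging is our simplification of the $1/(2N)$ sum, and the helper names are ours, not the authors' training code.

```python
# Blur target and prediction with the same Gaussian kernel, take the squared
# error, and add the l2-norm regularization on the prediction (Eq. 10).
import numpy as np
import tensorflow as tf

def gaussian_kernel(size=7, sigma=1.0):
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    k /= k.sum()
    return tf.constant(k[:, :, None, None], dtype=tf.float32)   # HWIO layout for conv2d

_kernel = gaussian_kernel()

def drsn_loss(y_true, y_pred):
    blur = lambda t: tf.nn.conv2d(t, _kernel, strides=1, padding="SAME")
    # data-fidelity term: squared error between blurred target and blurred prediction
    fidelity = 0.5 * tf.reduce_mean(
        tf.reduce_sum(tf.square(blur(y_true) - blur(y_pred)), axis=[1, 2, 3]))
    # regularization: per-sample l2 norm of the prediction, averaged over the batch
    reg = tf.reduce_mean(tf.norm(tf.reshape(y_pred, [tf.shape(y_pred)[0], -1]), axis=1))
    return fidelity + reg
```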

3. Result

The proposed DRSN-STORM is evaluated on both simulated and experimental fluorescence images. Different from [17], which logically increased the highest resolution through an up-sampling operation, we directly generated the diffraction-limited images and their correctly localized counterparts at the expected pixel size using ImageJ as the training data. For a fair comparison, DRSN-STORM and Deep-STORM were trained with the same data for model validation and reconstruction, and Deep-STORM was run using the open-source code released by its authors.

3.1 Model validation

We evaluate the performance of DRSN-STORM on the validation data. The comparisons are made in terms of the loss value, i.e., the difference between the reconstruction and the ground-truth image, and the normalized mean square error (NMSE).

As mentioned above, DRSN-STORM and Deep-STORM differ in principle, so we only analyze the validation loss of DRSN-STORM for up-sampling factors of four and eight over 100 epochs, as depicted in Fig. 4. In both cases, the validation loss decreases as the number of epochs grows, indicating quick convergence and no overfitting. Performance for the up-sampling factor of eight is worse than for the factor of four because more information must be recovered at a larger up-sampling factor, which leads to larger errors.

Fig. 4. Loss value of DRSN-STORM on the validation data under the circumstances that the up-sampling factor equals to four and eight, respectively.

Further quantitative results are also reported to compare the performance of the proposed method. Different from natural-image SR, the commonly used metric for quantitatively evaluating fluorescence image SR is the NMSE, which is formulated as

$$NMSE = \frac{{||{\hat{y} - y} ||_2^2}}{{||y ||_2^2}} \times 100\%$$
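For reference, Eq. (11) amounts to the following few lines (a direct NumPy transcription, not the authors' evaluation script):

```python
import numpy as np

def nmse(y_hat, y):
    """Normalized mean square error between a reconstruction and its ground truth, in percent."""
    return 100.0 * np.sum((y_hat - y) ** 2) / np.sum(y ** 2)
```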
Figure 5 reports the NMSE comparison under emitter densities of 50, 100 and 200 molecules/$\mu {m^2}$ generated with the ThunderSTORM plugin. Considering the computation time, we only show comparisons for an up-sampling factor of four. We generated the original diffraction-limited frames of 64${\times}$64 pixels with an 80 nm pixel size, with randomly distributed emitters and their ground-truth localizations. To increase randomness, the FWHM and intensity were set from 100 nm to 360 nm and from 2000 to 2600 photons, respectively. For each case, we randomly selected 1000 frames to obtain the averaged results. DRSN-STORM outperforms Deep-STORM by achieving a much smaller NMSE under all circumstances (both minimum and average).

Fig. 5. Comparison of NMSE under different emitter densities of (a) 50, (b) 100 and (c) 200 molecules/$\mu {m^2}$.

3.2 Reconstruction results of high-density datasets

We evaluated DRSN-STORM against Deep-STORM, a high-performance end-to-end network, on two experimental validation datasets: a high-density realistic simulation dataset and a long-sequence tubulin dataset, both obtained from the single-molecule localization challenge (EPFL) website [25].

The realistic simulation dataset consists of one thousand high-density frames of 128${\times}$128 pixels with a pixel size of 100 nm; the numerical aperture (NA) was 1.46 and the FWHM was 224.32 nm. The tubulin dataset consists of fifteen thousand frames of 64${\times}$64 pixels with a 100 nm pixel size and an NA of 1.3. All testing images were up-sampled in advance by a factor of four or eight to the desired output size. Rather than processing the acquisition frame by frame as in conventional SMLM, we summed multiple raw frames, e.g., ten, fifty or five hundred, into a single image for each processing step (as sketched below) to decrease the total computation time and hence improve the temporal resolution.
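The frame-binning step just described can be sketched as follows; the array names and grouping logic are illustrative assumptions rather than the authors' pipeline.

```python
# Sum groups of raw frames (e.g. 10, 50 or 500 at a time) so that each summed
# image is reconstructed in a single pass, trading processing steps for speed.
import numpy as np

def sum_frames(stack, frames_per_image):
    """stack: (T, H, W) raw acquisition; returns (T // frames_per_image, H, W)."""
    t = (stack.shape[0] // frames_per_image) * frames_per_image
    grouped = stack[:t].reshape(-1, frames_per_image, *stack.shape[1:])
    return grouped.sum(axis=1)

# e.g. ten raw tubulin frames per network input:
# summed = sum_frames(raw_frames, frames_per_image=10)
```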

We first evaluate performance on the realistic simulation testing image summed from ten frames with an up-sampling factor of four, shown in Fig. 6(a). The reconstructions of Deep-STORM and DRSN-STORM are given in Fig. 6(b) and Fig. 6(c), respectively, and Figs. 6(g)-6(i) and Figs. 6(j)-6(l) show the corresponding zoomed-in details. Deep-STORM fails to reconstruct the complete structure of a testing image summed from so few frames, whereas our DRSN-STORM recovers the edges with higher accuracy and exhibits better consistency, especially in the intersection areas. The intensity profiles along the yellow lines in Figs. 6(m)-6(o) confirm that DRSN-STORM reconstructs higher pixel intensities than Deep-STORM (averaged pixel intensity 8.1063 versus 3.6060, 6.1421 versus 1.9251 and 10.7707 versus 5.2741, respectively). DRSN-STORM also achieves a slightly better FWHM (28.2 nm versus 28.4 nm), but a slightly lower intensity (0.76 a.u. versus 0.78 a.u.).
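The FWHM values quoted here come from fitting a Gaussian to an intensity profile along a line segment, as in Figs. 6(p) and 6(q). A generic sketch of such a measurement (our own helper using SciPy, not the authors' analysis code) is:

```python
# Fit a 1D Gaussian to a measured intensity profile and convert its sigma to FWHM.
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, a, mu, sigma, offset):
    return a * np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) + offset

def fwhm_from_profile(profile, pixel_size_nm=20.0):
    """Return the FWHM (in nm) of a profile sampled at the given pixel size."""
    x = np.arange(len(profile), dtype=float)
    p0 = [profile.max() - profile.min(), float(np.argmax(profile)), 2.0, profile.min()]
    (a, mu, sigma, offset), _ = curve_fit(gaussian, x, profile, p0=p0)
    return 2.0 * np.sqrt(2.0 * np.log(2.0)) * abs(sigma) * pixel_size_nm

# e.g. profile = reconstruction[row, col_start:col_end] along the orange segment
```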

Fig. 6. Prediction results on the realistic simulation dataset stacked with ten frames. Pixel size = 20 nm (after up-sampling). (a) Sum of ten acquired frames. (b) and (c) are the prediction results using Deep-STORM and DRSN-STORM, respectively. (d)-(f), (g)-(i), (j)-(l) Magnified views of the selected regions in (a), (b) and (c), respectively. (m), (n) and (o) Plots of intensity profiles along the yellow lines in the red, green and yellow boxes, respectively. (p) and (q) FWHM and intensity profiles of the orange line segments (marked with number one and number two) in (b) and (c), respectively. Sea blue curves and black dots are the fitted Gaussian curves and measured intensities, respectively. The FWHM values are measured and indicated with double-arrow lines.

Similar results are obtained for the fifty-frame case (up-sampling factor of four) in Fig. 7. Comparing Fig. 7(g) with Fig. 7(j), Fig. 7(h) with Fig. 7(k), and Fig. 7(i) with Fig. 7(l), DRSN-STORM produces fewer reconstruction errors in the intersection areas and better maintains structure consistency. DRSN-STORM also outperforms Deep-STORM in the relatively high-density regions of images summed from few frames by reducing spurious peaks between adjacent structures; this is consistent with the intensity profile along the yellow line in Fig. 7(k) and verifies that the proposed method resolves the intersection areas slightly better. On the other hand, the proposed method suffers from some background noise, because DRSN-STORM directly resolves the diffraction-limited images summed from few frames, which contain substantial noise, in an image-processing manner rather than predicting emitter locations as Deep-STORM does. As the number of combined frames increases, DRSN-STORM achieves a much better FWHM (33.2 nm versus 36 nm) and intensity (1.27 a.u. versus 1.18 a.u.).

Fig. 7. Prediction results on the realistic simulation dataset stacked with fifty frames. Pixel size = 20 nm (after up-sampling). (a) Sum of fifty acquired frames. (b) and (c) are the prediction results using Deep-STORM and DRSN-STORM, respectively. (d)-(f), (g)-(i) and (j)-(l) Magnified views of the selected regions in (a), (b) and (c), respectively. (m), (n) and (o) Plots of intensity profiles along the yellow lines in the red, green and yellow boxes, respectively. (p) and (q) FWHM and intensity profiles of the orange line segments (marked with number one and number two) in (b) and (c), respectively. Sea blue curves and black dots are the fitted Gaussian curves and measured intensities, respectively. The FWHM values are measured and indicated with double-arrow lines.

Figures 8 and 9 show comparisons on the tubulin dataset composed of five hundred and one thousand long-sequence frames (up-sampling factor of four), respectively. Deep-STORM succeeds in recovering the overall tubulin structure but fails to maintain structural consistency, leaving gaps between recovered molecules. Moreover, details such as the protuberances shown in the red squares and their magnified views in Fig. 8(f) and Fig. 9(f) are not revealed. Reconstruction errors in the intersection areas (green squares and their magnified views in Fig. 8(g) and Fig. 9(g)) are also observed for Deep-STORM, consistent with the intensity profiles along the yellow lines in Figs. 8(j),(k) and Figs. 9(j),(k). The proposed method clearly performs better both in recovering the structure and in maintaining structural consistency, leading to a more satisfying visual result.

Fig. 8. Comparison using the tubulin dataset. (a) Sum of five-hundred long-sequence frames. (b) Prediction results using Deep-STORM. (c) Prediction results using DRSN-STORM. (d), (e) Magnified views of the selected regions in (a). (f), (g) and (h), (i) Magnified views of the selected regions in (b) and (c), respectively. (j) and (k) Plots of intensity profiles along the yellow lines in red and green boxes, respectively.

Fig. 9. Comparison using the tubulin dataset. (a) Sum of one-thousand long-sequence frames. (b) Prediction results using Deep-STORM. (c) Prediction results using DRSN-STORM. (d), (e) Magnified views of the selected regions in (a). (f), (g) and (h), (i) Magnified views of the selected regions in (b) and (c), respectively. (j) and (k) Plots of intensity profiles along the yellow lines in red and green boxes, respectively.

Fig. 10. Comparison using the tubulin dataset. (a) Sum of five-hundred long-sequence frames. (b) Prediction results using Deep-STORM. (c) Prediction results using DRSN-STORM. (d), (e) Magnified views of the selected regions in (a). (f), (g) and (h), (i) Magnified views of the selected regions in (b) and (c), respectively. (j) and (k) Plots of intensity profiles along the yellow lines in green and yellow boxes, respectively.

We then evaluate performance on the long-sequence tubulin testing images summed from five hundred and one thousand frames with an up-sampling factor of eight, shown in Fig. 10(a) and Fig. 11(a). Both methods perform more poorly here than with an up-sampling factor of four, because a larger up-sampling factor requires more information to be recovered and thus leads to larger reconstruction errors. The outlines of the reconstructions generated by Deep-STORM are almost the same as those of the diffraction-limited inputs, meaning that the diffraction limit is not relieved but merely replaced by excessive predictions. For the up-sampling factor of eight, the proposed method also performs worse than for the factor of four, as seen from the intensity profiles along the yellow lines (Fig. 8(h) versus Fig. 10(h), Fig. 8(i) versus Fig. 10(i), Fig. 9(h) versus Fig. 11(h), Fig. 9(i) versus Fig. 11(i)). Although the detail errors of Figs. 8(f),(g) and Figs. 9(f),(g) also appear in Figs. 10(f),(g) and Figs. 11(f),(g), Deep-STORM performs slightly worse than DRSN-STORM in the high-density regions. Figures 10(j), 10(k), 11(j) and 11(k) show the intensity profiles along the yellow lines for Deep-STORM versus DRSN-STORM (averaged pixel intensity 12.0890 versus 16.8687, 5.6577 versus 14.6279, 9.2200 versus 18.6461 and 5.6096 versus 17.1532).

Fig. 11. Comparison using the tubulin dataset. (a) Sum of one-thousand long-sequence frames. (b) Prediction results using Deep-STORM. (c) Prediction results using DRSN-STORM. (d), (e) Magnified views of the selected regions in (a). (f), (g) and (h), (i) Magnified views of the selected regions in (b) and (c), respectively. (j) and (k) Plots of intensity profiles along the yellow lines in green and yellow boxes, respectively.

DRSN-STORM yields competitive reconstruction results compared with Deep-STORM while requiring a shorter running time and fewer parameters (∼0.4M versus ∼1.3M). Therefore, DRSN-STORM is more efficient and less susceptible to overfitting. Table 1 compares the running times on both the realistic simulation and the real experimental datasets (Figs. 6 to 11).


Table 1. Comparison of running time

4. Conclusions

In this paper, a novel framework for reconstructing high-density molecule localizations with a recurrent-supervised network is proposed. Owing to the mapping between optics-wave-based physical phenomena and computation in an RNN, we obtain comparable results on both simulated and real experimental datasets. The proposed method outperforms Deep-STORM in maintaining structural consistency and reducing noise. Moreover, our method needs less running time to recover diffraction-limited images and less training data to prevent overfitting. Overall, the proposed DRSN-STORM is believed to have great potential for overcoming the diffraction limit and recovering super-resolved images. SMLM generates thousands of raw diffraction-limited frames, an imaging mechanism that troubles traditional image processing, which usually handles a single image; this difference poses an issue for designing experiments in a traditional image-processing manner. Figure 3 demonstrates that the proposed method achieves better reconstruction performance for single-image SR. However, handling tens of thousands of raw frames frame by frame would be a tough and time-consuming task, while processing an image summed from a very large number of raw frames would greatly reduce performance and/or increase the difficulty of designing and optimizing the network. The reconstruction results would also be affected more severely by background noise of unknown distribution due to the frame superimposition. For different fluorescence structures, how to choose the number of summed frames to balance the trade-off between spatial and temporal resolution remains an open problem that needs further study for DRSN-STORM. In addition, we intend to continue optimizing the RNN-based architecture and increasing its robustness to noise in order to apply it to our previously developed hyperspectral imaging system [26], which achieves extremely high spectral resolution but relatively low spatial/temporal resolution. By means of deep learning, the ability to explore the microcosm can be advanced by improving the spatial/temporal resolution of our hyperspectral imaging system.

Funding

Key Research and Development Projects of Shaanxi Province (2020ZDLGY01-03); National Natural Science Foundation of China (51975483, 62006195); Science, Technology and Innovation Commission of Shenzhen Municipality (JCYJ20180508151936092).

Acknowledgments

We acknowledge Haoyong Li and Xue Dong for polishing the paper.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. M. J. Rust, M. Bates, and X. Zhuang, “Sub-diffraction-limit imaging by stochastic optical reconstruction microscopy (STORM),” Nat. Methods 3(10), 793–796 (2006). [CrossRef]  

2. T. A. Klar and S. W. Hell, “Subdiffraction resolution in far-field fluorescence microscopy,” Opt. Lett. 24(14), 954–956 (1999). [CrossRef]  

3. E. Betzig, G. H. Patterson, R. Sougrat, O. W. Lindwasser, S. Olenych, J. S. Bonifacino, M. W. Davidson, J. Lippincott Schwartz, and H. F. Hess, “Imaging intracellular fluorescent proteins at nanometer resolution,” Science 313(5793), 1642–1645 (2006). [CrossRef]  

4. M. Gustafsson, “Surpassing the lateral resolution limit by a factor of two using structured illumination microscopy,” J. Microsc. 198(2), 82–87 (2000). [CrossRef]  

5. W. Lukosz and M. Marchand, “Optischen abbildung unter Überschreitung der beugungsbedingten auflösungsgrenze,” Opt. Acta 10(3), 241–255 (1963). [CrossRef]  

6. K. Agarwal and R. Machán, “Multiple signal classification algorithm for super-resolution fluorescence microscopy,” Nat. Commun. 7(1), 13752 (2016). [CrossRef]  

7. A. Lee, K. Tsekouras, C. Calderon, C. Bustamante, and S. Pressé, “Unraveling the thousand word picture: an introduction to super-resolution data analysis,” Chem. Rev. 117(11), 7276–7330 (2017). [CrossRef]  

8. G. E. Hinton, S. Osindero, and Y. W. Teh, “A Fast Learning Algorithm for Deep Belief Nets,” Neural Comput. 18(7), 1527–1554 (2006). [CrossRef]  

9. G. E. Hinton and R. R. Salakhutdinov, “Reducing the Dimensionality of Data with Neural Networks,” Science 313(5786), 504–507 (2006). [CrossRef]  

10. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proceedings of Adv Neural Inform Process Syst (2012), 1097–1105.

11. C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE Trans. Pattern Anal. Mach. Intell. 38(2), 295–307 (2016). [CrossRef]  

12. W. Ouyang, A. Aristov, M. Lelek, X. Hao, and C. Zimmer, “Deep learning massively accelerates super-resolution localization microscopy,” Nat. Biotechnol. 36(5), 460–468 (2018). [CrossRef]  

13. S. K. Gaire, Y. Zhang, H. Li, R. Yu, H. Zhang, and L. Ying, “Accelerating multicolor spectroscopic single-molecule localization microscopy using deep learning,” Biomed. Opt. Express 11(5), 2705–2721 (2020). [CrossRef]  

14. N. Boyd, E. Jonas, H. Babcock, and B. Recht, “DeepLoco: Fast 3D Localization Microscopy Using Neural Networks,” bioRxiv 267096 (2018), https://doi.org/10.1101/267096.

15. E. Nehme, D. Freedman, R. Gordon, B. Ferdman, L. E. Weiss, O. Alalouf, T. Naor, R. Orange, T. Michaeli, and Y. Shechtman, “DeepSTORM3D: dense 3D localization microscopy and PSF design by deep learning,” Nat. Methods 17(7), 734–740 (2020). [CrossRef]  

16. A. Von Diezmann, Y. Shechtman, and W. Moerner, “Three-dimensional localization of single molecules for super-resolution imaging and single-particle tracking,” Chem. Rev. 117(11), 7244–7275 (2017). [CrossRef]  

17. E. Nehme, L. E. Weiss, T. Michaeli, and Y. Shechtman, “Deep-STORM: super-resolution single-molecule microscopy by deep learning,” Optica 5(4), 458–464 (2018). [CrossRef]  

18. B. Yao, W. Li, W. Pan, Z. Yang, D. Chen, J. Li, and J. Qu, “Image reconstruction with a deep convolutional neural network in high-density super-resolution microscopy,” Opt. Express 28(10), 15432–15446 (2020). [CrossRef]  

19. T. W. Hughes, I. A. D. Williamson, M. Minkov, and S. Fan, “Wave physics as an analog recurrent neural network,” Sci. Adv. 5(12), eaay6946 (2019). [CrossRef]  

20. C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu, “Deeply supervised nets,” arXiv:1409.5185 (2014).

21. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2016), 770–778.

22. C. T. Rueden, J. Schindelin, M. C. Hiner, B. E. DeZonia, A. E. Walter, E. T. Arena, and K. W. Eliceiri, “Imagej2: Imagej for the next generation of scientific image data,” BMC Bioinf. 18(1), 529 (2017). [CrossRef]  

23. M. Ovesný, P. Krížek, J. Borkovec, Z. Švindrych, and G. M. Hagen, “ThunderSTORM: a comprehensive ImageJ plug-in for PALM and STORM data analysis and super-resolution imaging,” Bioinformatics 30(16), 2389–2390 (2014). [CrossRef]  

24. M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: large-scale machine learning on heterogeneous systems,” arXiv: 1603.04467 (2016).

25. Biomedical Imaging Group, Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, “Benchmarking of single-molecule localization microscopy software,” http://bigwww.epfl.ch/smlm/.

26. X. Dong, X. C. Xiao, Y. N. Pan, G. Y. Wang, and Y. T. Yu, “DMD-based hyperspectral imaging system with tunable spatial and spectral resolution,” Opt. Express 27(12), 16995–17006 (2019). [CrossRef]  
