
Temporal focusing multiphoton microscopy with cross-modality multi-stage 3D U-Net for fast and clear bioimaging


Abstract

Temporal focusing multiphoton excitation microscopy (TFMPEM) enables fast widefield biotissue imaging with optical sectioning. However, under widefield illumination, the imaging performance is severely degraded by scattering effects, which induce signal crosstalk and a low signal-to-noise ratio in the detection process, particularly when imaging deep layers. Accordingly, the present study proposes a cross-modality learning-based neural network method for performing image registration and restoration. In the proposed method, the point-scanning multiphoton excitation microscopy images are registered to the TFMPEM images by an unsupervised U-Net model based on a global linear affine transformation process and local VoxelMorph registration network. A multi-stage 3D U-Net model with a cross-stage feature fusion mechanism and self-supervised attention module is then used to infer in-vitro fixed TFMPEM volumetric images. The experimental results obtained for in-vitro drosophila mushroom body (MB) images show that the proposed method improves the structural similarity index measures (SSIMs) of the TFMPEM images acquired with a 10-ms exposure time from 0.38 to 0.93 and 0.80 for shallow- and deep-layer images, respectively. A 3D U-Net model, pretrained on in-vitro images, is further trained using a small in-vivo MB image dataset. The transfer learning network improves the SSIMs of in-vivo drosophila MB images captured with a 1-ms exposure time to 0.97 and 0.94 for shallow and deep layers, respectively.

© 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Multiphoton excitation microscopy (MPEM) has emerged as one of the most useful tools in biomedical research since its introduction in the 1990s [1]. In MPEM systems, the nonlinear excitation phenomenon that occurs at the front focal point of the objective provides exceptional spatial resolution, both laterally and axially. Furthermore, the near-infrared excitation wavelength provides a low-absorption window for biological specimens, which enables a deeper penetration depth. Additionally, non-centrosymmetric structures such as collagen and myosin can be examined by extracting the second harmonic generation (SHG) signal, which is also a nonlinear optical process [2,3]. Given the use of a scanning device, such as a galvo mirror [4], polygon scanner [5], or acousto-optic modulator [6], the focusing spot can be scanned through the region of interest on a point-by-point basis, thereby yielding detailed insights into the structural properties of the sample with superior spatial resolution. However, conventional point-scanning multiphoton excitation microscopy (PSMPEM) can only reach imaging speeds of around 30 frames per second (fps) [7]. To address this problem, several methods based on widefield illumination have been proposed, including temporal focusing multiphoton excitation microscopy (TFMPEM) [8–11], two-photon light sheet microscopy (2P-LSM) [12,13], and light-field microscopy with temporal focusing multiphoton illumination [14]. Typically, 2P-LSM systems use a cylindrical lens to generate a sheet of light, which is then swept through the sample with a mechanical scanner. The resulting fluorescence signal emitted in the orthogonal direction is captured by a low numerical aperture (NA) objective operated synchronously with the scanning device [12]. TFMPEM provides an alternative solution for plane-illuminated in-vivo biotissue imaging [8,9], in which a diffraction grating is used to disperse the laser pulse angularly based on the diffraction equation, and the pulse is then re-overlapped in phase at the front focal plane after an objective-based 4f system. The broadened pulse is thus reshaped back to its original width only at the focal plane, creating a sufficient photon density to excite a single plane for multiphoton excitation. Notably, the plane-illumination technique in TFMPEM does not require the use of a scanner in the transverse x-y plane, and hence the imaging process is faster than that of the conventional PSMPEM method. Furthermore, with an additional z-axis stage, TFMPEM can easily achieve 30 volumes per second for in-vivo volumetric imaging [15]. However, TFMPEM is based on widefield illumination, and hence the detection signal is affected by scattering, which results in crosstalk between the emitted photons on the 2D detector and degrades the image quality accordingly. Furthermore, the wavelength of the emitted photons is shorter than that of the excitation photons in two-photon microscopy systems, and thus they cannot travel as far through tissue as the excitation beam, particularly in thick turbid samples. As a result, a significant amount of information is either missed or distorted at the detector.

Various techniques have been proposed for enhancing the contrast and image quality of TFMPEM systems by suppressing the scattering effect. For instance, the HiLo technique [16,17] and nonlinear structured illumination method [19] both enable a higher lateral resolution and an improved sectioning ability. Moreover, line-scanning TFMPEM using a single scanner not only reduces the tissue scattering effect, but also enhances the axial excitation confinement by expanding the coverage of the laser beam on the back aperture of the objective lens [20]. However, these structured illumination methods require at least two images for reconstruction purposes, which reduces the acquisition speed of TFMPEM by 2-fold or more [11,16–19,21]. Several studies have shown that the integration of adaptive optics with the TFMPEM structure provides an effective means of compensating the aberrations and dispersion induced by the optical elements of the system, environmental disturbances, and tissue scattering. Adaptive optics allows TFMPEM to reach deeper into the biological sample while still maintaining the theoretical diffraction limit [22,23]. Although the optimization of the control loop can keep up with the image acquisition, the obtained image quality is still not comparable to that of traditional PSMPEM. In addition, expansion microscopy [24] enlarges the tissue volume by decrowding the biomolecules in an isotropic manner through a chemical process, which increases the importance of fast image acquisition when collecting images of bulky tissue. For instance, in connectome studies, such as those of drosophila [25] and mouse [26], TFMPEM is capable of acquiring a whole-brain image approximately 30 times faster than PSMPEM under a similar field of view, thereby providing an efficient way to obtain reference images for further applications.

Deep learning technology provides the possibility to dramatically improve the image quality in microscopy systems without compromising the acquisition speed [27]. Convolutional neural networks (CNNs), in particular, provide a powerful solution for many biomedical image-processing tasks, including segmentation [28], classification [29], and reconstruction [30]. However, CNNs require a considerable quantity of data to train the network parameters for a given task. CNN-based networks with a U-shape architecture, generally referred to simply as U-Net, have been shown to achieve comparable results with a relatively small amount of data [31–33]. However, single-stage models, which consist of only one network structure, frequently lose high-level contextual information or pixelwise spatial details. Accordingly, various multi-stage approaches have been proposed to restore the missing spatial details and high-level semantic information via the combination of multi-scale and single-scale feature extractions [34–36].

In the present study, an in-vitro TFMPEM image dataset of the mushroom body (MB) structure in the drosophila brain is used to train an image registration network and an image restoration network. The pretrained registration and restoration models are then used to restore in-vivo TFMPEM images. Since the input TFMPEM images and the ground truth images acquired via PSMPEM are not obtained from the same optical system, the physical pixel sizes are not matched and an obvious 3D displacement occurs between the image volumes. Therefore, cross-modality registration between the two sets of images is mandatory. The registration process between the input images and the ground truth images involves not only scaling and translation, but also 3D rotation. However, the obtained high-frame-rate TFMPEM images are highly scattered and relatively weak, with a very low signal-to-noise ratio (SNR). Consequently, a significant amount of feature information is lost for registration. To address this problem, a segmentation network is first utilized to extract the morphological features from the TFMPEM image. Scale-invariant feature transform (SIFT) [38], which registers images based on their local features, is one of the most commonly used registration methods. However, learning-based image registration provides an alternative approach with a higher efficiency and a lower time consumption [39]. Accordingly, in the present study a linear registration process is first performed to conduct global shifting of the PSMPEM images, and VoxelMorph is then used to perform nonlinear fine positioning of the images in an unsupervised learning fashion using a U-Net model without ground truth warp fields or any anatomical landmarks [40]. After the registration process, the in-vitro fixed TFMPEM volumetric images are restored via a multi-stage 3D U-Net using the registered PSMPEM volumetric images as ground truths. Finally, a 3D U-Net model based on the pretrained in-vitro model is further trained with a small in-vivo dataset in accordance with a transfer learning approach [41], and the transfer learning model is then used to restore in-vivo TFMPEM MB images.

2. Methodology

2.1 Temporal focusing multiphoton excitation microscopy and point-scanning multiphoton microscopy image as ground truth

Two-photon widefield fluorescence images of the drosophila MB were acquired via a self-built TFMPEM system. An Yb:KGW amplified laser (Pharos PH1-10, Light Conversion, Lithuania) with a central wavelength of 1030 nm was used as the light source, with a maximum pulse energy of 50 µJ/pulse, a repetition rate of 200 kHz, and a pulse width of approximately 220 fs. A mechanical shutter (VS14S-2-ZM-0-R3, Uniblitz, USA) was placed at the laser output to block the beam and prevent unwanted exposure of the biological sample. In addition, a half-wave plate and a polarization beam splitter were placed after the shutter to adjust the laser dosage on the sample and to achieve a horizontal polarization of the laser light. The beam was then expanded by a factor of 8X to maximize its coverage on a 600 lines/mm blazed grating. The grating was arranged such that the laser pulses were incident on the grating surface at an angle of 38.17°, thereby maximizing the efficiency of the 1st-order diffraction. The dispersed beam was diffracted with a divergence angle of 0.525°. A 4f configuration incorporating a water-immersion high-NA objective lens (W Plan-Apochromat 40X/1.0 DIC M27, Zeiss, Germany) was used to project the grating plane to the front focal plane of the objective, thereby creating high-throughput plane illumination on the sample. The emitted signal was collected by the same objective lens and then filtered by a dichroic mirror (FF670, Semrock, USA) and a short-pass filter (FF01-680, Semrock, USA) to block the reflected laser light. Finally, the signal was collected by a high-sensitivity EMCCD (Andor iXon Ultra 897, Oxford Instruments, UK) consisting of an array of 512 × 512 pixels, where each pixel had a size of 16 × 16 µm2. The illumination area under the microscope had a diameter of approximately 180 µm. The axial scanning process required to obtain the desired volume images was performed using a synchronized piezo stage (NanoScanZ 200, Prior Scientific, UK) with a maximum travel range of 200 µm. In performing the experiments, the camera was thermoelectrically cooled to -80°C to reduce electronic noise.
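
As a consistency check (our own calculation, not stated explicitly in the original description), the quoted incidence angle follows from the grating equation when the 1st-order diffracted beam is required to leave the 600 lines/mm grating along the optical axis of the subsequent 4f system:
$$d({\sin {\theta _i} + \sin {\theta _d}} )= m\lambda ,\qquad d = \frac{1}{{600}}\,\textrm{mm} \approx 1.667\,\mathrm{\mu m},$$
$$\theta _d = 0\;\Rightarrow\;\sin {\theta _i} = \frac{\lambda }{d} = \frac{{1.030\,\mathrm{\mu m}}}{{1.667\,\mathrm{\mu m}}} \approx 0.618\;\Rightarrow\;{\theta _i} \approx 38.2^\circ .$$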

The ground truth images were acquired via a PSMPEM system comprising a Ti:sapphire oscillator (Tsunami, Spectra-Physics, USA) with a central wavelength of 920 nm as the light source. 3D scanning was achieved using a galvanometer (6215 H, Cambridge, USA), a resonant mirror (RM) (CRS series, Cambridge, USA), and a piezo stage (PD72Z4CAA, Physik Instrumente, UK). To prevent uneven spacing arising from the natural behavior of the RM, the sampling period was determined by the pixel-enable signal of the RM produced by the pixel clock board, and only the near-linear region of the motion trajectory was utilized for sampling [42]. The excitation beam was focused on the sample by the same objective lens as that used in the TFMPEM system, and the fluorescence signal was collected by a photomultiplier tube (H7422-40, Hamamatsu, Japan). To obtain high-quality ground truth images, 100 PSMPEM images were acquired at a speed of 30 fps and accumulated to increase the SNR, resulting in an effective image acquisition rate of 0.3 fps. To ensure that the acquisition process did not affect the specimen's properties over time, the laser dosage was precisely calibrated to 47 mW/cm2 at the back focal plane of the objective lens. This adjustment helped to reduce the possibility of sample photobleaching. The overall image size was 512 × 512 pixels and corresponded to an image area of approximately 190 × 190 µm2, which represented a magnification of 1.05X compared to the temporal focusing imaging. As shown in Fig. 1, the image volumes obtained from the two modalities were first registered via a linear global transformation and nonlinear VoxelMorph and were then fed into a multi-stage U-Net restoration network. All of the networks in the proposed architecture were trained on a single Nvidia Tesla V100 GPU with 32 GB of memory.

Fig. 1. Schematic illustration of image acquisition, registration, and restoration processes.

2.2 Cross-modality registration process

The purpose of this study is to restore TFMPEM images via a multi-stage U-Net trained model. Therefore, in the registration process, the PSMPEM ground truth images (${\boldsymbol{I}_{PS}}$) were moved to register with the corresponding TFMPEM images (${\boldsymbol{I}_{TF}}$). Since the image pairs were not obtained from the same modality, the physical units of each pixel were manually matched by resizing the PSMPEM images. Moreover, each image pair was interpolated independently in the z-direction to match the physical spacing to that of the x- and y-directions, such that every axis shared the same spatial resolution in a single voxel. The two images were then downsized to 256 × 256 × 96 voxels for further registration. The detailed workflow of the cross-modality registration process is shown in Fig. 2. Due to the very low SNR produced by the rapid exposure time and the scattering effect in the TFMPEM system, the features in the TFMPEM images are difficult to distinguish from the background noise, and hence intensity-based registration is extremely challenging. Thus, in the present study a U-Net semantic segmentation network stacked with 5 layers of residual building blocks [43] was first employed to discriminate the signal from the noise by sequentially inferring the boundary of the drosophila MB. The network was previously trained on a labeled TFMPEM dataset composed of images acquired with different exposure times. The segmented image was then taken as the fixed target for the PSMPEM image. The two images were initially globally aligned through a linear transformation process, which included translation and affine transform operations (denoted as c and $\varepsilon$, respectively), via Advanced Normalization Tools (ANTs) [44]. A U-Net-based VoxelMorph [40] was then used to nonlinearly refine the local deformation and perform motion correction in the case of the in-vivo drosophila images. In implementing the VoxelMorph network, the concatenation of the segmented TFMPEM volume (${\tilde{\boldsymbol{I}}_{TF}}$) and the shifted PSMPEM volume (${\boldsymbol{I}_{shif}}$) was taken as the input. The encoder part consisted of 3D convolution layers with a kernel size of 3 × 3 × 3 and a stride of 2, followed by leaky rectified linear units (ReLUs) and maxpooling in each layer. The features of the input pair were captured to estimate the vector field ($\phi$), a 3D matrix representing the deformation of each voxel. In the decoder part, a sequence of convolutions and upsampling with concatenation was used to propagate the features learned in the encoding part and obtain the final $\phi$. A spatial transformation was then applied to warp the shifted volume via $\phi$ into the registered PSMPEM volume (${\boldsymbol{I}_{reg}}$). The learned $\phi$ was evaluated and updated via the following loss function based on the mean square error (MSE) between ${\boldsymbol{I}_{TF}}$ and ${\boldsymbol{I}_{reg}}$:

$$\mathrm{{\cal L}}({{\boldsymbol{I}_{TF}},\,{\boldsymbol{I}_{reg}},\,\phi } )= MSE({{\boldsymbol{I}_{TF}},{\boldsymbol{I}_{reg}}(\phi )} )+ \sigma \mathop \sum \nolimits_{p \in \mathrm{\Omega }} \left\|\nabla \phi (p )\right\|^2. $$

As shown in Eq. (1), to avoid spatial deformation of local pixels, a smoothness term was incorporated into the loss function to regularize the spatial gradient of $\phi$, with the constant σ set equal to 0.1.
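
A minimal PyTorch sketch of the loss in Eq. (1) is given below; the tensor layout and the finite-difference approximation of the spatial gradient are our assumptions, since the implementation details are not specified in the text.

```python
import torch

def registration_loss(i_tf, i_reg, phi, sigma=0.1):
    """Sketch of Eq. (1), assuming 5D PyTorch tensors.

    i_tf, i_reg : (B, 1, D, H, W) fixed (segmented TFMPEM) and registered PSMPEM volumes.
    phi         : (B, 3, D, H, W) deformation field predicted by VoxelMorph.
    """
    # Similarity term: mean square error between the fixed and registered volumes.
    mse = torch.mean((i_tf - i_reg) ** 2)

    # Smoothness term: squared forward differences of the deformation field
    # along z, y, and x, approximating the spatial gradient of phi.
    dz = phi[:, :, 1:, :, :] - phi[:, :, :-1, :, :]
    dy = phi[:, :, :, 1:, :] - phi[:, :, :, :-1, :]
    dx = phi[:, :, :, :, 1:] - phi[:, :, :, :, :-1]
    smooth = (dz ** 2).mean() + (dy ** 2).mean() + (dx ** 2).mean()

    return mse + sigma * smooth
```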

Fig. 2. Workflow of the cross-modality 3D image registration process. A segmentation network is first employed to extract the signal from the noise in ${I_{TF}}$. A global linear affine transformation is then performed using ANTs to align ${\boldsymbol{I}_{PS}}$ with ${\boldsymbol{I}_{TF}}$. Note that $\varepsilon$ and c denote linear affine transformation and translation, respectively. In addition, $\phi$ denotes the registration vector field learned from VoxelMorph and applied to ${\boldsymbol{I}_{shif}}$.

2.3 Multi-stage image restoration network

The modules of the multi-stage image restoration network proposed in this study were modified from MPRNet [37] and HINet [45]. Figure 3 presents a simplified view of the proposed architecture. Generally, each individual stage is based on an encoder-decoder U-Net architecture with a channel attention mechanism and consists of various functional modules. The input to the multi-stage restoration network was a 3D TFMPEM volume (${\boldsymbol{I}_{TF}}$) and a registered PSMPEM volume (${\boldsymbol{I}_{reg}}$), both comprising 256 × 256 × 96 voxels. The restoration network adopts a 3D convolution layer with a kernel size of 3 × 3 × 3 and a stride of 2, together with a channel attention block, at the beginning of each stage to extract the original feature map of the input image at different scales. A conventional channel attention module predicts and distributes weightings in accordance with the significance of the feature channels. In the present network, however, an efficient channel attention (ECA) [46] mechanism was adopted in order to alleviate the network load while avoiding dimensionality reduction and improving the weight predictions for the individual feature channels. The reliability of the channel attention weightings produced by the ECA block was further improved by taking into account the local interaction between each channel and its neighbors.
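
The ECA operation can be summarized by the short sketch below. This is our adaptation of the published 2D ECA block to 3D feature maps; the layer composition and the kernel size k = 3 are illustrative assumptions rather than the authors' stated configuration.

```python
import torch
import torch.nn as nn

class ECA3D(nn.Module):
    """Efficient channel attention for 3D feature maps (sketch following ECA-Net [46])."""
    def __init__(self, k_size: int = 3):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool3d(1)                 # global average pooling
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size,
                              padding=k_size // 2, bias=False)  # local cross-channel interaction
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                                       # x: (B, C, D, H, W)
        y = self.avg_pool(x)                                    # (B, C, 1, 1, 1)
        y = y.squeeze(-1).squeeze(-1).squeeze(-1).unsqueeze(1)  # (B, 1, C)
        y = self.sigmoid(self.conv(y))                          # per-channel weights without dimensionality reduction
        y = y.squeeze(1).view(x.size(0), x.size(1), 1, 1, 1)
        return x * y                                            # rescale the feature channels
```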

Fig. 3. Architecture of the proposed multi-stage restoration network. Note that, to simplify the figure, only three stages are shown for illustration purposes. More stages can be added to the architecture as required using the same configuration.

The 3D convolution process was followed by a half instance normalization (HIN) block and maxpooling to form a complete layer in the encoder structure. Typically, instance normalization is utilized in restoration networks for low-level digital images. However, such an approach has a high computational cost when applied to 3D volumetric images. Consequently, in the present study, a HIN block was introduced to normalize only half of the feature map while keeping the other half unchanged. This expanded the receptive field while improving the robustness of the feature map. The extracted intermediate features were then passed to the next layer for higher-level contextual information processing.
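
A 3D variant of the HIN block is sketched below, following the HINet design [45]; the convolution size, activation, and the assumption of an even channel count are ours.

```python
import torch
import torch.nn as nn

class HINBlock3D(nn.Module):
    """Half instance normalization block (3D sketch after HINet [45]).

    Only half of the feature channels are instance-normalized; the other half
    passes through unchanged. The channel count out_ch is assumed to be even.
    """
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)
        self.norm = nn.InstanceNorm3d(out_ch // 2, affine=True)
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        x = self.conv(x)
        x1, x2 = torch.chunk(x, 2, dim=1)           # split the channels in half
        x = torch.cat([self.norm(x1), x2], dim=1)   # normalize only the first half
        return self.act(x)
```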

In the decoder structure, a 3D convolution layer and a residual block (RES) followed by upsampling were used to extract the high-level features. Skip connections were implemented by placing additional ECAs at the output of the HIN block so as to retain only the most relevant components of the encoder. Moreover, an element-wise summation was adopted in the decoder rather than a direct concatenation of the feature maps in order to fuse the feature information without expanding the matrix. For the TFMPEM images considered in the present study, the encoder is presented with a large amount of spatial information with weak feature representations. The introduction of the additional ECAs in the skip connections actively suppressed activations in irrelevant regions and shifted the focus of the training process to the relevant information in each image, thereby reducing the computational cost. In the final layer of the decoder, a self-supervised attention module (SAM) [37] was introduced to provide a supervisory mechanism for the restoration process at the current stage. Specifically, the SAM generated an attention map by comparing the ground truth (${\boldsymbol{I}_{reg}}$) with the preliminary output of the decoder in the current stage, so as to suppress the unnecessary features at the current stage and propagate only the informative features to the next stage.
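
The SAM can be sketched as follows, adapted from the 2D attention module in MPRNet [37] to 3D volumes; the layer arrangement and channel counts are illustrative assumptions. The preliminary restored volume returned here is the quantity supervised against ${\boldsymbol{I}_{reg}}$ through the stage loss.

```python
import torch
import torch.nn as nn

class SAM3D(nn.Module):
    """Attention module at the end of each stage (3D sketch following MPRNet [37])."""
    def __init__(self, n_feat: int):
        super().__init__()
        self.conv_feat = nn.Conv3d(n_feat, n_feat, kernel_size=3, padding=1)
        self.conv_img = nn.Conv3d(n_feat, 1, kernel_size=3, padding=1)
        self.conv_attn = nn.Conv3d(1, n_feat, kernel_size=3, padding=1)

    def forward(self, feats, x_in):                       # feats: (B, C, D, H, W), x_in: (B, 1, D, H, W)
        restored = self.conv_img(feats) + x_in            # preliminary restored volume of this stage
        attn = torch.sigmoid(self.conv_attn(restored))    # attention map derived from the stage output
        feats_out = self.conv_feat(feats) * attn + feats  # gate the features passed to the next stage
        return feats_out, restored
```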

In the second and third stages, the input consisted of the concatenation of the original feature map and the output from the SAM block of the previous stage. The compositions of the encoder-decoder structures were similar to those of the first stage. Furthermore, a cross-stage feature fusion (CSFF) process was adopted to fuse the features and enhance the feature context in the current stage, as indicated by the red lines in Fig. 3. The CSFF process was performed at each layer via the pixelwise addition of the convolved outputs of the HIN block and RES block in the previous stage to the output of the HIN block at the current stage. Introducing the CSFF process into the network ensures that the multi-scale features are propagated to the end and hence avoids the information loss that might otherwise occur due to the upsampling and downsampling operations in the network. Moreover, the fusion of the feature hierarchy enables the features to be extracted from the local high-level semantic information in a more efficient manner.
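
The fusion at a single scale can be sketched as below; the use of 1 × 1 × 1 projection convolutions and equal channel counts across stages are our assumptions for illustration.

```python
import torch.nn as nn

class CSFF3D(nn.Module):
    """Cross-stage feature fusion at one encoder scale (sketch)."""
    def __init__(self, n_feat: int):
        super().__init__()
        self.proj_enc = nn.Conv3d(n_feat, n_feat, kernel_size=1)  # project previous-stage encoder feature
        self.proj_dec = nn.Conv3d(n_feat, n_feat, kernel_size=1)  # project previous-stage decoder feature

    def forward(self, cur_enc, prev_enc, prev_dec):
        # Pixelwise addition fuses the cross-stage context without enlarging the tensor,
        # in contrast to concatenation along the channel dimension.
        return cur_enc + self.proj_enc(prev_enc) + self.proj_dec(prev_dec)
```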

The loss function in the network was specified as the summation of the loss at each stage, where this loss was derived as the difference between the output restored image, ${\boldsymbol{I}_{res}}$, and the registered ground truth, ${\boldsymbol{I}_{reg}}$. In other words, the loss function was defined as the following Charbonnier loss:

$$\mathrm{{\cal L}} = \mathop \sum \nolimits_{n = 1}^k \sqrt {\left\|\boldsymbol{I}_{res}^n - {\boldsymbol{I}_{reg}}\right\|^2 + {\epsilon ^2}} , $$
where k denotes the total number of stages in the network, $\boldsymbol{I}_{res}^n$ represents the output restored image at the ${n^{\textrm{th}}}$ stage, and $\epsilon$ is an error constant set to $10^{-3}$ in the present experiments.
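
Following Eq. (2) literally, the total loss can be sketched as below; whether the squared norm is summed or averaged over voxels is not stated, so the reduction used here is an assumption.

```python
import torch

def charbonnier_multistage_loss(stage_outputs, i_reg, eps=1e-3):
    """Sketch of Eq. (2): sum of Charbonnier losses over the k stage outputs.

    stage_outputs : list of k restored volumes, one per stage.
    i_reg         : registered PSMPEM ground truth volume.
    """
    loss = 0.0
    for i_res in stage_outputs:
        loss = loss + torch.sqrt(torch.sum((i_res - i_reg) ** 2) + eps ** 2)
    return loss
```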

3. Experimental results and discussions

3.1 Cross-modality 3D image registration

Before performing the cross-modality 3D registration, the segmentation network was utilized to extract the morphological features from the background noise in the TFMPEM images, ${\boldsymbol{I}_{TF}}$. The segmented image ${\tilde{\boldsymbol{I}}_{TF}}$ was then used in the registration process, in which ${\tilde{\boldsymbol{I}}_{TF}}$ and ${\boldsymbol{I}_{PS}}$ were both volumes of 120 layers stacked with a step size of 1.25 µm. Figure 4(a) presents the volumetric registration results obtained via the linear ANTs transformation and the locally fine VoxelMorph network at six depths of 5, 17.5, 30, 42.5, 55, and 67.5 µm, where the ${\boldsymbol{I}_{TF}}$ and ${\boldsymbol{I}_{reg}}$ images are shown in green and magenta, respectively. In general, the results confirm that the registration process functions properly for the MB lobes at a shallower depth and the calyx structure at a deeper depth. The quality of the registration results was evaluated using the dice score based on the ratio of the overlap between ${\boldsymbol{I}_{TF}}$ and ${\boldsymbol{I}_{reg}}$, i.e.,

$$\textrm{Dice}({{\boldsymbol{I}_{reg}};{{\tilde{\boldsymbol{I}}}_{TF}}} )= 2\frac{{|{{\boldsymbol{I}_{reg}} \cap {{\tilde{\boldsymbol{I}}}_{TF}}} |}}{{|{{\boldsymbol{I}_{reg}}} |+ |{{{\tilde{\boldsymbol{I}}}_{TF}}} |}}. $$

Note that ${\boldsymbol{I}_{reg}}$ was binarized before the calculation. A higher dice score indicates a greater similarity between the two images, and a score of 1 means that the two images are identical. In the present study, the original ${\boldsymbol{I}_{TF}}$ images were noisy and blurred, and hence the segmented images, ${\tilde{\boldsymbol{I}}_{TF}}$, were used instead to perform the registration process. Figure 4(b) shows the calculated dice scores before and after registration at different depths. Notably, the registration process not only increases the dice score, but also induces a shift of the dice score curve to the left, indicating a translation of the ground truth images in the z-direction during the registration process, which is approximately 5 µm in the case of Fig. 4. It is noted that the dice scores are lower in the depth range of 25 µm to 55 µm due to a lack of features in the original PSMPEM images. Twenty image pairs were chosen at random to analyze the registration results. As shown in Fig. 4(c), the average dice score after the registration process was enhanced from 0.29 to 0.63, while the standard deviation was reduced from 0.14 to 0.07. In other words, the registration process enhanced the overlap of the images obtained from the two different modalities and showed good reliability. Following the registration process, the middle 96 layers were retained for further processing while the remainder were discarded in order to reduce the computation cost.
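
A straightforward NumPy sketch of the dice score in Eq. (3) is given below; the binarization threshold for ${\boldsymbol{I}_{reg}}$ is an assumed parameter, since the value used in the study is not stated.

```python
import numpy as np

def dice_score(i_reg, i_tf_seg, threshold=0.5):
    """Dice score of Eq. (3) for two volumes of equal shape.

    i_reg    : registered PSMPEM volume (binarized here with an assumed threshold).
    i_tf_seg : segmented TFMPEM volume (treated as a binary mask).
    """
    a = i_reg > threshold
    b = i_tf_seg > 0
    intersection = np.logical_and(a, b).sum()
    return 2.0 * intersection / (a.sum() + b.sum())
```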

Fig. 4. Registration results for the MB structures at different depths after the linear ANTs and VoxelMorph procedures. (a) Overlap of TFMPEM images (${\boldsymbol{I}_{TF}}$), shown in green, and registered PSMPEM images (${\boldsymbol{I}_{reg}}$), shown in magenta, at depths of 5 µm, 17.5 µm, 30 µm, 42.5 µm, 55 µm, and 67.5 µm. (b) Dice scores at different depths with a z step of 1.25 µm. (c) Statistical dice score distributions of 20 sets of registration results. Scale bar in (a) indicates 50 µm.

3.2 In-vitro TFMPEM image enhancement via two-stage U-Net

The applicability of the 3D U-Net model to in-vitro bioimage restoration was investigated using TFMPEM images of the MB structure of the drosophila brain. Here, the designed multi-stage network consisted of two stages, where the U-Net architecture in each stage comprised 5 layers. The training dataset contained 180 TFMPEM-PSMPEM volume pairs, where each volume had dimensions of 256 × 256 × 96 voxels. The loss converged after 300 epochs with 100 iterations per epoch, requiring approximately 30 hours of training. Additionally, 6 volume pairs were utilized as testing data. Figure 5 shows the restoration results obtained for TFMPEM images acquired with a 10-ms exposure time. Figures 5(a)–5(d) show the MB lobe structure at a shallower depth in the ground truth image ${\boldsymbol{I}_{PS}}$ (PSMPEM), the input image ${\boldsymbol{I}_{TF}}$ (TFMPEM), and the restored images $\boldsymbol{I}_{res}^1$ and $\boldsymbol{I}_{res}^2$ obtained from the first and second stages of the 2-stage network, respectively. Although the input image in Fig. 5(b) is blurred and noisy compared to the ground truth in Fig. 5(a), the morphologies of the MB lobes obtained from Stage 1 and Stage 2 of the restoration network (Fig. 5(c) and Fig. 5(d), respectively) are still close to that of the ground truth image. Figures 5(e)–5(h) show the restoration of the calyx part of the MB structure located at a greater depth in the volumetric image, where most of the Kenyon cells that participate in olfactory learning and memory are located. The SSIMs of the images acquired at depths of 0 to 25 µm were enhanced from 0.38 ± 0.02 in the original condition to 0.92 ± 0.01 and 0.93 ± 0.01 after Stage 1 and Stage 2, respectively. Thus, while both stages yielded an effective improvement in the SSIM of the images collected at a shallow depth, the second stage of the restoration process resulted in only a minor incremental improvement of the SSIM over that achieved in the first stage. A similar tendency was observed for the images acquired at depths of 57.5 to 100 µm, for which the SSIM increased from 0.33 ± 0.01 in the original condition to 0.79 ± 0.01 and 0.80 ± 0.01 after Stage 1 and Stage 2, respectively. However, the Stage 2 restoration results generally show more contextualized features than those of Stage 1 and a slightly greater contrast. Figures 5(i)–5(l) show the 3D volumes of the ground truth, original input, Stage 1, and Stage 2, respectively. The morphologies of the MB structure in Stage 1 and Stage 2 are similar to that of the original input.
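
For reference, the depth-resolved SSIM statistics reported above can be reproduced slice by slice with a sketch such as the following; the use of scikit-image and the per-slice data range are our assumptions, as the paper does not state which SSIM implementation was employed.

```python
import numpy as np
from skimage.metrics import structural_similarity

def depth_range_ssim(restored, ground_truth, z_indices):
    """Slice-wise SSIM over a depth range (evaluation sketch).

    restored, ground_truth : (D, H, W) volumes on a common intensity scale.
    z_indices              : iterable of slice indices, e.g. those spanning depths of 0-25 um.
    """
    scores = []
    for z in z_indices:
        data_range = ground_truth[z].max() - ground_truth[z].min()
        scores.append(structural_similarity(restored[z], ground_truth[z],
                                            data_range=data_range))
    return float(np.mean(scores)), float(np.std(scores))
```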

Fig. 5. In-vitro image restoration results obtained by the 2-stage network. (a)-(d) Restoration results for the drosophila MB lobes at a shallower depth. (e)-(h) Restoration results for the calyx structure at a deeper depth. (i)-(l) 3D volume images of the ground truth, input, Stage 1, and Stage 2, respectively. (m) and (n) Intensity profiles of the arrowed rows in the shallow and deep layers, respectively. Scale bar in (a) indicates 10 µm in (a)-(h). Volume size of (i)-(l) is 188 × 188 × 120 µm3.

Figures 5(m) and 5(n) show the intensity profiles of the lobe and calyx structures within the rows indicated by the arrows in the shallow- and deep-layer images in Figs. 5(a)–5(d) and Figs. 5(e)–5(h), respectively. It is seen that the contrast variations of the restored lobe and calyx images are not as precise as those of the ground truth images. For the images acquired at a deeper layer, the presence of scattering noise reduces the SNR of the input image and degrades the restoration performance. As a result, the SSIM scores of the Stage 1 and Stage 2 restoration results are lower than those of the corresponding results for the images acquired at a shallower depth.

3.3 Fast and clear in-vivo MB restoration via three-stage U-Net

To further investigate the feasibility of the proposed network for in-vivo rapid imaging, TFMPEM images acquired with a 1-ms exposure time were obtained for restoration purposes. Since vivisection of the drosophila brain requires sophisticated surgical experience, it is difficult to obtain in-vivo images. Moreover, the small size of the drosophila makes it difficult to stabilize the anatomical posture across experiments, especially when two different modalities are involved. In addition, shivering caused by the breathing and agitation of the live drosophila during the acquisition of the ground truth images resulted in motion artifacts between PSMPEM frames. To ensure the production of accurate and high-quality ground truth images, motion correction was implemented prior to image accumulation. Thus, the proposed network was built with more stages to restore in-vivo images with an extremely low SNR. Due to current hardware limitations, each volume in the training dataset, which comprised 150 volume pairs of in-vivo MB images, was disassembled into two 256 × 256 × 48 volumes to train the 3-stage U-Net with 5 layers per stage. The overall training time was approximately 30 hours, with loss convergence obtained after 300 epochs of 100 iterations each. Figures 6(a)–6(d) show the restoration results obtained in the shallower layer together with the ground truth image ${\boldsymbol{I}_{PS}}$ acquired via PSMPEM and the in-vivo input image ${\boldsymbol{I}_{TF}}$ obtained via TFMPEM. As shown in Figs. 6(c) and 6(d), the morphologies of the MB lobe obtained from Stages 2 and 3 of the 3-stage U-Net are not fully restored due to a lack of reliable information in the input image (Fig. 6(b)). Based on a statistical analysis of 20 image pairs selected at depths of 0 to 25 µm, the SSIMs of the input image, Stage 2 image, and Stage 3 image compared with the ground truth image were found to be 0.18 ± 0.01, 0.81 ± 0.04, and 0.85 ± 0.05, respectively. For the images of the calyx obtained in the deeper layer (Figs. 6(f)–6(i)), the restored images obtained from Stage 2 and Stage 3 (Figs. 6(h) and 6(i), respectively) lack contextual details since the SNR of the input image (Fig. 6(g)) is approximately -10.5 dB. In other words, the intensity level of the meaningful signal is lower than that of the noise. A statistical analysis of 40 image pairs selected at depths in the range of 37.5 to 60 µm showed that the SSIM values of the input, Stage 2, and Stage 3 images compared with the ground truth image were 0.19 ± 0.02, 0.76 ± 0.02, and 0.78 ± 0.02, respectively. The degradation of the restoration results might arise from the nature of the in-vivo experiments, which involve posture variation, time-sensitive intensity changes, and so on.

Fig. 6. Restoration results obtained for in-vivo imaging using the 3-stage network. (a)-(e) Ground truth PSMPEM image, input TFMPEM image, restored image from Stage 2, restored image from Stage 3, and restored image from Stage 3 of the transfer learning network. Note that all of these images show the lobe structure of the MB in the shallow layers. (f)-(j) Ground truth PSMPEM image, input TFMPEM image, restored image from Stage 2, restored image from Stage 3, and restored image from Stage 3 of the transfer learning network. Note that all of these images show the calyx structure of the MB in the deep layers. (k) and (l) Intensity profiles of the arrowed rows indicated in (a)-(e) and (f)-(j), respectively. Scale bar in (a) indicates 10 µm in (a)-(j). See Visualization 1 for 3D images of the whole MB.

Thus, a transfer learning approach was applied to improve the restoration accuracy of the 3-stage network for in-vivo MB images by further training the network (pretrained using in-vitro images) with an in-vivo dataset consisting of one PSMPEM volume and nine TFMPEM volumes, where each volume was composed of 120 layers. The overall image acquisition time for the 10 volumes was around 7 min. The training loss converged after 10 epochs, with 100 iterations per epoch. The cross-modality registration and transfer learning processes took approximately 600 sec and 4,000 sec, respectively. Thus, the overall post-hoc process required around 1.5 hours to complete. Figures 6(e) and 6(j) show the restoration results obtained for the MB lobes and calyx, respectively, following the transfer learning process. Comparing Figs. 6(e) and 6(d), it is seen that the transfer learning approach improves both the morphology and the local intensity distribution of the lobe structure. Figure 6(k) confirms that the intensity profile of the image obtained from the 3-stage transfer learning model provides a better fit to the ground truth profile than the profiles obtained from the original network trained on only in-vitro images. A similar tendency is noted for the restored image of the calyx structure in the deeper layers (Fig. 6(j)). The results presented in Fig. 6(l) show that the intensity profile of the restored image is in close agreement with that of the ground truth image, and the contrast is enhanced by more than 2-fold compared to the Stage 2 and Stage 3 images obtained from the original network. Although the contextual information in the individual cell bodies is not obvious, the local intensity shows that the accuracy of the cell positions is significantly enhanced. Thus, the SSIMs of the images restored using the 3-stage transfer learning model are further improved to 0.97 ± 0.02 and 0.94 ± 0.02 in the shallow and deep layers, respectively.
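
The fine-tuning step can be sketched as follows, reusing the Charbonnier loss sketched in Section 2.3. The checkpoint name, optimizer, learning rate, the assumed model interface returning one output per stage, and the decision to update all layers (rather than freezing part of the pretrained network) are our assumptions; the paper specifies only the epoch count and the small in-vivo dataset.

```python
import torch

def fine_tune(model, in_vivo_loader, epochs=10, lr=1e-4, device="cuda"):
    """Transfer-learning sketch: continue training the in-vitro pretrained network
    on the small in-vivo dataset."""
    model.load_state_dict(torch.load("pretrained_in_vitro.pt"))      # hypothetical checkpoint file
    model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for i_tf, i_reg in in_vivo_loader:                           # TFMPEM input / registered ground truth
            i_tf, i_reg = i_tf.to(device), i_reg.to(device)
            stage_outputs = model(i_tf)                              # assumed to return one output per stage
            loss = charbonnier_multistage_loss(stage_outputs, i_reg)  # loss sketched in Section 2.3
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```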

4. Conclusions

This study has presented a cross-modality multi-stage 3D U-Net for restoring TFMPEM images of the drosophila brain that are degraded by scattering effects, signal crosstalk, and background noise. In the proposed method, a cross-modality registration process consisting of a global linear transformation and a locally fine VoxelMorph transformation is first utilized to align the ground truth volumes obtained from PSMPEM with the segmented TFMPEM volumes. A two-stage 3D U-Net with cross-stage feature fusion is then employed to perform image restoration. The experimental results obtained for in-vitro drosophila brain images have shown that the proposed network enables the large-scale morphology of the MB structure to be restored and the noise in the original TFMPEM images to be significantly reduced. However, the local details of the images, such as the positions of the cell bodies, are slightly misjudged due to the low SNR. The experimental results have also shown that the addition of a further stage to the two-stage 3D U-Net structure enables the restoration of in-vivo drosophila MB images with a higher SSIM than that achieved using the original U-Net model. Transfer learning provides an opportunity to further enhance the SSIM performance of the restoration network. However, training the original network (pretrained on in-vitro images) using in-vivo images requires approximately 1.5 hours, which remains a major drawback when it comes to in-vivo brain imaging of drosophila. Future studies will investigate the feasibility of further improving the restoration performance of the proposed multi-stage 3D U-Net by increasing the number of layers in each individual stage in order to extract more contextual features, and by utilizing a concatenation approach in the skip connections rather than the pixelwise summation method adopted in the present network. Moreover, in the application of functional imaging, such as calcium signaling, deep reinforcement learning might be utilized to deal with the dynamic changes in the fluorescence signal, and an artificial dataset will be built based on theoretical neuroscience to provide a proper reward function for the network training.

Funding

National Yang Ming Chiao Tung University (NYCU) and Ministry of Education (MOE) (VGHUST112-G3-2-2); National Science and Technology Council of Taiwan (110-2221-E-A49-009, 110-2221-E-A49-059-MY3).

Disclosures

The authors declare no conflicts of interest.

Data availability

The data and results presented in this paper are not publicly available currently, but are available from the authors upon reasonable request.

References

1. W. Denk, J. H. Strickler, and W. W. Webb, “Two-photon laser scanning fluorescence microscopy,” Science 248(4951), 73–76 (1990). [CrossRef]  

2. P. J. Campagnola and L. M. Loew, “Second-harmonic imaging microscopy for visualizing biomolecular arrays in cells, tissues and organisms,” Nat. Biotechnol. 21(11), 1356–1360 (2003). [CrossRef]  

3. M.-R. Tsai, Y.-W. Chiu, M. T. Lo, and C.-K. Sun, “Second-harmonic generation imaging of collagen fibers in myocardium for atrial fibrillation diagnosis,” J. Biomed. Opt. 15(2), 026002 (2010). [CrossRef]  

4. G. Y. Fan, H. Fujisaki, R.-K. Tsay, R. Y. Tsien, and M. H. Ellisman, “Video-rate scanning two-photon excitation fluorescence microscopy,” Microsc. Microanal. 4(S2), 424–425 (1998). [CrossRef]  

5. K. H. Kim, C. Buehler, and P. T. So, “High-speed, two-photon scanning microscope,” Appl. Opt. 38(28), 6004 (1999). [CrossRef]  

6. Y. Kremer, J.-F. Léger, R. Lapole, N. Honnorat, Y. Candela, S. Dieudonné, and L. Bourdieu, “A spatio-temporally compensated acousto-optic scanner for two-photon microscopy providing large field of view,” Opt. Express 16(14), 10066 (2008). [CrossRef]  

7. K. Svoboda and R. Yasuda, “Principles of two-photon excitation microscopy and its applications to neuroscience,” Neuron 50(6), 823–839 (2006). [CrossRef]  

8. D. Oron, E. Tal, and Y. Silberberg, “Scanningless depth-resolved microscopy,” Opt. Express 13(5), 1468–1476 (2005). [CrossRef]  

9. G. Zhu, J. v. Howe, M. Durst, W. Zipfel, and C. Xu, “Simultaneous spatial and temporal focusing of femtosecond pulses,” Opt. Express 13(6), 2153–2159 (2005). [CrossRef]  

10. L.-C. Cheng, C.-Y. Chang, C.-Y. Lin, K.-C. Cho, W.-C. Yen, N.-S. Chang, C. Xu, C. Y. Dong, and S.-J. Chen, “Spatiotemporal focusing-based widefield multiphoton microscopy for fast optical sectioning,” Opt. Express 20(8), 8939–8948 (2012). [CrossRef]  

11. Y. Y. Hu, C.-Y. Lin, C.-Y. Chang, Y.-L. Lo, and S.-J. Chen, “Image improvement of temporal focusing multiphoton microscopy via superior spatial modulation excitation and Hilbert–Huang transform decomposition,” Sci. Rep. 12(1), 10079 (2022). [CrossRef]  

12. P. Mahou, J. Vermot, E. Beaurepaire, and W. Supatto, “Multicolor two-photon light-sheet microscopy,” Nat. Methods 11(6), 600–601 (2014). [CrossRef]  

13. S. Wolf, W. Supatto, G. Debrégeas, P. Mahou, S. G Kruglik, J.-M. Sintes, E. Beaurepaire, and R. Candelier, “Whole-brain functional imaging with two-photon light-sheet microscopy,” Nat. Methods 12(5), 379–380 (2015). [CrossRef]  

14. F.-C. Hsu, C.-Y. Lin, Y.-k. Hu, Y.-K. Hwu, A.-S. Chiang, and S.-J. Chen, “Light-field microscopy with temporal focusing multiphoton illumination for scanless volumetric bioimaging,” Biomed. Opt. Express 13(12), 6610–6620 (2022). [CrossRef]  

15. C.-Y. Chang, Y. Y. Hu, C.-Y. Lin, C.-H. Lin, H.-Y. Chang, S.-F. Tsai, T.-W. Lin, and S.-J. Chen, “Fast volumetric imaging with patterned illumination via digital micro-mirror device-based temporal focusing multiphoton microscopy,” Biomed. Opt. Express 7(5), 1727–1736 (2016). [CrossRef]  

16. H. Choi, E. Y. Yew, B. Hallacoglu, S. Fantini, C. J. Sheppard, and P. T. So, “Improvement of axial resolution and contrast in temporally focused widefield two-photon microscopy with structured light illumination,” Biomed. Opt. Express 4(7), 995–1005 (2013). [CrossRef]  

17. C.-Y. Chang, C.-H. Lin, C.-Y. Lin, Y.-D. Sie, Y. Y. Hu, S.-F. Tsai, and S.-J. Chen, “Temporal focusing-based widefield multiphoton microscopy with spatially modulated illumination for biotissue imaging,” J. Biophotonics 11(1), e201600287 (2018). [CrossRef]  

18. K. Isobe, T. Takeda, K. Mochizuki, Q. Song, A. Suda, F. Kannari, H. Kawano, A. Kumagai, A. Miyawaki, and K. Midorikawa, “Enhancement of lateral resolution and optical sectioning capability of two-photon fluorescence microscopy by combining temporal-focusing with structured illumination,” Biomed. Opt. Express 4(11), 2396–2410 (2013). [CrossRef]  

19. L.-C. Cheng, C.-H. Lien, Y. Da Sie, Y. Y. Hu, C.-Y. Lin, F.-C. Chien, C. Xu, C. Y. Dong, and S.-J. Chen, “Nonlinear structured-illumination enhanced temporal focusing multiphoton excitation microscopy with a digital micromirror device,” Biomed. Opt. Express 5(8), 2526–2536 (2014). [CrossRef]  

20. Y. Xue, K. P. Berry, J. R. Boivin, D. Wadduwage, E. Nedivi, and P. T. So, “Scattering reduction by structured light illumination in line-scanning temporal focusing microscopy,” Biomed. Opt. Express 9(11), 5654–5666 (2018). [CrossRef]  

21. C.-Y. Chang, C.-Y. Lin, Y. Y. Hu, S.-F. Tsai, F.-C. Hsu, and S.-J. Chen, “Temporal focusing multiphoton microscopy with optimized parallel multiline scanning for fast biotissue imaging,” J. Biomed. Opt. 26(01), 016501 (2021). [CrossRef]  

22. C.-Y. Chang, L.-C. Cheng, H.-W. Su, Y. Y. Hu, K.-C. Cho, W.-C. Yen, C. Xu, C. Y. Dong, and S.-J. Chen, “Wavefront sensorless adaptive optics temporal focusing-based multiphoton microscopy,” Biomed. Opt. Express 5(6), 1768–1777 (2014). [CrossRef]  

23. T. Ishikawa, K. Isobe, K. Inazawa, K. Namiki, A. Miyawaki, F. Kannari, and K. Midorikawa, “Adaptive optics with spatio-temporal lock-in detection for temporal focusing microscopy,” Opt. Express 29(18), 29021–29033 (2021). [CrossRef]  

24. A. T. Wassie, Y. Zhao, and E. S. Boyden, “Expansion microscopy: Principles and uses in biological research,” Nat. Methods 16(1), 33–41 (2019). [CrossRef]  

25. C.-T. Shih, O. Sporns, and A.-S. Chiang, “Toward the drosophila connectome: Structural analysis of the brain network,” BMC Neurosci. 14(S1), 63 (2013). [CrossRef]  

26. S. W. Oh, J. A. Harris, L. Ng, et al., “A mesoscale connectome of the mouse brain,” Nature 508(7495), 207–214 (2014). [CrossRef]  

27. D. P. Hoffman, I. Slavitt, and C. A. Fitzpatrick, “The promise and peril of deep learning in microscopy,” Nat. Methods 18(2), 131–132 (2021). [CrossRef]  

28. B. Kayalibay, G. Jensen, and P. van der Smagt, “CNN-based segmentation of medical imaging data,” arXiv, arXiv:1701.03056 (2017). [CrossRef]  

29. R. W. Oei, G. Hou, F. Liu, J. Zhong, J. Zhang, Z. An, L. Xu, and Y. Yang, “Convolutional neural network for cell classification using microscope images of intracellular actin networks,” PLoS One 14(3), e0213626 (2019). [CrossRef]  

30. B. Yao, W. Li, W. Pan, Z. Yang, D. Chen, J. Li, and J. Qu, “Image reconstruction with a deep convolutional neural network in high-density super-resolution microscopy,” Opt. Express 28(10), 15432–15446 (2020). [CrossRef]  

31. O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” Lect. Not. Comp. Sci. 9351, 234–241 (2015).

32. S. Lee, M. Negishi, H. Urakubo, H. Kasai, and S. Ishii, “Mu-net: Multi-scale U-net for two-photon microscopy image denoising and restoration,” Neural Netw. 125, 92–103 (2020). [CrossRef]  

33. Y. Huang, H. Zhu, P. Wang, and D. Dong, “Segmentation of overlapping cervical smear cells based on U-Net and improved level set,” Proc. IEEE Int. Conf. Syst. Man Cybern., 3031–3035 (2019).

34. F. Kokkinos and S. Lefkimmiatis, “Deep image demosaicking using a cascade of convolutional residual denoising networks,” Proc. Euro. Conf. Comp. Vis., 317–333 (2018).

35. Y. A. Farha and J. Gall, “MS-TCN: Multi-stage temporal convolutional network for action segmentation,” Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 3575–3584 (2019).

36. S. Voronin, “Multi-stage image restoration in high noise and blur settings,” Comput. Sci. Inf. Syst. 12(1), 72 (2019). [CrossRef]  

37. S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, M.-H. Yang, and L. Shao, “Multi-stage progressive image restoration,” Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 14821–14831 (2021).

38. D. G. Lowe, “Object recognition from local scale-invariant features,” Proc. IEEE Int. Conf. Comp. Vis., 1–8 (1999).

39. K. T. Islam, S. Wijewickrema, and S. O’Leary, “A deep learning based framework for the registration of three dimensional multi-modal medical images of the head,” Sci. Rep. 11(1), 1860 (2021). [CrossRef]  

40. G. Balakrishnan, A. Zhao, M. R. Sabuncu, J. Guttag, and A. V. Dalca, “VoxelMorph: A learning framework for deformable medical image registration,” IEEE Trans. Med. Imaging 38(8), 1788–1800 (2019). [CrossRef]  

41. C.-W. Hsu, C.-Y. Lin, Y. Y. Hu, C.-Y. Wang, S.-T. Chang, A.-S. Chiang, and S.-J. Chen, “Three-dimensional-generator U-net for dual-resonant scanning multiphoton microscopy image inpainting and denoising,” Biomed. Opt. Express 13(12), 6273–6283 (2022). [CrossRef]  

42. C.-W. Hsu, C.-Y. Lin, Y. Y. Hu, and S.-J. Chen, “Dual-resonant scanning multiphoton microscope with ultrasound lens and resonant mirror for rapid volumetric imaging,” Sci. Rep. 13(1), 163 (2023). [CrossRef]  

43. F. I. Diakogiannis, F. Waldner, P. Caccetta, and C. Wu, “ResUNet-A: A deep learning framework for semantic segmentation of remotely sensed data,” ISPRS J. Photogramm. Remote Sens. 162, 94–114 (2020). [CrossRef]  

44. B. B. Avants, N. J. Tustison, J. Wu, P. A. Cook, and J. C. Gee, “An open source multivariate framework for N-tissue segmentation with evaluation on public data,” Neuroinform 9(4), 381–400 (2011). [CrossRef]  

45. L. Chen, X. Lu, J. Zhang, X. Chu, and C. Chen, “HINet: Half instance normalization network for image restoration,” Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 182–192 (2021).

46. Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, and Q. Hu, “ECA-net: Efficient channel attention for deep convolutional neural networks,” Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 11534–11542 (2020).

Supplementary Material

Visualization 1: Related to Fig. 6.
