Snapshot super-resolution indirect time-of-flight camera using a grating-based subpixel encoder and depth-regularizing compressive reconstruction


Abstract

An indirect time-of-flight (iToF) camera is an inexpensive depth-map measurement device with a large pixel count; however, spatial resolution is generally lower than that of ordinary image sensors due to the more complicated sensor design for time-resolved measurement. To solve this problem, we apply the snapshot digital super-resolution method to an iToF camera employing compressive sensing and point-spread-function (PSF) engineering. For PSF engineering, we also propose the attachment of a diffraction grating onto a lens as an optical subpixel encoder. Furthermore, exploiting this iToF camera scheme, we also propose compressive reconstruction processing that regularizes a depth map directly. We quantitatively investigated the effectiveness of our method through simulations and verified it by optical experiments with a prototype.

© 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

In recent years, depth cameras have been applied in various vision systems such as authentication systems and automatic driving. The time-of-flight (ToF) sensing method measures distance from the travel time of light; to achieve this, the optical system has a light emitter and a detector that operate synchronously. Compared to triangulation [1] or depth-from-defocus [2], ToF-based methods, which include LiDAR [3] and ToF cameras [4], have several advantages such as superior noise-robustness, texture-independence, high distance accuracy, and low computational cost [5].

Among the ToF-based methods, a ToF camera equipped with a surface-emitting laser and a time-resolved image sensor has merit in terms of its space-bandwidth product compared to LiDAR, which involves a laser, a scanner, and a photodetector. ToF cameras can be broadly classified into two types: direct ToF (dToF) cameras and indirect ToF (iToF) cameras. dToF cameras directly measure the ToF with nanosecond-width pulse emission and time-gated detection [6]. The detector is an array of avalanche photodiodes, in particular, single-photon avalanche diodes (SPADs) [7,8]. When a photon is received, the detector outputs a digital pulse, and the ToF is measured via a time-to-digital converter (TDC).

Unlike the dToF camera, the iToF camera measures the ToF by sensing time-correlation data between the temporally-coded light arriving from the objects' reflection and temporally-coded detection, and the depth is calculated from these data. The top row of Fig. 1 shows a schematic diagram of the conventional pulse iToF camera [9,10]. The camera emits a sequence of rectangular light pulses (red waveform in the figure) in the time domain, and a photodiode array receives reflected light that temporally encodes the scene depth. Each photodiode is connected to two or more taps (each consisting of a charge transfer gate and a charge accumulator), and the transferred charge is accumulated with rectangular time-window functions (purple and green waveforms in the figure) without temporal overlap between taps [11]. This gives the time-correlation value for each tap as a number of electrons. The image sensor has independent readout circuits for each tap and outputs independent time-correlation intensity images (two images on the right in the figure) at once. From these images, the depth map can be uniquely calculated. Compared to the dToF method, the advantage of iToF is that the number of elements in a pixel can be reduced, mainly because there is no need for a time-calculation circuit such as a TDC. As a result, large-pixel-count ToF cameras are easier to implement.


Fig. 1. Top: an overview of a conventional pulse iToF camera. Surface-emitted pulse is radiated to the scene, and objects reflect it back to the camera. A photodetector array receives it, and each photodetector transfers charge to at least two accumulators with non-overlapping different rectangular time windows. Each tap (a charge transfer gate and a charge accumulator forming a pair) independently reads out two or more time-correlated intensity images at once. A depth map is obtained by simple calculation using the obtained time-correlated images. Bottom: The proposed iToF camera. A slanted diffraction grating is attached in front of the lens to engineer the PSF to optically encode subpixel information in time-correlated images. The spatially-coded time-correlated images are input to a compressive reconstruction algorithm using depth regularization to output an SR depth map.


As mentioned above, the iToF method can achieve a relatively large pixel count; however, the spatial resolution is still limited compared to that of ordinary cameras. This is because multiple taps and their readout circuits must be implemented in each pixel for time-resolved measurement. Therefore, it is difficult to achieve higher resolution by pixel miniaturization alone.

To overcome this limitation, the application of digital super-resolution (SR) methods, where resolution is up-converted beyond the sensor's sampling limit through signal processing, is promising [12]. A typical model-driven SR approach is multi-frame SR, which assumes multiple shots with subpixel shifts and applies registration processing [13]. Although its effectiveness has been well demonstrated in the past, it sacrifices temporal resolution in sensing. In recent years, many learning-based SR methods using only a single image input have also been proposed [14,15] and applied to SR dToF cameras [16,17]. Although deep-learning-based SR works very well in the learned domain, the fundamental problems of domain restriction and explainability of the outputs still pose challenges.

To realize snapshot and model-driven SR, compressive sensing (CS) [18,19] with point-spread-function (PSF) engineering [20–23] has been studied [24–28]. To achieve SR, the CS approach engineers the PSF to encode subpixel information in a single measurement. Since a captured coded image is at the sensor resolution, obtaining an SR image requires compressive reconstruction processing that solves the ill-posed inverse problem by exploiting sparse modeling of the scene via regularization.

In this study, we propose a novel snapshot SR iToF camera based on a CS framework with PSF engineering. A conceptual diagram of the proposed method is shown in the bottom part of Fig. 1. To achieve PSF engineering and optical subpixel encoding in an inexpensive and easily available way, a phase-type diffraction grating is attached in front of a lens. The low-resolution (LR) time-correlated intensity images are thereby spatially coded. By applying compressive reconstruction to them, an SR depth map and luminance map are obtained. To make SR successful, we also propose a regularization method for the depth map in CS reconstruction that exploits the design features of the iToF camera, which is expected to work better than conventional regularization in general scenarios where the depth map is sparser than the luminance map.

Our work provides the following contributions:

  • We proposed the first snapshot SR iToF camera and experimentally verified it. The method uses the CS framework with PSF engineering.
  • We proposed an inexpensive, easily available, and easily removable way to realize PSF engineering for CS just by using a diffraction grating placed in an inclined position.
  • Exploiting the design features of the iToF camera, we also proposed a regularization method for the depth map, which is suitable for the compressive reconstruction of natural scenes with a depth map that is sparser than the luminance map.

As related studies in the field of CS-based ToF imaging, there have been several studies on compressive LiDAR [5,29,30] based on multiple measurements using a single photodiode. In these methods, CS helps to decrease the total number of scans needed to obtain a depth map by using random intensity modulation with a digital micromirror device (DMD). Since these methods essentially assume a single photodiode, they cannot be applied directly to ToF cameras. CS has also been utilized to improve temporal SR in LiDAR and ToF cameras [31,32]; the present work, however, focuses on the spatial SR problem. In research on dToF cameras, Ref. [33] proposed a CS-based SR method using a coded aperture; however, the coded-aperture approach sacrifices light energy, and the applicability of the method to iToF cameras is not obvious. In research on iToF cameras, Ref. [34] also proposed the use of CS for spatial SR. The objective is the same as in the present study; however, it requires multiple measurements with DMD-based random intensity modulations at a relayed image plane. Since our method achieves snapshot operation using PSF engineering, it has merits in terms of temporal resolution and light-use efficiency.

2. Method

2.1 Conventional iToF

The forward model of lens imaging is represented by a two-dimensional (2D) convolution of the scene luminance with a PSF, multiplied by a downsampling matrix [35]. By using a downsampling matrix with equally spaced sampling ($\boldsymbol {\Phi }\in \mathbb {R}^{HW \times mHnW}$), this process can simply be expressed as follows:

$$\begin{aligned} \boldsymbol{g} &= \mathrm{Down}_\downarrow \left(\mathrm{conv}_\mathrm{2D}\left(\boldsymbol{f}*\mathrm{PSF}\right)\right)\\ &= \boldsymbol{\Phi}\boldsymbol{f}, \end{aligned}$$
where $\boldsymbol {g}\in \mathbb {R}^{HW \times 1}$ and $\boldsymbol {f}\in \mathbb {R}^{mHnW \times 1}$ denote an observed image whose pixel count is $H \times W$ and an object whose pixel count is $mH \times nW$, respectively. These vectors are non-negative. $\mathrm {Down}_\downarrow$ represents the downsampling function, and $\mathrm {conv}_{\mathrm {2D}}$ represents 2D convolution. Note that $m,n\in \mathbb {N}$ denote the magnification factors of SR along the two orthogonal axes, respectively, and that incoherent light is used in this study. Here we define two time-correlated images with different time windows in pulse iToF as $\boldsymbol {g}_\mathrm {1}$ and $\boldsymbol {g}_\mathrm {2}$, which are output from pairs of taps in a sensor. The intensity ratio data $\boldsymbol {g}_\mathrm {ToF}$ of the two time-correlated scene images, which is proportional to the depth map, can be obtained by pixel-wise calculations using the two time-correlated images as follows [9,10]:
$$\begin{aligned} \boldsymbol{g}_\mathrm{ToF} &= \frac{\boldsymbol{g}_\mathrm{2}}{\boldsymbol{g}_\mathrm{1} + \boldsymbol{g}_\mathrm{2}}\\ &= \frac{\boldsymbol{\Phi}\boldsymbol{f}_\mathrm{1}}{\boldsymbol{\Phi}\left(\boldsymbol{f}_\mathrm{1}+\boldsymbol{f}_\mathrm{2}\right)}. \end{aligned}$$

Note that the fractions of the matrices and vectors represent the element-wise division in this paper.

Considering that the ratio of the time of flight of light to the pulse width corresponds to the ratio data, the time of flight of light $\boldsymbol {t}_{\mathrm {flight}}\in \mathbb {R}^{HW \times 1}$ can be expressed as follows:

$$\begin{aligned}\frac{\boldsymbol{t}_{\mathrm{flight}}}{t_{\mathrm{pulse}}} &= \frac{1}{2}\, \boldsymbol{g}_{\mathrm{ToF}}\\ \boldsymbol{t}_{\mathrm{flight}}&= \frac{1}{2}\,t_{\mathrm{pulse}}\, \boldsymbol{g}_{\mathrm{ToF}} \end{aligned}$$
where $t_{\mathrm {pulse}}$ is the pulse width used in a pulse iToF camera. Therefore, the actual depth map $\boldsymbol {d}\in \mathbb {R}^{HW \times 1}$ can be calculated from the ratio data as follows:
$$\boldsymbol{d} = \frac{1}{2}\, t_{\mathrm{pulse}}\,\boldsymbol{g}_{\mathrm{ToF}}\,c ,$$
where $c$ is the speed of light in air. In a conventional iToF camera, the system matrix $\boldsymbol {\Phi }$ is just the downsampling matrix, so the captured time-correlated images contain no subpixel information.
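As a minimal numerical sketch of Eqs. (1)–(4), the forward model and the pixel-wise depth calculation can be written as follows (a sketch only: the array shapes, the scipy-based convolution, and the equally spaced slicing are our illustrative choices, not the authors' implementation):

```python
import numpy as np
from scipy.signal import fftconvolve

C_AIR = 3.0e8  # approximate speed of light in air [m/s]

def forward(f, psf, m=2, n=2):
    """Eq. (1): 2D convolution with the (possibly engineered) PSF followed by m x n downsampling."""
    blurred = fftconvolve(f, psf, mode="same")
    return blurred[::m, ::n]                 # equally spaced sampling (Phi applied to f)

def depth_from_taps(g1, g2, t_pulse, eps=1e-12):
    """Eqs. (2)-(4): element-wise ratio of the two time-correlated images -> depth map [m]."""
    g_tof = g2 / (g1 + g2 + eps)             # Eq. (2); eps avoids division by zero
    return 0.5 * t_pulse * g_tof * C_AIR     # Eqs. (3)-(4)
```

Here `forward` would be applied independently to the two scene images $\boldsymbol{f}_1$ and $\boldsymbol{f}_2$ to obtain $\boldsymbol{g}_1$ and $\boldsymbol{g}_2$.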

2.2 Snapshot SR iToF by CS with PSF engineering

In the PSF-engineered iToF camera, the system matrix $\boldsymbol {\Phi }$ can be designed so as to record subpixel information. For a naive application of CS to an iToF camera, after PSF-engineered imaging, it suffices to solve the inverse problem of Eq. (1) for the time-correlated images independently. The top of Fig. 2 shows such a pipeline. The corresponding compressive reconstruction algorithm for each SR scene luminance image is as follows:

$$\boldsymbol{f}^\star_i = \mathop{\arg\, \min\,} \limits_{\boldsymbol{f}_i} \left\lVert \boldsymbol{g}_\mathrm{i}-\boldsymbol{\Phi}\boldsymbol{f}_i\right\rVert_2^2 + \tau_i R(\boldsymbol{f}_i),$$
where $\boldsymbol {f}^\star _i$ is the estimated SR scene luminance image, $i$ is the index of the tap, $\tau _i\in \mathbb {R}$ is a coefficient of the regularization term, and $R(\cdot )$ is the regularizer. In this study, we use 2D total variation (TV) as a regularizer in CS [36]. The computation of TV is defined as follows:
$$\mathrm{TV}\left(\boldsymbol{I}\right) = \sum_\mathrm{pixels}\sqrt{ (G_x (\boldsymbol{I}_i))^2 + (G_y(\boldsymbol{I}_i))^2 },$$
where $\sum _\mathrm {pixels}$ denotes the sum over all pixels $i$, and $G_x (\boldsymbol {I}_i)$ and $G_y(\boldsymbol {I}_i)$ denote the first derivatives of the image $\boldsymbol {I}$ at pixel $i$ along the $x$ and $y$ axes, respectively. After reconstructing the SR scene luminance images, the SR depth map can be calculated in the same simple way as in the conventional iToF camera.
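A short sketch of the isotropic TV term of Eq. (6), using forward differences (purely illustrative; this is not the internal TwIST implementation):

```python
import numpy as np

def total_variation(img, eps=1e-12):
    """Isotropic 2D TV: sum over pixels of sqrt(Gx^2 + Gy^2), Eq. (6)."""
    gx = np.diff(img, axis=1, append=img[:, -1:])  # forward difference along x
    gy = np.diff(img, axis=0, append=img[-1:, :])  # forward difference along y
    return float(np.sum(np.sqrt(gx**2 + gy**2 + eps)))
```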


Fig. 2. Reconstruction pipelines. Top: a pipeline that naively implements the conventional compressive super-resolution model. The area to the left of the lens represents the reconstruction result. This reconstruction process is achieved by iterating forward and backward computations. This pipeline performs super-resolution independently for each of the two time-windowed luminance images; therefore, the regularization targets are the luminance images. Bottom: proposed method. The pipeline backpropagates the reconstruction loss to the estimated depth map and the estimated single luminance image. This pipeline allows direct reconstruction of the depth map with regularization, which is the main objective of the ToF camera.


2.3 Snapshot SR iToF by CS with PSF engineering and depth regularization

As in the above formulation, the naive application of CS to an iToF camera is based on independent regularization of multiple luminance maps; however, the main quantity of interest for ToF cameras is the depth map. Furthermore, in some transformed spaces, the depth map is expected to be sparser than the luminance map in natural scenes. Thus, regularizing the depth map directly is expected to benefit SR.

Based on this idea, we also propose a compressive reconstruction method that regularizes the depth map. The bottom of Fig. 2 shows the concept. In contrast to the naive method, in this reconstruction algorithm, the reconstruction error is backpropagated not only to the estimated time-correlated luminance maps of the scene but also to the estimated depth map. Owing to this backpropagation, regularization can be applied to the depth map itself, so the processing directly exploits the sparseness of the scene depth map.

In the formulation of the proposed method, we define an original SR scene luminance image $\boldsymbol {p} \in \mathbb {R}^{mHnW \times 1}$ as follows:

$$\boldsymbol{p}=\boldsymbol{f}_\mathrm{1}+\boldsymbol{f}_\mathrm{2}.$$

In addition, we also define a luminance ratio image $\boldsymbol {r} \in \mathbb {R}^{mHnW \times 1}$ as follows:

$$\boldsymbol{r}=\frac{\boldsymbol{f}_\mathrm{2}}{\boldsymbol{p}},$$
where the ratio image corresponds to the depth map up to a constant factor. The formulation of the proposed method is as follows:
$$\begin{aligned} \boldsymbol{r}^\star,\boldsymbol{p}^\star{=} \mathop{\arg\, \min\,}_{\boldsymbol{r},\boldsymbol{p}} &\,\left\lVert \boldsymbol{g}_\mathrm{1}-\boldsymbol{\Phi} \left(\boldsymbol{p}\odot\left(1-\boldsymbol{r}\right)\right) \right\rVert_2^2\\ +&\, \left\lVert \boldsymbol{g}_\mathrm{2}-\boldsymbol{\Phi} \left(\boldsymbol{p}\odot\boldsymbol{r}\right) \right\rVert_2^2\\ +&\,\tau\mathrm{R}\left(\boldsymbol{r}\right), \end{aligned}$$
where $\boldsymbol {r}^\star$ and $\boldsymbol {p}^\star$ are the estimated SR luminance-ratio and original SR luminance images, respectively, and $\odot$ denotes the Hadamard product. This study utilizes TwIST [37] with TV as the regularization method, and UDN [38,39] with a CNN generator in place of an explicit regularization term. The latter method is expressed as follows:
$$\begin{aligned}\boldsymbol{r}^\star,\boldsymbol{p}^\star{=} \mathop{\arg\, \min\,}_{\boldsymbol{r},\boldsymbol{p}} &\,\left\lVert \boldsymbol{g}_\mathrm{1}-\boldsymbol{\Phi} \left(\mathrm{G}\left(\boldsymbol{z};\boldsymbol{W}_{\mathrm{p}}\right)\odot\left(1-\mathrm{G}\left(\boldsymbol{z};\boldsymbol{W}_{\mathrm{r}}\right)\right)\right) \right\rVert_2^2\\ +&\, \left\lVert \boldsymbol{g}_\mathrm{2}-\boldsymbol{\Phi} \left(\mathrm{G}\left(\boldsymbol{z};\boldsymbol{W}_{\mathrm{p}}\right)\odot\mathrm{G}\left(\boldsymbol{z};\boldsymbol{W}_{\mathrm{r}}\right)\right) \right\rVert_2^2, \end{aligned}$$
where our networks $\mathrm {G}\left (\boldsymbol {z};\boldsymbol {W}_{\mathrm {r}}\right )$ and $\mathrm {G}\left (\boldsymbol {z};\boldsymbol {W}_{\mathrm {p}}\right )$ have a fixed input $\boldsymbol{z}$ and randomly initialized weights $\boldsymbol {W}_{\mathrm {r}}$ and $\boldsymbol {W}_{\mathrm {p}}$. These networks, when used as generators, function as regularizers.
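A minimal, runnable PyTorch sketch of the objective in Eq. (9), using a plain Adam loop and placeholder operators (the system operator `phi`, the TV term, and all sizes and constants are illustrative stand-ins; the study itself uses TwIST and UDN rather than this generic gradient descent):

```python
import torch

def tv(img, eps=1e-12):
    """Differentiable isotropic TV used as R(r) in Eq. (9)."""
    gx = img[:-1, 1:] - img[:-1, :-1]
    gy = img[1:, :-1] - img[:-1, :-1]
    return torch.sqrt(gx**2 + gy**2 + eps).sum()

def phi(x):
    """Placeholder system operator: 2x2 downsampling only (no PSF blur)."""
    return x[::2, ::2]

g1 = torch.rand(240, 320)  # placeholder tap-1 measurement
g2 = torch.rand(240, 320)  # placeholder tap-2 measurement

p = torch.rand(480, 640, requires_grad=True)  # SR total-luminance estimate
r = torch.rand(480, 640, requires_grad=True)  # SR ratio (scaled depth) estimate
opt = torch.optim.Adam([p, r], lr=1e-4)
tau = 1e-4

for _ in range(200):  # a few iterations for illustration (the study ran 20000)
    opt.zero_grad()
    loss = ((g1 - phi(p * (1.0 - r)))**2).sum() \
         + ((g2 - phi(p * r))**2).sum() \
         + tau * tv(r)  # Eq. (9): two data-fidelity terms + TV on the ratio map
    loss.backward()
    opt.step()
```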

2.4 Ambient-light subtraction

In practical use, a ToF camera is sometimes affected by ambient light other than the emitted pulse, although this can be suppressed to some extent by a bandpass filter. If the ambient-light signal alone is measured before imaging, its effect can be subtracted from the detected images. Here we denote the pre-measured image of the ambient light as $\boldsymbol {a}$ and the two time-correlated scene images including the ambient light as $\boldsymbol {f}_1'$ and $\boldsymbol {f}_2'$. The subtraction of the ambient light in a conventional iToF camera is modeled as follows:

$$\begin{aligned} \boldsymbol{f}_1 &= \boldsymbol{f}_1' - \boldsymbol{a},\\ \boldsymbol{f}_2 &= \boldsymbol{f}_2' - \boldsymbol{a}. \end{aligned}$$

Now we consider ambient-light removal in the PSF-engineered observation. Analogously to the conventional case, let $\boldsymbol {g}_0'$, $\boldsymbol {g}_1'$, and $\boldsymbol {g}_2'$ denote the captured ambient-light image and the time-correlated images from taps 1 and 2 obtained by the PSF-engineered camera. The ambient-light-free observations $\boldsymbol {g}_1$ and $\boldsymbol {g}_2$ can be obtained as follows:

$$\begin{aligned} \boldsymbol{g}_0'&= \boldsymbol{\Phi}\boldsymbol{a},\\ \boldsymbol{g}_1 &= \boldsymbol{\Phi}\boldsymbol{f}_1\\ &= \boldsymbol{\Phi}\left(\boldsymbol{f}_1 + \boldsymbol{a}\right) - \boldsymbol{\Phi}\boldsymbol{a}\\ &= \boldsymbol{g}_1' - \boldsymbol{g}_0',\\ \boldsymbol{g}_2 &= \boldsymbol{\Phi}\boldsymbol{f}_2\\ &= \boldsymbol{\Phi}\left(\boldsymbol{f}_2 + \boldsymbol{a}\right) - \boldsymbol{\Phi}\boldsymbol{a}\\ &= \boldsymbol{g}_2' - \boldsymbol{g}_0'.\\ \end{aligned}$$

The above Eq. (12) indicates that ambient light can be removed even with the PSF-engineered iToF camera in the same way as with the conventional one.
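A trivial sketch of the subtraction in Eq. (12) (the clipping to non-negative values is our own addition for numerical safety and is not part of the model):

```python
import numpy as np

def subtract_ambient(g_raw, g_ambient):
    """Remove the pre-measured ambient-light image g0' from a coded tap image g_i'."""
    return np.clip(g_raw - g_ambient, 0.0, None)
```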

3. Simulation

3.1 Setup

We first verified and analyzed the proposed method by numerical simulation. In this simulation, we assumed for simplicity that the ambient light does not change temporally and can be subtracted. A 3D model of the target scene was created using computer-graphics (CG) rendering software. Figure 3 shows the rendered target scenes. As targets, we used two scenes: one includes a floating resolution chart, and the other includes trees with a complicated depth map. The resolution-chart scene was used for evaluating the spatial resolution after SR, and the tree scene was used for clarifying the limitation of depth regularization. The captured time-correlated images were calculated from the 3D model. Note that the specifications and schematics of the pulse iToF camera in the simulations assumed an existing consumer product (BEC80T BLUE by Brookman Technology), which was also used in the optical experiment described below.


Fig. 3. Luminance and depth map of target scenes for simulations rendered by 3DCG software. Target 1 (floating resolution chart) is for evaluating spatial resolution, while target 2 (tree objects) is for clarifying the limitation of depth regularization.


In the simulation, we set the magnification factors of SR to $m=2$ and $n=2$. The imaging process was simulated by convolution of the PSF and the scene at SR resolution, nearest-neighbor-based downsampling by the image sensor, application of the two time-correlated detections, and noise addition. Note that here we approximated the PSF as being invariant to depth and shift for simplicity. The effect of defocus on the PSF is analyzed in the Experiment section. The pixel count of each captured image was set to $240\times 320$, and the pixel count of the reconstruction data was set to $480\times 640$. For the noise, we added $40$ dB additive white Gaussian noise to the captured images. The compressive reconstruction process was implemented to solve the CS-based minimization problems of Eqs. (5) and (9). To solve them, we used the TwIST algorithm [37] with a TV prior and the untrained deep generative network (UDN) method [39] with a deep image prior [38]. Both are widely used iterative algorithms that minimize the objective function with gradient-based updates. As the initial solution in the TwIST reconstruction, we used a depth map simply obtained by bicubic interpolation of the LR data. For TwIST, the regularization constant $\tau$ was set to $10^{-4}$, and the iteration count was set to $20000$. For UDN, the objective function was the mean squared error only, the Adam optimizer [40] was used, the learning rate was $10^{-4}$, and the iteration count was fixed at $20000$.
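As a sketch of the noise model used above (interpreting the 40 dB figure as the ratio of mean signal power to noise power; this interpretation and the seeding are our assumptions):

```python
import numpy as np

def add_awgn(img, snr_db=40.0, seed=0):
    """Add white Gaussian noise to a simulated capture at the given SNR in dB."""
    rng = np.random.default_rng(seed)
    noise_std = np.sqrt(np.mean(img**2) / 10.0**(snr_db / 10.0))
    return img + rng.normal(0.0, noise_std, img.shape)
```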

3.2 Analysis of PSF

First, we examined the effect of the PSF pattern. Arbitrary PSF patterns can theoretically be implemented via the design of a generalized pupil function [35], and the design can be carried out based on Fourier optics. Even for phase-only modulation, which can be implemented by a glass plate with a height map such as a kinoform, the required function can be calculated with a phase-retrieval algorithm [41,42].

In the spatial domain, one of the parameters for evaluating the suitability of a PSF for computational imaging is its autocorrelation [43]. Manually designed PSFs that minimize the autocorrelation are shown in Figs. 4(a) and (b); they differ in total spot count. The more spots there are, the more the autocorrelation can be suppressed, but the intensity per spot becomes lower. Figure 4(c) corresponds to a PSF that can be implemented just by using a diffraction grating. The merit of using a diffraction grating as a subpixel encoder is its ease of implementation, because it has only a spatially periodic and axially binary structure. Furthermore, many types of diffraction gratings are commercially available and easy to obtain. Therefore, if a grating-based PSF remains useful for compressive SR problems, as theoretically designed low-autocorrelation PSFs are, its use is much more reasonable in practice. The autocorrelation of the grating PSF is naturally larger than that of the other patterns because the pattern is periodic; however, its split spots still appear to encode subpixel-shift information. Note that the ToF camera uses illumination with a narrow-band spectrum, such as laser light; thus, there is no blur caused by chromatic dispersion. Figures 4(d)–(f) are more practical versions of the patterns in Figs. 4(a)–(c) including aberration, respectively. When solving the inverse problem, aberrations are expected to be compensated up to the cutoff frequency at the same time the image is reconstructed. We need to tilt the grating because observations must encode subpixel information in both the x- and y-directions. In this case, the angle was set to $11.3$ degrees without any particular preference, because the value of the autocorrelation depends on the characteristics of the grating itself, regardless of the angle.
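As a sketch of how the autocorrelation criterion can be evaluated for the candidate PSFs (an FFT-based circular autocorrelation with the largest off-peak value as a simple scalar summary; Ref. [43] may use a different scalar criterion):

```python
import numpy as np

def autocorrelation_offpeak(psf):
    """Return the normalized autocorrelation of a PSF and its largest off-peak value.

    A smaller off-peak maximum suggests a PSF better suited to compressive reconstruction.
    """
    ac = np.abs(np.fft.ifft2(np.abs(np.fft.fft2(psf))**2))  # Wiener-Khinchin theorem
    ac = np.fft.fftshift(ac) / ac.max()                     # peak normalized to 1
    peak = np.unravel_index(np.argmax(ac), ac.shape)
    off_peak = ac.copy()
    off_peak[peak] = 0.0
    return ac, float(off_peak.max())
```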


Fig. 4. PSFs tested by simulations. Manually designed ideal PSFs to minimize the autocorrelation whose total spot count is (a) 8 and (b) 16. (c) PSF by a tilted diffraction grating, which still encodes the subpixel information. (d)–(f) Aberrated PSFs of those in the above row. Note that the dynamic range of the PSF image differs because the total light energy is fixed.


Figure 5 shows the reconstruction results of the depth map with each PSF, including quantitative evaluation using the peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM) index [44]. Here, the TwIST algorithm with a TV prior was chosen for the reconstruction. In the case without aberration, as shown in the top row of Fig. 5, the designed PSFs showed better reconstruction results than the diffraction-grating PSF, as expected. However, in the case with aberration, the diffraction-grating PSF returned better results than the others. This result means that the peak intensity of the PSF, i.e., good optical properties in terms of local SNR, can have a more dominant effect on the reconstruction accuracy than the autocorrelation of the PSF, i.e., good mathematical properties, when the measurement includes aberrations, as occurs in an actual optical system. Based on this observation and on the ease of practical implementation, in this work we adopted a diffraction grating as the optical subpixel encoder for PSF engineering.


Fig. 5. Reconstructed depth maps with changing PSF patterns. The reconstruction algorithm was fixed to the TwIST method. Columns are sorted by PSF patterns, while rows are sorted by the presence/absence of aberration. Quantitative evaluations for each result are included as PSNR/SSIM.


3.3 Analysis of reconstruction method

Next, we evaluated the reconstruction results of the depth map for each reconstruction algorithm. Here, the diffraction-grating-based PSF was used. Figure 6 compares the LR measurement and the reconstruction results with TwIST and UDN. For both TwIST and UDN, the naive and proposed regularization strategies are also compared; the former corresponds to luminance-based regularization and the latter to depth-map-based regularization.


Fig. 6. Reconstructed depth maps with changing reconstruction methods. For the PSF pattern, a diffraction grating was used, considering the presence of aberration. The top row shows the reconstructed depth maps of target 1. The bottom row shows the results of target 2 involving depth fluctuations. Quantitative evaluations for each result in terms of PSNR/SSIM are also shown.


For the resolution-chart scene, all the SR results resolved the higher-frequency structure better than the LR result, as seen particularly in the red close-ups of the resolution-chart target. For both TwIST and UDN, the proposed depth-regularization results show higher PSNR than the naive regularization results, which is most evident in the reduced noise in the blue close-ups. Comparing TwIST and UDN, the results appear similar, but the UDN methods were superior to the TwIST methods in terms of PSNR.

For the natural scene with tree objects, the effectiveness of CS-based SR is more limited than for the resolution-chart scene because the depth map is less sparse. This is a reasonable result that indicates a general limitation of CS-based methods, whose effectiveness depends on the sparseness of the data. It also suggests that our method is effective when the assumption that the depth map of the scene is sparse holds. This target also returned better reconstruction PSNR with UDN than with TwIST, which shows that noise suppression by a deep-image prior worked better than a TV prior for this kind of non-planar object.

In conclusion, for both scenes, the proposed PSF-engineering-based compressive SR iToF with depth regularization improved the spatial resolution and PSNR of the depth map even in the presence of noise, though the magnitude of the improvement depended on the sparseness of the scene. In particular, the proposed UDN method correctly reconstructed the depth map of the background brick wall and flower beds, which are textured in the luminance domain. Since depth maps are often sparser than luminance maps in general scenes, there are many applications where the proposed method is effective.
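For reference, scores of the kind reported above can be computed with standard tools; a sketch using scikit-image (the data-range normalization is our assumption, not necessarily that used for the figures):

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_depth(depth_est, depth_gt):
    """PSNR and SSIM of a reconstructed depth map against the ground truth."""
    data_range = float(depth_gt.max() - depth_gt.min())
    psnr = peak_signal_noise_ratio(depth_gt, depth_est, data_range=data_range)
    ssim = structural_similarity(depth_gt, depth_est, data_range=data_range)
    return psnr, ssim
```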

4. Experiment

4.1 Setup

We also verified our method through optical experiments with a prototype. Figure 7(a) shows the experimental setup, which consists of the iToF camera and target objects. As targets, a stuffed toy known as SANKEN, one of the symbols of Osaka University (SANKEN toy), and a plaster statue of David (David) were placed at different distances in front of the ToF camera. Beside each target, a small ladder was placed to clarify the spatial-resolution improvement by SR. As in the simulation, the SR factors were set to $n=2$ and $m=2$, i.e., a $2 \times 2$ SR problem.


Fig. 7. (a) Experimental setup. The objects were placed at 1.5 m (David) and 3.0 m (SANKEN toy) in front of the iToF camera. (b) Frontal view of the prototype of the proposed PSF-engineered ToF camera equipped with a tilted diffraction grating acting as a phase mask in front of the lens. Two light sources were installed on the left and right of the lens mount and emitted a sequence of light pulses. (c) The implemented diffraction grating.


Before imaging, the SR PSF, which is necessary for compressive reconstruction, was calibrated by the multi-frame SR method, where a set of subpixel-shifted LR PSFs was measured and registered to form a single SR PSF. The subpixel shift was realized by a 3-axis motorized stage (OSMS20-85(XYZ) by SIGMAKOKI) on which the iToF camera was mounted.
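As a minimal sketch of how the subpixel-shifted LR PSF captures can be combined into a single 2 x 2 SR PSF (assuming ideal half-pixel stage shifts and simple interleaving; the actual registration procedure used for the prototype may differ):

```python
import numpy as np

def interleave_sr_psf(lr_psfs):
    """Form a 2x SR PSF from four LR PSFs captured at (0, 0), (0, 0.5), (0.5, 0),
    and (0.5, 0.5) pixel shifts; lr_psfs is a dict keyed by (dy, dx) in LR pixels."""
    h, w = lr_psfs[(0.0, 0.0)].shape
    sr = np.zeros((2 * h, 2 * w))
    for (dy, dx), psf in lr_psfs.items():
        sr[int(round(2 * dy))::2, int(round(2 * dx))::2] = psf  # place each capture on its subgrid
    return sr / sr.sum()  # normalize the total energy
```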

Figure 7(b) shows a frontal view of the prototype of the proposed PSF-engineered ToF camera. The ToF camera (BEC80T BLUE by Brookman Technology) was equipped with two VCSEL light sources with a wavelength of 850 nm. The VCSEL light sources emitted a sequence of light pulses with a width of 34 ns. The camera used a 3-tap image sensor with a pixel count of $240 \times 320$ and a fill factor of about 43 ${\%}$. Each tap accumulated signals output from a common photodiode over different time windows, and the final signals were read out as independent time-resolved images via independent circuits. Taps 1 and 2 accumulate charge in 35-ns-wide rectangular time windows, and the time window of tap 2 starts 35 ns later than that of tap 1, as illustrated in Fig. 1. Tap 3, which did not detect any emitted light, was used for ambient-light removal. Eventually, two time-correlated images and one ambient-light-only image were read out in a snapshot. In the experiment, we cropped the central $128 \times 128$ pixels from each readout and used them for the experiments. The camera was also provided with a lens whose focal length was 6 mm (13FM06IR by Tamron).

In front of the lens, we placed a phase-type diffraction grating with a tilt of $11.3$ degrees. Figure 7(c) shows the grating in a circular mount and an object seen through it. The grating had 1D periodic $30\,\mathrm {\mu m}$-wide grooves with $30\,\mathrm {\mu m}$ intervals on a $2.0$ mm thick quartz glass plate. The depth of each groove was $1.0\,\mathrm {\mu m}$. The grating was made by a photolithography process, and its mount was fabricated by a 3D printer (Adventurer4 by FLASHFORGE).
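For reference, the phase depth that such a groove imposes at the 850 nm operating wavelength can be roughly estimated as follows (assuming a refractive index of about 1.45 for the quartz plate at 850 nm; this estimate is ours and is not stated in the fabrication data):

$$\Delta\phi = \frac{2\pi\left(n_{\mathrm{quartz}}-1\right)d_{\mathrm{groove}}}{\lambda} \approx \frac{2\pi \times 0.45 \times 1.0\,\mathrm{\mu m}}{0.85\,\mathrm{\mu m}} \approx 3.3\ \mathrm{rad} \approx 1.06\,\pi,$$

i.e., the grooves act approximately as a binary $\pi$ phase grating, which splits the incident light into a few diffraction orders rather than a single focused spot, consistent with the split-spot PSF shown in Fig. 8(a) below.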

Figure 8(a) shows the calibrated engineered SR PSF using the grating. The PSF was experimentally obtained by capturing a point source, a near-infrared light-emitting diode (LED), at a distance of $3.0$ m using the proposed ToF camera. Thanks to the grating, the spot was split into approximately three slightly aberrated spots with tilted alignment, which encode subpixel information in a single measurement. Figures 8(b) and (c) are the time-correlated images of the targets in Fig. 7(a) captured by taps 1 and 2 of the iToF camera.


Fig. 8. (a) Calibrated SR PSF via multi-frame SR method. Scale bar indicates 100 $\mu$m at the sensor plane. (b),(c) Two time-correlated images measured by the proposed camera, which correspond to tap 1 and tap 2 of the sensor. Scale bar indicates 100 mm at the front object plane.


4.2 Experiment with natural objects

Figure 9 shows the reconstruction results of the depth map. In this experiment, the PSF was acquired at the depth of the SANKEN toy. Looking at the edge of the SANKEN toy and the ladder beside it, the high-frequency structure was successfully resolved by all the SR methods. Compared to the naive regularization methods, the proposed TwIST and UDN methods output less noisy depth maps, particularly at the ladder gaps. On the other hand, for the David object, the high-frequency structure seemed to be resolved by the SR methods regardless of the regularization method; however, there were reconstruction errors, particularly around the edges. This is due to the difference between the PSF pre-calibrated at the rear object and the defocused PSF at the front object, and to the fact that the PSF changes steeply at edges. In other words, this is a general depth-of-field (DoF) problem in coded imaging. To address this problem, one approach may be to use multiple defocused PSFs in the reconstruction processing. Another approach is engineering a depth-invariant PSF, such as a PSF using a radial mask [45] or an odd-symmetry spiral phase grating [46], which can also extend the DoF. In the optical experiment under a real-world environment, we confirmed that TwIST and UDN did not make a large difference in the reconstructed SR depth maps. In terms of computation time, however, the reconstruction with 20000 iterations took about 60 seconds for TwIST and about 2500 seconds for UDN on a computer with a GPU (GeForce RTX 3090 by NVIDIA), 256 GiB of RAM, and a 16-core CPU (Ryzen Threadripper PRO 3955WX by AMD) in a Python environment.


Fig. 9. Left column: the measured LR and interpolated depth maps. Middle column: reconstructed SR depth maps with naive regularization. Right column: the proposed reconstructed SR depth maps with depth regularization. In the middle and right columns, the top row indicates the results using TwIST, and the bottom row indicates the results using UDN. The pixel count of LR measurement was $128 \times 128$, and $2 \times 2$ compressive SR processing was applied to reconstruct $256 \times 256$ data. The scale bar indicates 40 mm at the front object plane.


4.3 Experiment with resolution chart

Next, to further clarify the improvement in resolution, a resolution chart fabricated by a 3D printer (Adventurer4 by FLASHFORGE) was also captured by the proposed ToF camera. The obtained LR and reconstructed SR depth maps are shown in Fig. 10. The distance to the object was set to 1.0 m, and the PSF was calibrated at the same distance. The purple line in the figure shows the vertical line profile of the reconstructed depth maps within the purple rectangle, where horizontal information was averaged. The structure in the purple rectangle corresponds to 0.214 cycles/mm along the vertical direction at the object plane. The compressive SR methods successfully resolved this structure, whereas the LR and interpolated depth maps did not. This result more clearly indicates that our compressive SR iToF imaging method with PSF engineering successfully improved the spatial resolution even in a real-world optical experiment with our prototype system.


Fig. 10. The LR measurement and SR reconstructed depth maps with a resolution-chart target. The pixel count of the raw LR measurement was $128 \times 128$, and $2 \times 2$ SR was applied. Purple rectangles indicate close-ups and purple lines indicate the line profile along the vertical direction inside the rectangle, where horizontal information was averaged. Scale bar indicates 40 mm at the object plane.


5. Conclusion

In this study, we proposed a snapshot SR iToF camera using a diffraction-grating-based optical subpixel encoder and compressive reconstruction processing that regularizes the depth map. We numerically and experimentally verified the validity and effectiveness of the proposed method. In conclusion, we confirmed that the SR reconstruction of the depth map worked well. A structure that the conventional method did not resolve was successfully resolved even in real-world optical experiments. We also confirmed that the depth regularization improved the reconstruction accuracy of the SR depth map.

The hardware of the proposed SR iToF camera can be implemented simply by adding a diffraction grating in front of a lens. A further advantage of our proposed method is that the optical element for PSF engineering can be realized using only a readily available and inexpensive diffraction grating. The proposed method is effective for depth-imaging applications where spatial resolution is important, such as remote surveillance. The experimental results also revealed that reconstruction errors may occur with the current method, especially in areas outside the DoF where the depth changes abruptly. However, in applications where depth precision has priority over spatial resolution, this can easily be addressed simply by removing the grating.

Future work will address the DoF-limitation problem that arises with a single PSF calibration. Technically, the application of multiple PSF calibrations or a depth-invariant PSF design should be able to resolve this problem.

Funding

Fusion Oriented REsearch for disruptive Science and Technology (JPMJFR206K).

Acknowledgements

This work was partly supported by Sony Semiconductor Solutions Corporation and Konica Minolta Imaging Science Encouragement Award.

Disclosures

The authors have no conflicts of interest to declare.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, “DTAM: dense tracking and mapping in real-time,” in 2011 International Conference on Computer Vision, (IEEE, 2011), pp. 2320–2327.

2. A. Levin, R. Fergus, F. Durand, and W. T. Freeman, “Image and depth from a conventional camera with a coded aperture,” ACM Trans. Graph. 26(3), 70 (2007). [CrossRef]  

3. B. Schwarz, “LiDAR: mapping the world in 3D,” Nat. Photonics 4(7), 429–430 (2010). [CrossRef]  

4. S. Foix, G. Alenya, and C. Torras, “Lock-in time-of-flight (ToF) cameras: a survey,” IEEE Sens. J. 11(9), 1917–1926 (2011). [CrossRef]  

5. A. Colaco, A. Kirmani, G. A. Howland, J. C. Howell, and V. K. Goyal, “Compressive depth map acquisition using a single photon-counting detector: parametric signal processing meets sparsity,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, (IEEE, 2012), pp. 96–102.

6. C. Niclass, A. Rochas, P. A. Besse, and E. Charbon, “Design and characterization of a CMOS 3-D image sensor based on single photon avalanche diodes,” IEEE J. Solid-State Circuits 40(9), 1847–1854 (2005). [CrossRef]  

7. K. Morimoto, A. Ardelean, M.-L. Wu, A. C. Ulku, I. M. Antolovic, C. Bruschini, and E. Charbon, “Megapixel time-gated SPAD image sensor for 2D and 3D imaging applications,” Optica 7(4), 346–354 (2020). [CrossRef]  

8. F. Piron, D. Morrison, M. R. Yuce, and J.-M. Redouté, “A review of single-photon avalanche diode time-of-flight imaging sensor arrays,” IEEE Sens. J. 21(11), 12654–12666 (2021). [CrossRef]  

9. K. Yasutomi and S. Kawahito, “Lock-in pixel based time-of-flight range imagers: an overview,” IEICE Trans. Electron. E105.C(7), 301–315 (2022). [CrossRef]  

10. K. Yasutomi, Y. Okura, K. Kagawa, and S. Kawahito, “A sub-100 μm-range-resolution time-of-flight range image sensor with three-tap lock-in pixels, non-overlapping gate clock, and reference plane sampling,” IEEE J. Solid-State Circuits 54(8), 2291–2303 (2019). [CrossRef]  

11. K. Kagawa, “Functional imaging with multi-tap CMOS pixels,” ITE Trans. on Media Technol. Appl. 9(2), 114–121 (2021). [CrossRef]  

12. S. C. Park, M. K. Park, and M. G. Kang, “Super-resolution image reconstruction: a technical overview,” IEEE Signal Process. Mag. 20(3), 21–36 (2003). [CrossRef]  

13. F. Li, P. Ruiz, O. Cossairt, and A. K. Katsaggelos, “Multi-frame super-resolution for time-of-flight imaging,” in ICASSP 2019 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (IEEE, 2019), pp. 2327–2331.

14. R. Dahl, M. Norouzi, and J. Shlens, “Pixel recursive super resolution,” in 2017 IEEE International Conference on Computer Vision (ICCV), (IEEE, 2017), pp. 5449–5458.

15. Y. Rivenson, Z. Göröcs, H. Günaydin, Y. Zhang, H. Wang, and A. Ozcan, “Deep learning microscopy,” Optica 4(11), 1437–1443 (2017). [CrossRef]  

16. V. Poisson, V. T. Nguyen, W. Guicquero, and G. Sicard, “Luminance-depth reconstruction from compressed time-of-flight histograms,” IEEE Trans. Comput. Imaging 8, 148–161 (2022). [CrossRef]  

17. G. Mora-Martín, S. Scholes, A. Ruget, R. Henderson, J. Leach, and I. Gyongy, “Video super-resolution for single-photon lidar,” Opt. Express 31(5), 7060–7072 (2023). [CrossRef]  

18. D. L. Donoho, “Compressed sensing,” IEEE Trans. Inf. Theory 52(4), 1289–1306 (2006). [CrossRef]  

19. E. Candès and M. Wakin, “An introduction to compressive sampling,” IEEE Signal Process. Mag. 25(2), 21–30 (2008). [CrossRef]  

20. A. Greengard, Y. Y. Schechner, and R. Piestun, “Depth from diffracted rotation,” Opt. Lett. 31(2), 181–183 (2006). [CrossRef]  

21. S. R. P. Pavani, M. A. Thompson, J. S. Biteen, S. J. Lord, N. Liu, R. J. Twieg, R. Piestun, and W. E. Moerner, “Three-dimensional, single-molecule fluorescence imaging beyond the diffraction limit by using a double-helix point spread function,” Proc. Natl. Acad. Sci. U. S. A. 106(9), 2995–2999 (2009). [CrossRef]  

22. Y. Shechtman, “Recent advances in point spread function engineering and related computational microscopy approaches: from one viewpoint,” Biophys. Rev. 12(6), 1303–1309 (2020). [CrossRef]  

23. Y. Kozawa, T. Nakamura, Y. Uesugi, and S. Sato, “Wavefront engineered light needle microscopy for axially resolved rapid volumetric imaging,” Biomed. Opt. Express 13(3), 1702–1717 (2022). [CrossRef]  

24. A. Ashok and M. A. Neifeld, “Pseudorandom phase masks for superresolution imaging from subpixel shifting,” Appl. Opt. 46(12), 2256–2268 (2007). [CrossRef]  

25. T. Niihara, R. Horisaki, M. Kiyono, K. Yanai, and J. Tanida, “Diffraction-limited depth-from-defocus imaging with a pixel-limited camera using pupil phase modulation and compressive sensing,” Appl. Phys. Express 8(1), 012501 (2015). [CrossRef]  

26. A. Stern, Optical Compressive Imaging (CRC Press, 2017).

27. V. Sitzmann, S. Diamond, Y. Peng, X. Dun, S. Boyd, W. Heidrich, F. Heide, and G. Wetzstein, “End-to-end optimization of optics and image processing for achromatic extended depth of field and super-resolution imaging,” ACM Trans. Graph. 37(4), 1–13 (2018). [CrossRef]  

28. K. Monakhova, K. Yanny, N. Aggarwal, and L. Waller, “Spectral DiffuserCam: lensless snapshot hyperspectral imaging with a spectral filter array,” Optica 7(10), 1298–1307 (2020). [CrossRef]  

29. G. A. Howland, P. Zerom, R. W. Boyd, and J. C. Howell, “Compressive sensing LIDAR for 3D imaging,” in CLEO:2011 - Laser Applications to Photonic Applications, (OSA, Washington, D.C., 2011), 2, p. CMG3.

30. D. J. Lum, S. H. Knarr, and J. C. Howell, “Frequency-modulated continuous-wave LiDAR compressive depth-mapping,” Opt. Express 26(12), 15420–15435 (2018). [CrossRef]  

31. F. Mochizuki, K. Kagawa, S. Okihara, M.-W. Seo, B. Zhang, T. Takasawa, K. Yasutomi, and S. Kawahito, “Single-event transient imaging with an ultra-high-speed temporally compressive multi-aperture CMOS image sensor,” Opt. Express 24(4), 4155–4176 (2016). [CrossRef]  

32. M. Horio, Y. Feng, T. Kokado, T. Takasawa, K. Yasutomi, S. Kawahito, T. Komuro, H. Nagahara, and K. Kagawa, “Resolving multi-path interference in compressive time-of-flight depth imaging with a multi-tap macro-pixel computational CMOS image sensor,” Sensors 22(7), 2442 (2022). [CrossRef]  

33. A. Kadambi and P. T. Boufounos, “Coded aperture compressive 3-D LIDAR,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (IEEE, 2015), pp. 1166–1170.

34. F. Li, H. Chen, A. Pediredla, C. Yeh, K. He, A. Veeraraghavan, and O. Cossairt, “CS-ToF: High-resolution compressive time-of-flight imaging,” Opt. Express 25(25), 31096–31110 (2017). [CrossRef]  

35. J. W. Goodman, Introduction to Fourier Optics (McGraw-Hill, 1996).

36. L. I. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal algorithms,” Phys. D 60(1-4), 259–268 (1992). [CrossRef]  

37. J. M. Bioucas-Dias and M. A. Figueiredo, “A new TwIST: Two-step iterative shrinkage/thresholding algorithms for image restoration,” IEEE Trans. on Image Process. 16(12), 2992–3004 (2007). [CrossRef]  

38. D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Deep image prior,” Int. J. Comput. Vis. 128(7), 1867–1888 (2020). [CrossRef]  

39. K. Monakhova, V. Tran, G. Kuo, and L. Waller, “Untrained networks for compressive lensless photography,” Opt. Express 29(13), 20913–20929 (2021). [CrossRef]  

40. D. P. Kingma and J. Ba, “Adam: a method for stochastic optimization,” in 3rd International Conference on Learning Representations (ICLR), (2015), pp. 1–15.

41. J. R. Fienup, “Phase retrieval algorithms: a comparison,” Appl. Opt. 21(15), 2758–2769 (1982). [CrossRef]  

42. V. Boominathan, J. K. Adams, J. T. Robinson, and A. Veeraraghavan, “PhlatCam: designed phase-mask based thin lensless camera,” IEEE Trans. Pattern Anal. Mach. Intell. 42(7), 1618–1629 (2020). [CrossRef]  

43. Y. Lee, H. Chae, K. C. Lee, N. Baek, T. Kim, J. Jung, and S. A. Lee, “Fabrication of integrated lensless cameras via UV-imprint lithography,” IEEE Photonics J. 14(2), 1–8 (2022). [CrossRef]  

44. Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. on Image Process. 13(4), 600–612 (2004). [CrossRef]  

45. T. Nakamura, S. Igarashi, S. Torashima, and M. Yamaguchi, “Extended depth-of-field lensless camera using a radial amplitude mask,” in Imaging and Applied Optics Congress, (OSA, Washington, D.C., 2020), p. CW3B.2.

46. P. R. Gill and D. G. Stork, “Lensless ultra-miniature imagers using odd-symmetry spiral phase gratings,” in Imaging and Applied Optics, (OSA, Washington, D.C., 2013), p. CW4C.3.
