
Motion deblurring using spatiotemporal phase aperture coding

Open Access

Abstract

Motion-related image blur is a known issue in photography. In practice, it limits the exposure time while capturing moving objects; thus, achieving proper exposure is difficult. Extensive research has been carried out to compensate for it, to allow increased light throughput without motion artifacts. In this work, a joint optical-digital processing method for motion deblurring is proposed and demonstrated. Using dynamic phase coding in the lens aperture during the image acquisition, the motion trajectory is encoded in an intermediate optical image. This coding embeds cues for both the motion direction and extent by coloring the spatial blur of each object. These color cues serve as guidance for a digital deblurring process, implemented using a convolutional neural network (CNN) trained to utilize such coding for image restoration. Particularly, unlike previous optical coding solutions, our strategy encodes cues with no limitation on the motion direction, and without sacrificing light efficiency. We demonstrate the advantage of the proposed approach over blind deblurring methods with no optical coding, as well as over other solutions that use coded acquisition, in both simulation and real-world experiments.

© 2020 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. INTRODUCTION

Finding the proper exposure setting is a well-known challenge in photography. In general, one has to balance aperture size, exposure time, and gain to achieve a good image (the trade-off between these factors is sometimes referred to as the “exposure triangle”). This balancing process involves many trade-offs and, therefore, requires considerable skill and experience. In many cases, a large exposure is necessary so that a sufficient amount of light reaches the sensor under the given lighting conditions, which usually cannot be controlled. To increase the amount of light at the sensor plane, one may increase the aperture size. However, a large aperture results in a shallow depth-of-field and increased sensitivity to optical aberrations. Increasing the sensor gain intensifies the image signal, at the price of a higher noise level. Increasing the exposure time allows more light to be integrated on the image sensor, but introduces motion blur, caused by movement of objects, camera shake, or both.

Various efforts have been dedicated to automatically balancing the exposure parameters [1]. Yet, such solutions either are very specific to the scenario or provide unsatisfactory performance. A different approach tries to eliminate one (or more) of the exposure triangle vertices by developing methods that correct the artifacts introduced by an unbalanced exposure. For example, one may apply a large gain and then perform a denoising operation [2–5], increase the aperture size and restore the blurred image using out-of-focus deblurring algorithms [6], or take a long exposure and then remove the motion-related blur [7], which is the focus of this work.

In addition to the pure post-processing methods, solutions based on computational imaging [8] attempt to analyze the scenario globally and then redesign the whole imaging system. In such an approach, the image acquisition is manipulated in a way that (generally) leads to an intermediate image of low quality. However, this image is distorted in a very specific way, such that it encodes information acquired during the exposure. Such information encoding is designed so that it can be employed in the post-processing stage for the final image restoration. Recently, the utilization of deep learning (DL) as a framework for both image processing and end-to-end design has become widespread [9]. Such methods have been demonstrated for various applications, including extended depth-of-field (EDOF) [10–15], hyper/multi-spectral imaging [16,17], lensless imaging [18–20], depth estimation [11,21–23], computational microscopy [24–26], and motion deblurring [27–32], to name a few.

A. Previous Work

Various works have proposed manipulating the imaging process during exposure to allow better motion deblurring. Such approaches include the use of hybrid imaging [31], a light field camera [32], or rolling shutter effects [33,34]. In this work, our focus is on spatiotemporal coding schemes. We now detail two such prior frameworks.


Fig. 1. Motion deblurring using spatiotemporal phase aperture coding: a moving scene is captured using a camera with spatiotemporal phase aperture coding, which generates a motion-color-coded PSF. The PSF coding serves as a prior for the CNN that performs blind spatially varying motion deblurring.


Raskar et al. developed a temporal amplitude coding of the aperture to counteract motion blur [27]. A conventional continuous exposure is analyzed as a wide temporal box filter with a narrow frequency response, which limits the motion deblurring performance. Using this analysis, the authors propose a temporal amplitude coding of the aperture (referred to as a “fluttered shutter”), obtained by alternately closing and opening it during the exposure with some pre-determined timing (i.e., a “temporal code”). Such coding generates a much wider frequency response, which in turn is utilized for improved motion deblurring performance. While this approach achieves very good results, it requires prior knowledge of the motion direction and extent. In addition, it suffers from reduced light efficiency due to the closing of the aperture during half of the exposure. A follow-up work analyzes fluttered-shutter point-spread function (PSF) estimation as part of the deblurring process, thus avoiding the requirement for prior knowledge of the motion parameters [28]. However, such a code design requires a compromise in the deblurring performance, to achieve both PSF invertibility and estimation ability. Moreover, since it also relies on temporal amplitude coding, light efficiency is still decreased. A rigorous analysis of the design and implementation of a fluttered shutter camera appears in [35,36]. Jeon et al. extended the method to multi-image photography using complementary fluttering patterns [37].

Another approach, by Levin et al., searches for a sensor motion that leads to a motion-invariant PSF [29]. Assuming such a PSF, one may perform non-blind deconvolution (using the known kernel) on the entire image at once, without the need to estimate the motion trajectory of each object. After a rigorous analysis, it is shown that a parabolic motion of the image sensor during exposure leads to the desired motion-invariant PSF. Intuitively, one may think of this image acquisition technique as a process in which every moving object, at least for a fraction of the exposure, moves at the same velocity as the sensor (assuming the velocity is inside a predefined range). Since each object is “tracked” by the camera for one brief moment, and in the rest of the exposure it moves relative to the camera, the blur of all objects turns out to be similar (for the full analysis, see [29]). This allows application of a conventional deblurring approach that assumes a uniform blur. While this is a major advantage, the approach has a serious limitation: the PSF encoding is limited to the axis in which the parabolic motion takes place. If an object moves in a different direction, the motion-invariant PSF assumption no longer holds, and the performance degrades. For movement in the orthogonal direction, the deblurring ability is completely lost. To solve this issue, a follow-up work [30] proposed an advanced solution based on two images taken with two orthogonal parabolic motions. Such a solution allows deblurring of motion in all directions. Yet, it requires a more complex setup with acquisition of two consecutive images or the use of multiple lenses. Note that both of these methods are designed to restore motion blur caused by movement of objects, and have limited ability to mitigate blur originating from camera shake.

B. Proposed Solution

In this work, a computational imaging approach for multi-directional motion deblurring from a single image is introduced. The innovation enabling such a process is the encoding scheme, which embeds in the intermediate image dynamic cues for both the motion trajectory and its extent (see Fig. 1). These cues serve as a strong prior for both shift-variant PSF estimation and the deblurring operation.

Our encoding is achieved by performing spatiotemporal phase coding in the lens aperture plane during the image acquisition. The PSF of the coded system induces a specific chromatic-temporal coupling, which (unlike in a conventional camera) results in a color-varying spatial blur (see Fig. 2). Such a PSF encodes the different motion trajectory of each object. As we show in a Fourier analysis, the encoding is performed in the phase domain. A convolutional neural network (CNN) is trained to analyze the embedded cues and use them to reconstruct a deblurred image.


Fig. 2. Motion-blurred PSF simulation: left, conventional camera; middle, gradual focus variation in conventional camera; right, the proposed camera—gradual focus variation with phase aperture coding.


The encoding design is performed for a general case, without assumptions on the motion blur profile. Therefore, it can encode blur caused by objects moving in different directions and at different velocities, as well as by camera shake. Such encoding allows blind deblurring of the intermediate image, since the required prior information for both PSF estimation and image restoration is embedded in it. The deblurring CNN is trained to implicitly estimate the spatially varying PSF using the encoded cues and, thus, reconstruct a sharp image. An experimental setup based on a conventional camera performing the designed spatiotemporal coding is presented. We demonstrate its motion deblurring performance in the presence of both uniform and non-uniform multi-directional motion, where it has its greatest advantage. We also show its generalization ability: a method tuned in simulation using synthetic data provides good performance in a real-world experiment with our designed prototype.

The rest of the paper is organized as follows. Section 2 presents the proposed spatiotemporal aperture coding and the corresponding post-processing model. Section 3 demonstrates the advantage of the proposed solution in both simulation and real-world experiments. Section 4 discusses the trade-offs and concludes the work.

2. SPATIOTEMPORAL APERTURE CODING BASED DEBLURRING

In order to achieve blind deblurring of motion-blurred images, the blur kernel has to be estimated and thereafter inverted (even if both of these operations are performed jointly). In a general scene, objects move in different directions and at various velocities, making the blur kernel shift-dependent. Therefore, linear shift-invariant deconvolution operations cannot be used. Yet, one may encode cues in the acquired image to mitigate some of these hurdles. To this end, we aim at encoding the intermediate image with enough information to allow both estimating and inverting the spatially varying PSF of the acquired image, such that improved motion deblurring of a general scene is achieved.

A. Spatiotemporally Coded PSF Design

The design goal of such a PSF is to encode the object trajectory during image acquisition. To achieve this task, the PSF has to vary along the trajectory in a way that provides cues for both the motion direction and extent. One may suggest spatial variations of the PSF along the motion trajectory (i.e., during the exposure time); however, such a variation introduces a spatial blur. Since trading motion blur for spatial blur is not desired, the PSF variation has to take place in another dimension.

In our proposed design, the motion variations are projected onto the color space. By generating a PSF whose color changes during the image exposure, the motion (both direction and extent) is encoded in the intermediate image as a colored trace. Generally, color coding requires color filtering, which results in loss of light and requires some mechanism for filter replacement (either mechanical or electronic); neither of these is desirable. Therefore, to achieve motion-color coding, a phase-mask is used. In various works [12–14,22,38–40], phase-masks incorporated in the lens aperture plane are used for PSF engineering. The light-throughput advantage of phase coding over amplitude aperture coding is significant (in many amplitude-coding based systems the light throughput is reduced by ${\sim}50\%$; see, for example, [11,27,41]).

In several previous works [12–14,22], phase-masks formed of several concentric rings are used for PSF engineering. The phase-mask function ${\rm PM}({\textbf{r}},{\phi _{{\rm ring}}})$ can be expressed as (assuming a single phase-ring and polar coordinates)

$${\rm PM}({\textbf{r}},{\phi _{{\rm ring}}}) = \begin{cases} \exp \{j{\phi _{{\rm ring}}}\} & {r_1} \lt \rho \lt {r_2} \\ 1 & {\rm otherwise} \end{cases},$$
where ${\textbf{r}} = [{r_1},{r_2}]$ represents the normalized coordinates of the ring location, and ${\phi _{{\rm ring}}}$ is the phase-shift introduced by a phase-ring given by
$${\phi _{{\rm ring}}} = \frac{{2\pi}}{\lambda}[n - 1]{h_{{\rm ring}}},$$
where $\lambda$ is the illumination wavelength, ${h_{{\rm ring}}}$ is the ring height, and $n$ is the refractive index of the mask substrate (this example can be easily extended to a multiple rings pattern). As ${\phi _{{\rm ring}}} \propto \frac{1}{\lambda}$, incorporating such mask in the aperture plane can introduce a predesigned and controlled axial chromatic aberration, which engineers the PSF to have a joint defocus-color dependency.
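To make the mask construction concrete, the following is a minimal numerical sketch (not the authors' code) of Eqs. (1) and (2): a complex pupil-plane transmission with constant-phase rings, using the two-ring parameters quoted later in this section (r = [0.55, 0.8, 0.8, 1], phase shifts of 6.2 and 12.3 rad at 455 nm). The grid size and function name are illustrative assumptions; the $1/\lambda$ rescaling of the ring phases follows Eq. (2).

```python
# A minimal sketch (not the authors' code) of the ring phase-mask of Eqs. (1)-(2).
import numpy as np

def ring_phase_mask(n_px=512, wavelength=455e-9,
                    rings=((0.55, 0.8, 6.2), (0.8, 1.0, 12.3)),  # (r1, r2, phase [rad] at 455 nm)
                    ref_wavelength=455e-9):
    """Complex pupil-plane transmission PM(r, phi_ring) on a normalized [-1, 1]^2 grid."""
    y, x = np.mgrid[-1:1:1j * n_px, -1:1:1j * n_px]
    rho = np.hypot(x, y)                                  # normalized radial coordinate
    pm = np.ones((n_px, n_px), dtype=complex)
    for r1, r2, phi_ref in rings:
        phi = phi_ref * ref_wavelength / wavelength       # phi_ring scales as 1/lambda, Eq. (2)
        pm[(rho > r1) & (rho < r2)] = np.exp(1j * phi)    # Eq. (1): exp(j*phi) inside the ring, 1 elsewhere
    return pm
```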

In previous studies, such joint dependency was utilized for both EDOF [13,14] and depth estimation [22] by focusing the lens to a specific plane in a scene containing objects located at various depths. In such a configuration, each object is blurred by a different blur kernel, according to its defocus condition $\psi$, defined as

$$\begin{split}\psi &= \frac{{\pi\! {R^2}}}{\lambda}\left({\frac{1}{{{z_{\rm{o}}}}} + \frac{1}{{{z_{{\rm{img}}}}}} - \frac{1}{f}} \right)\\& = \frac{{\pi\! {R^2}}}{\lambda}\left({\frac{1}{{{z_{{\rm{img}}}}}} - \frac{1}{{{z_{\rm{i}}}}}} \right) = \frac{{\pi\! {R^2}}}{\lambda}\left({\frac{1}{{{z_{\rm{o}}}}} - \frac{1}{{{z_{\rm{n}}}}}} \right),\end{split}$$
where ${z_{{\rm{img}}}}$ is the sensor plane location for an object in the nominal position (${z_{\rm{n}}}$), ${z_{\rm{i}}}$ is the ideal image plane for an object located at ${z_{\rm{o}}}$, $f$ and $R$ are the imaging system focal length and exit pupil radius, and $\lambda$ is the illumination wavelength. The $\psi$ parameter indicates the maximum of the quadratic phase error (due to defocus) in the pupil function [42], so the pupil function ${P_{\rm PM,OOF}}$ of a lens with both defocus error and phase-mask is
$${P_{\rm PM,OOF}} = P(\rho ,\theta)PM({\textbf{r}},{\phi _{{\rm ring}}})\exp \{j\psi {\rho ^2}\} ,$$
where $P(\rho ,\theta)$ is the in-focus pupil function. The PSF is calculated using the pupil function by the relation [42]
$${\rm PSF} = \left|{\cal F}\{{P_{{\rm PM,OOF}}}\}\right|^2.$$
This color-depth encoding of the blur kernels allows high-quality EDOF (which, in the general case, requires blind shift-variant deconvolution) and single-image monocular depth estimation.
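As an illustration of Eqs. (4) and (5), the sketch below (not the authors' code) computes the defocused PSF of a circular pupil carrying an arbitrary phase-mask; combined with the ring_phase_mask() sketch above, it reproduces the color-defocus coupling qualitatively. The sampling grid and normalization are assumptions.

```python
# A minimal sketch (not the authors' code) of Eqs. (4)-(5): PSF of a clear circular
# pupil multiplied by a phase-mask and a quadratic defocus term psi*rho^2.
import numpy as np

def defocused_psf(psi, phase_mask):
    """Return |F{ P(rho,theta) * PM * exp(j*psi*rho^2) }|^2, normalized to unit energy."""
    n_px = phase_mask.shape[0]
    y, x = np.mgrid[-1:1:1j * n_px, -1:1:1j * n_px]
    rho = np.hypot(x, y)
    aperture = (rho <= 1.0).astype(float)                  # in-focus pupil P(rho, theta)
    pupil = aperture * phase_mask * np.exp(1j * psi * rho**2)          # Eq. (4)
    psf = np.abs(np.fft.fftshift(np.fft.fft2(np.fft.ifftshift(pupil))))**2   # Eq. (5)
    return psf / psf.sum()

# Example (using the sketch above): in-focus vs. strongly defocused coded PSF at 455 nm
# psf_0 = defocused_psf(0.0, ring_phase_mask(wavelength=455e-9))
# psf_8 = defocused_psf(8.0, ring_phase_mask(wavelength=455e-9))
```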

In [13,14], this predesigned chromatic aberration serves as a cue to estimate the local depth-dependent PSF. Thereafter, deblurring can be done using the sharp color channel (i.e., the color in which the PSF is narrow) that “carries” the image information (since in most natural images objects have some color content in all channels, and purely monochromatic objects are rare). Therefore, the color-dependent blur is mitigated by transferring resolution from channel to channel. In the proposed case of motion blur encoding, the color cues are designed to indicate the motion trajectory for shift-dependent PSF estimation, and thereafter the PSF information is used for deblurring. Since a strong defocus-color dependency is desired, a phase-mask with two rings (similar to the one presented in [22]) is used, with ${\textbf{r}} = [0.55,0.8,0.8,1]$ and ${\phi _{{\rm ring}}} = [6.2,12.3]\,\,{\rm rad}$ (measured for $\lambda = 455\,\,{\rm nm}$).

An infinite-conjugate imaging setting (which is widespread in various applications, e.g., security cameras and smartphone cameras) is assumed. With the color-defocus phase-mask added, different focus/defocus settings modulate the PSF to be “colored” (a narrow PSF in a certain color band and a wide one in the other bands, not chromatic filtering). Therefore, by gradually changing the focus setting during the exposure, the desired spatiotemporal dependency is achieved: the “color” of the PSF (i.e., the ratio between the PSF widths in the different color channels) varies during the exposure, and as every object moves, its motion trace is blurred differently (in the chromatic dimension) along the trajectory. Following Eq. (3), when the lens is focused properly, $\psi = 0$. If a focus variation is introduced, then $\psi$ changes, and the PSF is modulated accordingly. The $\psi$ variation domain is a design parameter, with a trade-off between motion extent and sensitivity. We concentrate on the domain $0 \lt \psi \lt 8$ (calculated for $\lambda = 455\,\,{\rm nm}$), since in this range the mask provides the strongest chromatic separation. For an exposure time of ${T_{{\exp}}}$, the $\psi$ variation is set to $\psi (t) = \frac{8}{{{T_{{\exp}}}}}t$, and the spatiotemporally coded pupil function is

$${P_{{\rm coded}}} = P(\rho ,\theta){\rm PM}({\textbf{r}},{\phi _{{\rm ring}}})\exp \{j\psi (t){\rho ^2}\} ,$$
with notation similar to Eq. (4). The static phase-mask introduces a color-defocus coupling; this coupling generates motion cues through the composition of the dynamic focus setting $\psi (t)$ with the object’s movements. The proposed encoding is illustrated in Fig. 2. The left of Fig. 2 presents the blur of a horizontally moving point source as captured by a conventional camera, which results in a blurred white line. If the gradual focus variation $\psi (t)$ is applied to a clear-aperture lens during exposure (middle of Fig. 2), the PSF gets wider in all the colors simultaneously and, thus, introduces a considerable spatial blur in the last parts of the motion. However, if the same focus variation is applied to a lens equipped with a ring phase-mask (right of Fig. 2), the PSF colors change along the motion line, from blue through green to red, thus encoding both the motion extent and velocity. Such color encoding can cue motion of objects, camera shake during the exposure, or a combination of both.
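The sketch below (not the authors' simulation code) ties Eq. (6) to Fig. 2 by accumulating the coded PSF of a horizontally moving point source over the exposure, reusing the two sketches above. The channel wavelengths, blur extent, number of temporal samples, and the assumption that $\psi$ scales as $1/\lambda$ for a fixed geometric defocus (per Eq. (3)) are all illustrative choices.

```python
# A minimal sketch (not the authors' code) of the spatiotemporally coded motion PSF of
# Eq. (6): the point source drifts horizontally while psi(t) ramps from 0 to 8.
import numpy as np

def coded_motion_psf(blur_px=40, n_t=32, n_px=256,
                     wavelengths=(610e-9, 530e-9, 455e-9)):      # R, G, B (illustrative)
    trace = np.zeros((n_px, n_px, 3))
    for k in range(n_t):
        t_frac = k / (n_t - 1)                                   # t / T_exp in [0, 1]
        shift = int(round(t_frac * blur_px))                     # horizontal displacement of the point
        for c, lam in enumerate(wavelengths):
            psi = 8.0 * t_frac * 455e-9 / lam                    # psi(t) = 8 t / T_exp, defined at 455 nm
            psf = defocused_psf(psi, ring_phase_mask(n_px=n_px, wavelength=lam))
            trace[:, :, c] += np.roll(psf, shift, axis=1)        # place the PSF along the trajectory
    return trace / trace.max()                                   # blue -> green -> red colored trace
```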

Fig. 3. Simulation of the different coding methods: the imaging is performed on single pixel dots to simulate point sources (for visualization purposes, dilation and gamma correction are applied). (a) First frame (arrows indicate dots path and velocity), (b) last frame, (c) conventional static camera, (d) fluttered shutter camera [27], (e) parabolic motion camera [29], and (f) our proposed camera.


To further illustrate the motion encoding ability of our method, we simulate imaging of moving point sources using our method and compare it to a conventional camera, the fluttered shutter camera [27], and the parabolic motion camera [29]. Figure 3 presents the PSF encoding performed by the different methods (this is an extension of a similar comparison shown in Fig. 3 of [29]).

The original scene is formed of two sets of point sources arranged in two orthogonal lines. While the dot at the intersection stays in place, all the other dots move at different velocities, as illustrated by the arrows in Fig. 3(a). Imaging simulation of this scene is performed using the four methods.

Using a conventional camera, the stationary dot stays “as is,” and all the other dots are blurred according to their motion trajectories. Using the fluttered shutter camera, parts of the dots’ traces are blocked, and the code can be clearly seen. As suggested in [27], such a code generates an easy-to-invert PSF, assuming the motion direction and extent are known. Indeed, some PSF estimation can be done for blind deblurring, but an inherent invertibility/estimation trade-off exists, as discussed in [28]. In addition, the light-throughput loss caused by the fluttered shutter is clearly seen (for the code proposed in [27] and used here, the loss is 50%).

Using the parabolic motion camera (with parabolic motion in the horizontal direction), the PSF is roughly motion invariant in the direction of the sensor motion, as clearly seen in all horizontal dots. Yet, in any other direction, and most significantly in the orthogonal one (in this case, vertical), the motions of the dots (linear) and the sensor (parabolic) are composed, making the PSFs highly motion variant.

In the proposed joint phase-mask and focus variation coding, each PSF is colored according to the different motion trajectory. The direction is encoded by the blue–green–red transition, and the extent of this transition indicates the velocity of the motion.

1. Spectral Analysis

To analyze the motion encoding ability of our scheme, a spectral analysis of the PSF is carried out using the spatiotemporal Fourier analysis model proposed in [29]. In this model, a single spatial dimension is examined versus the temporal dimension, and a 2D Fourier transform (FT) is applied to the $(x,t)$ slice of the full $(x,y,t)$ space. In such a setting, different velocities of a point source form lines at different angles in the $(x,t)$ plane. The analysis in [29] included only the spectrum amplitude, but in our case we also include the phase, since our encoding is phase dependent. We compare our method with a conventional camera in Fig. 4 (a full analysis including the fluttered shutter and parabolic motion cameras appears in Supplement 1). For the conventional static camera, the $(x,t)$ slice of the PSF has a Sinc-function spectrum amplitude, which allows good reconstruction of an object at this velocity [represented by the angle of the $(x,t)$ PSF]. Since the PSF is “gray” (i.e., has no chromatic shift along its trajectory), its spectrum phase is also gray. This “gray phase” feature is common also to the fluttered shutter and parabolic motion cameras.


Fig. 4. PSF spectral analysis. PSFs and the corresponding spectra of a (top) static camera and (bottom) our method. (a) $(x,t)$ slice of PSF and its (b) amplitude and (c) phase in Fourier domain.


Our proposed PSF can be considered as an infinite sequence of smaller PSFs, each one of a different color. Since all of these PSFs have a similar spatial shape, but each has a different color and a different location in the $(x,t)$ plane, the spectrum amplitude is “white” and similar to the spectrum amplitude of the conventional PSF. Yet, the phase (which holds the shift information) is colored, according to the shift (i.e., spatiotemporal location) of each color. Our spatiotemporal chromatic coupling can thus be considered as a utilization of the spectrum phase as a degree of freedom for the coding. The color variations in the phase indicate the coupling between the color and the trajectory, as can be seen in Fig. 4 (for a similar analysis of additional cases, see Supplement 1).
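A toy numerical version of this analysis (not the authors' simulation): build an $(x,t)$ slice of a colored line whose hue drifts from blue to red along the trajectory, take its 2D FFT per channel, and inspect the amplitude and phase. The specific toy slice and its color ramp are assumptions; it only mimics the qualitative behavior seen in Fig. 4.

```python
# A minimal sketch of the (x, t) spectral analysis: the amplitude is roughly "white",
# while the phase differs between channels because each color sits at a different shift.
import numpy as np

def xt_spectrum(n=128, velocity_px=0.5):
    slice_xt = np.zeros((n, n, 3))                         # axes: (t, x, color)
    for t in range(n):
        x = int(round(t * velocity_px)) % n                # constant-velocity trajectory
        w = t / (n - 1)                                    # 0 -> blue, 1 -> red
        slice_xt[t, x] = [w, 1.0 - abs(2.0 * w - 1.0), 1.0 - w]
    spec = np.fft.fftshift(np.fft.fft2(slice_xt, axes=(0, 1)), axes=(0, 1))
    return np.abs(spec), np.angle(spec)                    # per-channel amplitude and phase
```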

B. Color-Coded Motion Deblurring Neural Network

The dynamic phase aperture coding generates color variations in the spatiotemporal blur kernel. These chromatic cues encode the motion trajectories in all directions. They serve as prior information for shift-variant PSF estimation, which enables effective non-homogeneous motion deblurring. Traditionally, spatially varying deblurring is performed in two stages: PSF estimation for the different objects/segments, and then deblurring of each of them. As presented in [14,43], this task can be solved using a single CNN, trained with a dataset containing the various possibilities of the shift-variant blur. One may treat the CNN operation as an end-to-end process that extracts the cues that allow the PSF estimation and then utilizes the acquired PSF information for image deblurring.

1. Training Data

To train such a CNN for our motion deblurring process, images containing moving objects blurred with our spatiotemporally varying blur kernel (and their corresponding sharp images) are required. Experimentally acquiring a motion-blurred image and its pixel-wise accurate sharp counterpart is very complex (even without the dynamic aperture coding). Therefore, an imaging simulation is used. Using the GoPro dataset [43], which contains high frame-rate videos of various scenes, images with the motion-color-coded blur were generated by blurring consecutive frames with the coded kernel and then summing them up. The GoPro dataset contains various dynamic scenes captured with a handheld camera; therefore, both moving objects and camera shake exist in it. Sequences of nine frames are used, and a dataset containing 2,500 images is created; 80% of it is used for training, and the rest is used for validation (the original test set is used for testing). Since our deblurring process is based on local cues encoded by our spatiotemporal kernel, and not on the image statistics, a CNN trained on this synthetic data generalizes well to real-world images (as shown hereafter).
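A minimal sketch (not the authors' data-generation code) of this procedure: each sharp frame in a short sequence is convolved with the coded kernel of its time instant, and the results are averaged to emulate the temporal integration over the exposure. The choice of the middle frame as the sharp target, the plain averaging, and the omission of the gamma handling are assumptions.

```python
import numpy as np
from scipy.signal import fftconvolve

def simulate_coded_capture(frames, kernels):
    """frames: list of HxWx3 sharp frames in [0, 1]; kernels: matching per-frame KxKx3 coded PSFs."""
    acc = np.zeros(frames[0].shape, dtype=np.float64)
    for frame, kernel in zip(frames, kernels):
        for c in range(3):                                 # per-channel convolution with the coded PSF
            acc[..., c] += fftconvolve(frame[..., c], kernel[..., c], mode="same")
    coded = np.clip(acc / len(frames), 0.0, 1.0)           # temporal integration over the exposure
    target = frames[len(frames) // 2]                      # sharp reference (middle frame; an assumption)
    return coded, target
```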

2. Deblurring Network Architecture

Since image restoration is sought, a fully convolutional network (FCN) architecture is considered. As shown in the work of Nah et al. [43], multiscale processing is an efficient tool to grasp the structure of motion-blurred objects. Therefore, the network architecture we use is based on the popular U-Net structure [44], as it is one of the leading multiscale FCN architectures. A skip connection is added between the output and the input, leaving the “U” structure to estimate the residual correction for the input image. Empirically, this simplifies the convergence (the full structure and details of the network are presented in Supplement 1).
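The sketch below (not the authors' architecture; the actual channel counts, depth, and layer details are given in Supplement 1) shows the structural idea: a small U-Net whose output is added to the input, so the multiscale “U” only has to estimate the residual correction of the coded image.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

class ResidualUNet(nn.Module):
    """Toy two-level U-Net with an input-to-output residual skip."""
    def __init__(self, ch=(32, 64, 128)):
        super().__init__()
        self.enc1, self.enc2 = conv_block(3, ch[0]), conv_block(ch[0], ch[1])
        self.bottleneck = conv_block(ch[1], ch[2])
        self.pool = nn.MaxPool2d(2)
        self.up2, self.dec2 = nn.ConvTranspose2d(ch[2], ch[1], 2, stride=2), conv_block(2 * ch[1], ch[1])
        self.up1, self.dec1 = nn.ConvTranspose2d(ch[1], ch[0], 2, stride=2), conv_block(2 * ch[0], ch[0])
        self.out = nn.Conv2d(ch[0], 3, 1)

    def forward(self, x):
        e1 = self.enc1(x)                                  # full resolution
        e2 = self.enc2(self.pool(e1))                      # 1/2 resolution
        b = self.bottleneck(self.pool(e2))                 # 1/4 resolution
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return x + self.out(d1)                            # residual skip: input + estimated correction
```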

The U-Net architecture is trained using patches of size ${{128}} \times {{128}}$ taken from the dataset described above. Since the final goal is to present the performance on images taken with a real camera, noise augmentation is used, with similar noise to the one observed in real images taken using the target camera [additive white Gaussian noise (AWGN) with $\sigma = 9$]. The network is trained using the Huber loss [45], and the average reconstruction results on the test set are peak signal-to-noise ratio (PSNR) = 29.5, structural similarity index measure (SSIM) = 0.93. Examples of the reconstruction performance achieved on images from the test set in different cases are presented in Supplement 1.
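A minimal sketch of one training step under the settings described above (AWGN augmentation with $\sigma = 9$ on a [0, 255] scale and the Huber loss), using the ResidualUNet sketch; the optimizer and learning rate are assumptions, not the authors' choices.

```python
import torch

model = ResidualUNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # optimizer choice is an assumption
criterion = torch.nn.HuberLoss()

def train_step(coded_patch, sharp_patch):
    """coded_patch, sharp_patch: (N, 3, 128, 128) tensors in [0, 1]."""
    noisy = coded_patch + torch.randn_like(coded_patch) * (9.0 / 255.0)   # AWGN augmentation, sigma = 9
    optimizer.zero_grad()
    loss = criterion(model(noisy), sharp_patch)
    loss.backward()
    optimizer.step()
    return loss.item()
```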

To quantify the benefit of the proposed PSF encoding, we generated a version of the same dataset without our spatiotemporal coding and trained the same architecture using it. In this case, significant over-fitting occurred, resulting in poor results on the test set (${\rm PSNR} = 24.6,\,{\rm SSIM} = 0.84$).

In another test, we evaluated a second network structure, similar to the one presented in [14]. Consecutive Conv-BN-ReLU blocks (no pooling) with a direct skip connection from the input to the output are used. Such an architecture also learns the residual correction of the image as the deblurring operation, but only at the original scale (additional details of this model and its test appear in Supplement 1). Inferior performance is achieved with this network structure (${\rm PSNR} = 27.5,{\rm SSIM} = 0.9$), as multiscale information is important for this task. However, this architecture is much shallower and contains just 2% of the weights of the full U-Net model, and it still achieves results comparable to the model of [43] (see the comparison in Section 3). In addition, the encoded cues also benefit the processing time; for a reference image of $1280 \times 720$ pixels, the proposed U-Net and shallow deblurring models require $174\;{\rm ms}$ and $73\;{\rm ms}$, respectively, while for the same image size the processing of Nah et al. [43] requires $330\;{\rm ms}$ (all timings were performed on an NVIDIA RTX2080Ti GPU). The spatiotemporally coded acquisition thus provides strong guidance for the deblurring operation, as it enables both improved performance and faster processing.
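For completeness, a minimal sketch (not the authors' model; Supplement 1 gives the actual configuration) of this shallower single-scale alternative: a stack of Conv-BN-ReLU blocks without pooling, plus a direct input-to-output skip. Depth and width below are assumptions.

```python
import torch
import torch.nn as nn

class ShallowResidualDeblur(nn.Module):
    """Single-scale Conv-BN-ReLU stack that learns a residual correction of the input."""
    def __init__(self, n_blocks=8, width=32):
        super().__init__()
        layers = [nn.Conv2d(3, width, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(n_blocks):
            layers += [nn.Conv2d(width, width, 3, padding=1),
                       nn.BatchNorm2d(width), nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(width, 3, 3, padding=1)]
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return x + self.body(x)          # direct skip connection from input to output
```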


Fig. 5. Simulation results of rotating target: (a) rotating target and the reconstruction results for (b) fluttered shutter, (c) parabolic motion camera, and (d) our method.


3. EXPERIMENTS

We start by evaluating our proposed method in simulation. Two different comparisons are presented. The first is to other computational imaging methods: the fluttered shutter camera [27] and the parabolic motion camera [29], demonstrating the advantages of our dynamic aperture phase coding over other coding methods. The second comparison is to the deblurring CNN presented by Nah et al. [43], which is designed for conventional cameras. Such a comparison illustrates the benefits of coded acquisition. Following that, we present real-world results acquired using a prototype of the spatiotemporally coded camera.

A. Comparison to Other Coding Methods

In order to demonstrate our PSF estimation ability in the motion deblurring process versus the motion-direction sensitivity of the other methods, a scene with a rotating spoke resolution target is simulated. Such a scene simultaneously contains motion in all directions and at various velocities (according to the distance from the center of the spoke target).

The synthetic scene serves as an input to the imaging simulation for the three different methods (fluttered shutter, parabolic motion, and our method). The fluttered shutter code being used (in both the imaging and reconstruction) is for motion to the right, matching the extent of the linear motion of the outer parts of the spoke target. The parabolic motion takes place in the horizontal direction. Each imaging result is corrupted by AWGN with $\sigma = 3$ to simulate a real imaging scenario in good lighting conditions (since the fluttered shutter coding blocks 50% of the light throughput, the noise level of its image is practically doubled). Figure 5 presents the deblurring results of the three different techniques (full details on the simulation and deblurring process of the different methods are presented in Supplement 1). Since our deblurring is CNN-based and the processing of the reference coding methods is (originally) based on classical image restoration techniques, a straightforward comparison would not be fair. To compensate for this gap, in addition to the original processing, we tested the reference methods using a recent CNN-based non-blind deblurring method [46], which achieved improved results.

The fluttered-shutter-based reconstruction restores the general form of the area with the corresponding motion coding (outer lower part, moving right), and of some of the opposite direction (outer upper part, moving left), and fails in all other directions/velocities. This can be partially solved using a different coding that allows both PSF estimation and inversion. Yet, such a scheme introduces an estimation-invertibility trade-off. Moreover, a rotating target is a challenging case for shift-variant PSF estimation, and if a restoration with an incorrect PSF is performed, it leads to poor results (as can be seen in Fig. 5). In addition, the increased noise sensitivity of this approach is apparent, as it blocks 50% of the light throughput.

The parabolic motion method achieves good reconstruction for the horizontal motion (both left and right), as can be seen in the upper and lower parts of the spoke (which move horizontally). Yet, notice that its performance is not the same for left and right motion (as any practical finite parabolic motion cannot generate a truly motion-invariant PSF). Also, both vertical motions are not coded properly and, therefore, are not reconstructed well. Using our method, motion in all directions can be estimated, which allows a shift-variant blind deblurring of the scene.

B. Comparison to CNN-Based Blind Deblurring

To analyze the advantage of the motion cue coding, our method is compared to the multiscale motion deblurring CNN presented by Nah et al. [43]. The test set of the GoPro dataset is used as the input. Since Nah et al. trained their model on sequences of 7–13 frames, similar scenes were created using both our coding method and simple frame summation (as used in [43], with the proper gamma-related transformations). Note that in our case, a spatial (diffraction-related) blur is added on top of the motion blur, so our model handles a more challenging task.

The reconstruction results are compared for several noise levels: $\sigma = [0, 3]$ on a $[0,255]$ scale (the reference method was trained with $\sigma = 2$). The measures for each motion length are averaged over the different noise levels, and the results are displayed in Table 1. As can be clearly seen, our method provides an advantage in the recovery error over the method of Nah et al. [43] in both PSNR and SSIM (visual reconstruction results are presented in Supplement 1). For small motion lengths, both methods provide visually pleasing restorations (though our method is more accurate in terms of PSNR/SSIM). Yet, as the motion length increases, our improvement becomes more significant. This can be explained by the fact that the architecture used in [43] is trained using an adversarial loss, and therefore inherent data hallucination occurs in its reconstruction process. As the motion length gets larger, such data hallucination is less accurate, and therefore the reduction in their PSNR/SSIM performance is more significant. Our method employs the encoded motion cues for the reconstruction, therefore providing more accurate results (i.e., our method is designed for lower distortion in the perception-distortion trade-off [47]).


Table 1. Quantitative Comparison to Blind Deblurring: PSNR/SSIM Comparison between the Method Presented in [43] and Our Method, for Various Lengths of Motion (${N_{{\rm frames}}}$)

Note also that although our model is trained only on images generated using sequences of nine frames, its deblurring performance for shorter/longer sequences is superior to that of [43], which is trained on both shorter and longer sequences (7–13 frames). This clearly shows that our model has learned to extract the color-motion cues and utilize them for the image deblurring task, beyond the specific motion extent present in the training data.


Fig. 6. Table-top experimental setup: the liquid-lens and phase-mask are incorporated in the C-mount lens. The micro-controller synchronizes the focus variation to the frame exposure using the camera flash signal.



Fig. 7. Experimental validation of PSF coding: a moving white LED captured with our camera validates the required PSF encoding.


In addition, in our dataset, an additional diffraction-related spatial blur is added (as described above), so if a similar spatial blur were added to the original GoPro dataset (without the motion-color cues), our advantage over [43] would be expected to be even larger. Note also that our method is more robust to the level of noise in the image. The results presented here are limited to the range $\sigma = [0,3]$ to allow a fair comparison to [43], which is trained for the level $\sigma = 2$. In Supplement 1, we present the results per noise level and additional results for higher noise levels.

C. Table-Top Experiment

Following the simulation results, a real-world setup is built (see Fig. 6). A C-mount lens with $f = 12\;{\rm mm}$ is mounted on an 18MP camera with a pixel size of $1.25\,\,\unicode{x00B5}{\rm m}$. A phase-mask similar to the one used in [22] and a liquid focusing lens are incorporated in the aperture plane of the main lens, and the composite lens has $F\# = 6.5$. A signal from the camera indicating the start of the exposure (originally designed for flash activation) is used to trigger the liquid lens to perform the focus variation (a detailed description of the experimental setup is presented in Supplement 1). The liquid lens is calibrated to introduce a focus variation equivalent to $\psi(t) = \frac{8}{T_{\rm exp}}t$ during the exposure, as presented in Section 2.
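As a rough back-of-the-envelope check (not the authors' calibration procedure), Eq. (3) relates the target $\psi(t)$ ramp to the extra optical power the liquid lens must supply: adding power $\Delta\Phi$ at the aperture shifts the $1/f$ term, so $|\psi| \approx (\pi R^2/\lambda)|\Delta\Phi|$. The sign convention, the thin-lens approximation, and the use of half the physical aperture diameter as $R$ are all assumptions.

```python
import numpy as np

f_mm, f_number, wavelength = 12.0, 6.5, 455e-9
R = 0.5 * (f_mm / f_number) * 1e-3              # aperture radius in meters (~0.92 mm)

def liquid_lens_power(t, t_exp, psi_max=8.0):
    """Approximate added optical power (diopters) for psi(t) = psi_max * t / t_exp."""
    psi_t = psi_max * t / t_exp
    return psi_t * wavelength / (np.pi * R**2)

# At the end of a 0.5 s exposure this gives roughly 1.4 diopters of added power.
print(liquid_lens_power(0.5, 0.5))
```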

The first experiment validates the desired spatiotemporal PSF encoding. Two white LEDs are mounted on a spinning wheel and act as point sources, similar to the point sources simulated in Fig. 3. A motion-blurred image of the spinning LEDs is acquired, with the phase-mask incorporated in the lens and the proper focus variation applied during exposure. A zoom-in on one of the LEDs is presented in Fig. 7. The gradual color changes along the motion trajectory are clearly visible (the full image including both LEDs, and an additional camera shake PSF example, are presented in Supplement 1).


Fig. 8. Rotating image experiment: reconstruction results of (top) rotating photo and (bottom) zoom-ins, using (a) our method and (b) Nah et al. reconstruction [43].



Fig. 9. Train experiment: recovery results of a moving train using (a) our method and (b) Nah et al. reconstruction [43].



Following the PSF validation experiment, a deblurring experiment on moving objects is carried out. In order to examine various motion directions and velocities at once, an image of a rotating photo is captured; the rotation angular velocity is $\omega = 6\,\,{\rm deg}/{\rm s}$, and an exposure of ${T_{{\exp}}} = 0.5\;{\rm s}$ results in a blur of up to 55 pixels in the image plane. The coded image is then processed using the CNN described in Section 2 (note that the network is trained purely on simulated images, and no fine-tuning to the experimental PSF was carried out). For reference, an image of the same rotating object is captured with a conventional camera (i.e., the same camera with a fixed focus and without the phase-mask), and then deblurring is applied using the multiscale motion deblurring CNN of Nah et al. [43]. Results for the rotating photo are presented in Fig. 8 (the full results with the intermediate images are presented in Supplement 1). In addition to the rotating target test, a linearly moving object (a toy train) is also captured. The train moves at an angular speed of $v = 3\,\,{\rm deg}/{\rm s}$, and since it contains much less texture compared to the rotating photo, it was positioned to produce a blur trace of 150 pixels for an exposure time of ${T_{{\exp}}} = 0.5\;{\rm s}$. The results are presented in Fig. 9. As can be clearly seen, our camera provides much better results in both cases. The full results of these experiments, along with a camera shake deblurring example and outdoor scenes, are provided in Supplement 1.

4. CONCLUSION

A computational imaging approach for blind motion deblurring is presented. The method is based on spatiotemporal phase coding of the lens aperture, to achieve a multi-directional motion variant PSF. The phase coding is achieved using two components: (i) the static/spatial part—a phase-mask designed to code the PSF to have a joint color-defocus dependency; and (ii) the dynamic/temporal part—a gradual variation of the focus setting performed during the image exposure. Jointly, these coding mechanisms achieve a motion variant PSF, exhibited in a gradual color change of the blur along the motion trajectory. Such a PSF encodes cues of the motion extent and velocity in the acquired image. These cues are then utilized in the motion deblurring process, implemented using a CNN model. The CNN operation encapsulates both the PSF estimation and the spatially variant motion deblurring, which allows it to generalize very well across different conditions.

Our approach is compared to blind deblurring methods and computational imaging-based strategies. Its shift-variant PSF estimation ability and its generalization to real-world scenes are analyzed and discussed. Our technique achieves better performance than the other solutions in various scenarios, without imposing a limitation on the motion direction. An experimental setup implementing the proposed method is presented, and the spatiotemporal PSF color encoding is validated in a real-world experiment. In addition, as our encoding provides cues for the entire motion trajectory, our approach holds potential for video-from-motion and temporal super-resolution applications, similar to [48–52].

Funding

H2020 European Research Council (757497).

Acknowledgment

The authors wish to thank Mr. Tal Tayar for his help with various electronics related issues, and NVIDIA for its generous GPU grant. Shay Elmalem is partially supported by The Yitzhak and Chaya Weinstein Research Institute for Signal Processing.

Disclosures

The authors declare no conflicts of interest.

 

See Supplement 1 for supporting content.

REFERENCES

1. B. London, J. Upton, and J. Stone, Photography (Pearson, 2013).

2. S. Lefkimmiatis, “Non-local color image denoising with convolutional neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017).

3. E. Schwartz, R. Giryes, and A. M. Bronstein, “DeepISP: toward learning an end-to-end image processing pipeline,” IEEE Trans. Image Process. 28, 912–923 (2018). [CrossRef]  

4. C. Chen, Q. Chen, J. Xu, and V. Koltun, “Learning to see in the dark,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018), pp. 3291–3300.

5. O. Liba, K. Murthy, Y.-T. Tsai, T. Brooks, T. Xue, N. Karnad, Q. He, J. T. Barron, D. Sharlet, R. Geiss, S. W. Hasinoff, Y. Pritch, and M. Levoy, “Handheld mobile photography in very low light,” ACM Trans. Graphics 38, 1–16 (2019). [CrossRef]  

6. K. Zhang, W. Zuo, S. Gu, and L. Zhang, “Learning deep CNN denoiser prior for image restoration,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017), pp. 3929–3938.

7. W. Lai, J. Huang, Z. Hu, N. Ahuja, and M. Yang, “A comparative study for single image blind deblurring,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016), pp. 1701–1709.

8. J. N. Mait, G. W. Euliss, and R. A. Athale, “Computational imaging,” Adv. Opt. Photon. 10, 409–483 (2018). [CrossRef]

9. G. Barbastathis, A. Ozcan, and G. Situ, “On the use of deep learning for computational imaging,” Optica 6, 921–943 (2019). [CrossRef]  

10. E. R. Dowski and W. T. Cathey, “Extended depth of field through wave-front coding,” Appl. Opt. 34, 1859–1866 (1995). [CrossRef]  

11. A. Levin, R. Fergus, F. Durand, and W. T. Freeman, “Image and depth from a conventional camera with a coded aperture,” ACM Trans. Graphics 26, 70 (2007). [CrossRef]  

12. Z. Zalevsky, A. Shemer, A. Zlotnik, E. B. Eliezer, and E. Marom, “All-optical axial super resolving imaging using a low-frequency binary-phase mask,” Opt. Express 14, 2631–2643 (2006). [CrossRef]  

13. H. Haim, A. Bronstein, and E. Marom, “Computational multi-focus imaging combining sparse model with color dependent phase mask,” Opt. Express 23, 24547–24556 (2015). [CrossRef]

14. S. Elmalem, R. Giryes, and E. Marom, “Learned phase coded aperture for the benefit of depth of field extension,” Opt. Express 26, 15316–15331 (2018). [CrossRef]  

15. V. Sitzmann, S. Diamond, Y. Peng, X. Dun, S. Boyd, W. Heidrich, F. Heide, and G. Wetzstein, “End-to-end optimization of optics and image processing for achromatic extended depth of field and super-resolution imaging,” ACM Trans. Graphics 37, 1–13 (2018). [CrossRef]  

16. M. E. Gehm, R. John, D. J. Brady, R. M. Willett, and T. J. Schulz, “Single-shot compressive spectral imaging with a dual-disperser architecture,” Opt. Express 15, 14013–14027 (2007). [CrossRef]  

17. M. A. Golub, A. Averbuch, M. Nathan, V. A. Zheludev, J. Hauser, S. Gurevitch, R. Malinsky, and A. Kagan, “Compressed sensing snapshot spectral imaging by a regular digital camera with an added optical diffuser,” Appl. Opt. 55, 432–443 (2016). [CrossRef]  

18. M. S. Asif, A. Ayremlou, A. Sankaranarayanan, A. Veeraraghavan, and R. G. Baraniuk, “Flatcam: thin, lensless cameras using coded aperture and computation,” IEEE Trans. Comput. Imaging 3, 384–397 (2017). [CrossRef]  

19. N. Antipa, G. Kuo, R. Heckel, B. Mildenhall, E. Bostan, R. Ng, and L. Waller, “DiffuserCam: lensless single-exposure 3D imaging,” Optica 5, 1–9 (2018). [CrossRef]  

20. V. Boominathan, J. Adams, J. Robinson, and A. Veeraraghavan, “PhlatCam: designed phase-mask based thin lensless camera,” IEEE Trans. Pattern Anal. Mach. Intell. 42, 1618–1629 (2020). [CrossRef]

21. C. Zhou, S. Lin, and S. K. Nayar, “Coded aperture pairs for depth from defocus and defocus deblurring,” Int. J. Comput. Vis. 93, 53–72 (2011). [CrossRef]  

22. H. Haim, S. Elmalem, R. Giryes, A. Bronstein, and E. Marom, “Depth estimation from a single image using deep learned phase coded mask,” IEEE Trans. Comput. Imaging 4, 298–310 (2018). [CrossRef]  

23. Y. Wu, V. Boominathan, H. Chen, A. Sankaranarayanan, and A. Veeraraghavan, “PhaseCam3D—learning phase masks for passive single view depth estimation,” in IEEE International Conference on Computational Photography (ICCP) (2019), pp. 1–12.

24. E. Nehme, L. E. Weiss, T. Michaeli, and Y. Shechtman, “Deep-storm: super-resolution single-molecule microscopy by deep learning,” Optica 5, 458–464 (2018). [CrossRef]  

25. E. Hershko, L. E. Weiss, T. Michaeli, and Y. Shechtman, “Multicolor localization microscopy and point-spread-function engineering by deep learning,” Opt. Express 27, 6158–6183 (2019). [CrossRef]  

26. M. Kellman, E. Bostan, N. A. Repina, and L. Waller, “Physics-based learned design: optimized coded-illumination for quantitative phase imaging,” IEEE Trans. Comput. Imaging 5, 344–353 (2019). [CrossRef]  

27. R. Raskar, A. Agrawal, and J. Tumblin, “Coded exposure photography: motion deblurring using fluttered shutter,” ACM Trans. Graphics 25, 795–804 (2006). [CrossRef]  

28. A. K. Agrawal and Y. Xu, “Coded exposure deblurring: optimized codes for PSF estimation and invertibility,” in IEEE Conference on Computer Vision and Pattern Recognition (2009), pp. 2066–2073.

29. A. Levin, P. Sand, T. S. Cho, F. Durand, and W. T. Freeman, “Motion-invariant photography,” ACM Trans. Graphics 27, 1–9 (2008). [CrossRef]  

30. T. S. Cho, A. Levin, F. Durand, and W. T. Freeman, “Motion blur removal with orthogonal parabolic exposures,” in IEEE International Conference on Computational Photography (ICCP) (2010), pp. 1–8.

31. M. Ben-Ezra and S. K. Nayar, “Motion deblurring using hybrid imaging,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2003), Vol. 1, p. I-I.

32. P. P. Srinivasan, R. Ng, and R. Ramamoorthi, “Light field blind motion deblurring,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017), pp. 3958–3966.

33. M. M. R. Mohan, A. N. Rajagopalan, and G. Seetharaman, “Going unconstrained with rolling shutter deblurring,” in IEEE International Conference on Computer Vision (ICCV) (2017), pp. 4010–4018.

34. N. Antipa, P. Oare, E. Bostan, R. Ng, and L. Waller, “Video from stills: lensless imaging with rolling shutter,” in IEEE International Conference on Computational Photography (ICCP) (2019), pp. 1–8.

35. Y. Tendero, J. Morel, and B. Rougé, “The flutter shutter paradox,” SIAM J. Imaging Sci. 6, 813–847 (2013). [CrossRef]

36. Y. Tendero and S. Osher, “On a mathematical theory of coded exposure,” Res. Math. Sci. 3, 4 (2016). [CrossRef]  

37. H. Jeon, J. Lee, Y. Han, S. J. Kim, and I. S. Kweon, “Multi-image deblurring using complementary sets of fluttering patterns,” IEEE Trans. Image Process. 26, 2311–2326 (2017). [CrossRef]  

38. E. R. Dowski and W. T. Cathey, “Extended depth of field through wave-front coding,” Appl. Opt. 34, 1859–1866 (1995). [CrossRef]  

39. O. Cossairt, C. Zhou, and S. Nayar, “Diffusion coded photography for extended depth of field,” ACM Trans. Graphics 29, 39 (2010). [CrossRef]  

40. H. Nagahara, S. Kuthirummal, C. Zhou, and S. K. Nayar, “Flexible depth of field photography,” in European Conference on Computer Vision (ECCV) (Springer, 2008), pp. 60–73.

41. P. A. Shedligeri, S. Mohan, and K. Mitra, “Data driven coded aperture design for depth recovery,” in 2017 IEEE International Conference on Image Processing (ICIP) (IEEE, 2017), pp. 56–60.

42. J. Goodman, Introduction to Fourier Optics, 2nd ed. (McGraw-Hill, 1996).

43. S. Nah, T. H. Kim, and K. M. Lee, “Deep multi-scale convolutional neural network for dynamic scene deblurring,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017), pp. 3883–3891.

44. O. Ronneberger, P. Fischer, and T. Brox, “U-net: convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention (MICCAI), Vol. 9351 of LNCS (Springer, 2015), pp. 234–241 [available on arXiv:1505.04597 (cs.CV)].

45. P. J. Huber, “Robust estimation of a location parameter,” Ann. Math. Stat. 35, 73–101 (1964). [CrossRef]  

46. T. Tirer and R. Giryes, “Image restoration by iterative denoising and backward projections,” IEEE Trans. Image Process. 28, 1220–1234 (2019). [CrossRef]  

47. Y. Blau and T. Michaeli, “The perception-distortion tradeoff,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018).

48. M. Jin, G. Meishvili, and P. Favaro, “Learning to extract a video sequence from a single motion-blurred image,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018).

49. M. Gupta, T. Mitsunaga, Y. Hitomi, J. Gu, and S. K. Nayar, “Video from a single coded exposure photograph using a learned over-complete dictionary,” in IEEE International Conference on Computer Vision (ICCV) (IEEE Computer Society, 2011), pp. 287–294.

50. D. Liu, J. Gu, Y. Hitomi, M. Gupta, T. Mitsunaga, and S. K. Nayar, “Efficient space-time sampling with pixel-wise coded exposure for high-speed imaging,” IEEE Trans. Pattern Anal. Mach. Intell. 36, 248–260 (2014). [CrossRef]  

51. J. Holloway, A. C. Sankaranarayanan, A. Veeraraghavan, and S. Tambe, “Flutter shutter video camera for compressive sensing of videos,” in IEEE International Conference on Computational Photography (ICCP) (2012), pp. 1–9.

52. P. Llull, X. Liao, X. Yuan, J. Yang, D. Kittle, L. Carin, G. Sapiro, and D. J. Brady, “Coded aperture compressive temporal imaging,” Opt. Express 21, 10526–10545 (2013). [CrossRef]  
