Spatially varying defocus map estimation from a single image based on spatial aliasing sampling method

Peng Yang; Ming Liu; Ming Liu; Liquan Dong; Liquan Dong; Liquan Dong; Lingqin Kong; Lingqin Kong; Lingqin Kong; Yuejin Zhao; Yuejin Zhao; Mei Hui; Mei Hui

doi:10.1364/OE.519059

1. Introduction

Defocus map estimation is a technique employed to quantify the extent of defocus blur at each pixel in an image [1–3]. Defocus blur is a prevalent challenge in photography, particularly in scenarios with a wide lens aperture and extensive scene depth. Natural lighting conditions are characterized by spatially incoherent light, making it crucial to process visual information effectively in various imaging and sensing applications. During imaging, light from a single object point propagates outward in the form of a light cone, with a portion of the light reaching the photosensitive device after traversing the optical system. Only light rays from object points within the depth of field can converge to a single point, producing a clear and sharp image. The light of a single object points outside the depth of field cannot perfectly converge to a photosensitive point and scatter into a circle of confusion (CoC) whose radius is linearly related to the object distance [4]. Thus, in a linear imaging system, the brightness intensity of the object point on the image is weighted and combined with surrounding points, resulting in blur. The degree of blur in the image is measured as the radius of the CoC or the standard deviation of its distribution.

However, reducing the aperture results in decreased incoming light, and the depth range of the scene remains unalterable. Additionally, devices like dual-pixel detectors [5,6] have high manufacturing costs and pose challenges in miniaturization. Therefore, the utilization of image processing technology for calculating the amount of defocus has garnered widespread attention [7–9]. Furthermore, numerous methods [1,10–13] possess the capability to calculate the defocus amount at each pixel position based on a single image without the need for additional calibration equipment. Based on the incoherent characteristics of the light field, the basic physical quantity transmitted and processed by the system is the intensity distribution of the light field. In space-invariant systems, the intensity distribution of the image results from the convolution of the intensity distribution of the ideal image and the point spread function (PSF) [1,14]. In a 3D scene, variations in the depth of each object point result in differences in the defocus amount, indicating the spatial variability of the system’s PSF [13]. Hence, the defocus amount is represented in the form of a pixel-aligned defocus map. It is worth noting that the PSF calculated from the defocus amount is a bridge connecting the sharp image and the defocus image. After calculating the defocus amount at each pixel position, full-pixel autofocus can be achieved. Furthermore, the defocus map can be applied to monocular depth estimation, deblur, image focus editing, etc. Unfortunately, the human eye cannot number the amount of defocus. Consequently, the defocus image can’t be manually labeled, unlike target detection or target classification and semantic segmentation.

To conduct a fair and statistical comparison, Jianping S [15] asked his assistant who has a good understanding of defocus to cross-label the blur regions in each image with a binarization mask. However, the labeling method of this binary mask can only distinguish the in-focus area and the out-of-focus area, and cannot label the value of the defocus amount, and the division of out-of-focus and focus is unavoidable subjectivity. In order to accurately obtain the defocus amount at each pixel position, earlier researchers [11,16] used artificially designed PSFs to blur the all-in-focus image, but the all-in-focus image is difficult to obtain. When re-blurring an image with a certain degree of blur, the variance of the synthetic PSF of the generated blur image is the sum of the squares of the blur kernel and the unknown original PSF. Hence, high-quality all-in-focus images serve as a prerequisite for simulating blurry images.

Simultaneously, specialized equipment is employed to generate defocus datasets. In [5,6], light field cameras and dual-pixel cameras are used for defocus image shooting and defocus amount labeling. The light field camera can obtain 2D images and Calibrated 3D depth information. The dual-pixel camera realizes the recording of light information by arranging mutually independent pairs of photodiodes on the photosensitive device, facilitating the acquisition of defocus amounts. However, this device-limited method can only capture restricted scenes and is inadequate for constructing datasets used in large-scale training.

On limited data, convolutional neural networks (CNN) are still used for defocus map estimation. The lack of samples inevitably causes the model to show poor generalization ability, which is manifested as excellent performance only on a single dataset. Therefore, some domain adaptation methods [10,17] and self-supervised methods [18] are applied to improve the generalization ability of the model. Considering the shortage of existing data, some researchers [1,12,19] applied hand-designed features to defocus map estimation. Blur is manifested as the mixing of pixels in the area. This mixing effect has a more significant impact on the high-frequency information of the image, and it is easier to distinguish the corners of the image or places with rich texture information. Therefore, a method called Edge-based performs edge detection [20,21] through an edge detector, calculates the gradient ratio between the original image and the re-blurred image at each edge position, histogram of higher order derivative, etc. to obtain a sparse defocus map, and then obtains the defocus amount of each pixel by linear interpolation location. Such methods rely on the performance of texture detectors and are not suitable for weakly textured and densely textured regions. So defocus map estimation from a single image, one of the classic inverse problems, remains unsolved.

In this paper, (1) we propose a histogram-invariant spatial aliasing sampling method that can generate large batches of images with realistic scene domains and a dataset with pixel-level defocus amount annotations to solve the problem of missing samples and domain adaptability. Defocus is unrelated to the brightness of the light; but is just a phenomenon of energy diffusion causing light from different positions to mix. Therefore, this sampling method achieves the decoupling of the light at each pixel position by randomly shuffling the light rays. After random arrangement and combination, a high-definition global image can be obtained, addressing the challenge of acquiring a full-focus image. Subsequently, the Gaussian blur kernel with a standard deviation range of [0,10] is used to generate blurred images, and finally, a total of 118,287 defocused images are obtained. (2) On the basis of obtaining sufficient samples, we propose a CNN-based model that estimates the spatially varying defocus map from a single image. Simulation and physical experimental results demonstrate that our model exhibits good generalization and surpasses other state-of-the-art methods significantly.

The rest of this paper is organized as follows. The optical model and the histogram-invariant spatial aliasing sampling method are presented in section 2. In section 3, we first investigate the impact of image sharpness on defocus estimation; and then conducted detailed defocus estimation experiments following the technical route from spatial invariance to regional variation to spatial variation. The conclusion is represented in section 4.

2. Theory and method

2.1 Optical model

An incoherent system is a linear system of light intensity. The light intensity distribution of an object can be regarded as a weighted sum of a series of two-dimensional pulse functions $\delta (x,y)$. The image of each $\delta$ in the object plane is called the point spread function (PSF), reflecting the light intensity distribution of a point object. Due to the non-interacting property of photons, when an object is divided into discrete point objects of different intensities, the image is computed as the sum of the impulse responses of each point. Therefore, by constructing the relationship between PSF and the imaging system, the distribution of defocus is calculated.

We assume that the optical system can be simplified to a thin lens model shown in Fig. 1. Thin lenses focus all the rays from an object point within the depth of field to one pixel of the sensor and a sharp image is formed. But as the distance from the depth of field increases, a cone of light rays from the lens is not in perfect focus when imaging a point source, the rays emitted by an object point will spread to multiple sensor points, resulting in an optical spot and blurred image. The optical spot is often referred to as the circle of confusion(CoC) where the PSF is approximated by a Gaussian filter [22–25]. Object points have different object distances in the spatial extent, resulting in CoCs with different radii. In Euclidean geometry, similar triangles have the following relationship:

(1)$$\frac{2R}{D} = \frac{s-v}{v}$$

Combining the thin lens imaging formula with the geometric similarity relation in Eq. (1), the radius of CoC ($R$) can calculated from the following equation:

(2)$$R = \frac{fs}{2F}(\frac{1}{f} - \frac{1}{u} - \frac{1}{s})$$

where $F = f/D$ denotes the aperture stop number. The PSF of an object in a microscope or conventional camera as a non-coherent imaging system can be modeled as an isotropic 2D Gaussian function [22–25]:

(3)$$g(x,y;\sigma ) = \frac{1}{2\pi \sigma ^2} exp(-\frac{x^2+y^2}{2\sigma ^2} )$$

where the standard deviation $\sigma$ denotes the defocus amount and is related to the radius of the CoC by $R=\sqrt {2}\sigma$ [12,26]. This means that defocus map estimation is equivalent to PSF’s parameter estimation. According to Eq. (2), the spatially varying object distance produces a spatially varying $R$, which means that PSF is spatially variable. After the PSF model is established, the mechanism of image degradation is described mathematically.

Fig. 1. Illustration of thin lens model. The lens can only gather light to a certain fixed distance, and it will gradually blur away from this point. $u, s, v, D$ denote the object distance, the distance between the lens and image plane, and the image distance and diameter of the aperture, respectively.

Download Full Size | PDF

2.2 Image degradation model

The degradation of the image can be summarized as that the initial all-in-focus image $S(x, y)$ is added with random noise $\eta (x', y')$ after passing through the degradation function $g(x, y)$, and finally the blurred image $B(x', y')$ is constructed. The mathematical form is described as:

(4)$$B(x',y')=\eta (x',y') + \iint_{-\infty }^{+\infty} S(x, y)g(x'-x,y'-y) \, dxdy$$

where $\eta (x,y)$ models the sensors noise [23]. According to the Eq. (2), (3), (4), a defocus image is constructed from an all-in-focus image and defocus map. However, since the objects in the natural scene are not in the same plane and the depth of field of the camera is limited, it is difficult to obtain an all-in-focus image by taking a photo. When re-blurring the defocus image, the superimposed defocus amount is defined as:

(5)$$\begin{aligned} f(u,v) & = g_{1}(x,y)\ast g_{2}(x,y)\\ & = \iint_{-\infty}^{+\infty} g_{1}(x,y)\cdot g_{2}(u-x,v-y) \,dxdy\\ & = \iint_{-\infty}^{+\infty} e^{-\frac{x^2+y^2}{2\sigma _{1} ^{2}} -\frac{(u-x)^2+(v-y)^2}{2\sigma _{2} ^{2}}}\, dxdy\\ & = \frac{\sqrt{2\pi }\sigma _{1} \sigma _{2} }{\sigma _{3}} e^{-\frac{u^2+v^2}{2\sigma ^{2}_{3}} } \end{aligned}$$

where $\sigma ^{2}_{3} = \sigma _{1}^{2} + \sigma _{2}^{2}$.

If an image with a certain degree of defocus $\sigma _2$ is mistakenly regarded as an all-in-focus image, the error as shown in Fig. 2 increases exponentially with the increase of $\sigma _2$. That is the reason why the all-in-focus image is essential. So we propose a histogram-invariant spatial aliasing sampling method shown in Fig. 3(a) for constructing all-in-focus images $S(x, y)$. We use the variance of the Laplacian of an image [27,28] as a measure of the sharpness, which is denoted as:

(6)$$VAR(I)=\frac{1}{NM}\sum_{m}^{M}\sum_{n}^{N}[{\bigtriangledown} lv (m,n)-\overline{\bigtriangledown lv} ] ^2$$

where $\bigtriangledown$ is the Laplacian operator. After using the histogram-invariant spatial aliasing sampling method, the Laplacian variance of the image shown in Fig. 3(b)-(d) is increased from 4.417 to 40808.075. It is worth noting that our method does not have the domain difference problem of simulated images, and the image sources are not limited, which means that all-in-focus images can be mass-produced. With the all-in-focus image, defocus images can be produced in large quantities by giving the defocus map.

Fig. 2. The relationship between defocus kernel error and sharpness of sharp image.

Download Full Size | PDF

Fig. 3. The histogram-invariant spatial aliasing sampling method. (a) Given any RGB image, the all-in-focus image is generated by randomly disrupting the position of each pixel while keeping the RGB channels unchanged. (b) Arbitrary RGB image to be processed. (c) All-in-focus image $S(x, y)$. (d) Histogram before and after spatial aliasing sampling.

Download Full Size | PDF

3. Experiment

3.1 Overview

The optical model shown in Fig. 1 reveals that the out-of-focus images are degenerated by degrading the all-in-focus images which are generated by our proposed spatial aliasing sampling method. The model is expressed by the following formula:

(7)$$D_{\sigma} = \Psi(S, M_{\sigma})$$

where $D_{\sigma }$ denotes the defocus image with defocus amount $\sigma$, $\Psi$ is defined as the propagation process of light in the optical system, $S$ and $M_{\sigma }$ are the all-in-focus images and defocus map respectively. Given a single defocus image $D_{\sigma }$ as input, our model estimates defocus map $M_{\sigma }$ corresponding to each pixel location, which is a kind of inverse problem.

3.2 Influence of sharpness

Sharpness determines the amount of detail an image can reproduce. It is defined by the boundaries between zones of different tones or colors. The out-of-focus phenomenon reflects the mixing relationship between image points and has nothing to do with the image content. By shuffling the position of pixels, it is possible to reconstruct an all-in-focus image where each pixel is independent of other pixels. We collected and organized datasets that are widely used for defocus map estimation and image deblurring. The images can be considered to have a high level of sharpness and regarded as all-in-focus images only when the $VAR(I)$ calculated by Eq. (6) is greater than 1000 [28].

The number of samples of each dataset and the minimum, maximum, mean and variance of its Laplacian variance are listed in Table 1. As shown in Table 1, origin and shuffle respectively represent before and after using the histogram-invariant spatial aliasing sampling method we proposed. Except for DED_sharp, the rest of the datasets are images with a certain degree of defocus. Therefore, the average Laplacian variance of DED_sharp is larger than the Laplacian variance of CUHK, Flickr, DED_defocus and SYNDOF. Unfortunately, there are very few public fully focused image datasets, and the scenes covered by these fully focused images are very rare compared to the COCO. Based on the number of samples and Laplacian variance, the COCO dataset is the best choice for resampling. All in all, the Laplacian variance of all datasets increases significantly, which means that the image generated by using our resampling method is sharper than the original image and can be used as an all-in-focus image.

Table 1. The Laplacian variance of datasets.

View Table | View all tables in this article

To further verify the importance of sharpness and the effectiveness of our proposed resampling method, we use Gaussian kernels with known defocus amounts to blur the samples in COCO and DED, and use ResNet18 [31] for defocus amount estimation where the defocus map is spatial invariant. This method is implemented in Pytorch. We use Adam [32] as the optimization algorithm with a batch size of 32, and the learning rate is initially set to 0.001. We train the ResNet18 for 100 epochs on NVIDIA GeForce RTX 2080Ti. Each dataset is divided into the training set, validation set and test set according to the ratio of 8:1:1. After completing the model training, the absolute relative error (AbsRel), square relative error (SqRel), root mean square error (RMSE), root mean square logarithmic error (RMSElog) and thresholded accuracy ($\delta$) is used to measure the performance of the model. The thresholded accuracy [33,34] is denoted as

(8)$$\delta = \max \{ \frac{y_{pred}}{y_{gt}}, \frac{y_{gt}}{y_{pred}}\} < t\_{thresholded}$$

where $y_{pred}$ is denoted as the defocus amount predicted by ResNet18; $y_{gt}$ is defined as the standard deviation of the Gaussian kernel when generating a defocused image, which is the ground truth of the defocus amount; $t\_{thresholded}$ is denoted as a threshold, to be consistent with previous studies [33,34], the thresholds were set to $1.25$, $1.25^{2}$, and $1.25^{3}$.

As shown in Table 2, the error of the model becomes smaller after using the resampling method. The histogram-invariant spatial aliasing sampling method we proposed greatly improves the sharpness of the image by decoupling the positional relationship between each image point, even if the original image is a blurred image such as COCO and DED_defocus. According to the theoretical analysis of Eq. (5), the sharper the all-in-focus image in defocus simulation, the higher the accuracy of defocus estimation, which is also proved in this experiment.

Table 2. Quantitative evaluation results of datasets with different sharpness.

View Table | View all tables in this article

3.3 Single image defocus map estimation through patch blurriness regression

Since defocus map estimation is a pixel-level regression task and the range of object point spread is only within a few pixels to a dozen pixels, it is particularly important to maintain high-resolution features. Higherhrnet [35] maintains high-resolution representations by connecting high-resolution to low-resolution convolutions in parallel, and enhances high-resolution representations by repeatedly performing multi-scale fusion across parallel convolutions. Therefore, we modify the Higherhrnet architecture with slight changes where the head is set to output single-channel parameters of PSF. It is worth mentioning that our model is easy to use and can be trained end-to-end, any pixel-level regression network such as U-net [36] can be used as the backbone.

Firstly, our model was tested for its ability to estimate the defocus map through patch blurriness regression, and then conduct experiments on samples where the amount of defocus changes with pixel position, and finally conduct experiments on actual collected images. In previous research, one of the technical routes for defocus map estimation is patch blurriness regression, in which a full-size image is uniformly cropped into several small image patches considered to have the same amount of defocus approximately. Given any RGB image shown in Fig. 4(a), the all-in-focus image shown in Fig. 4(b) is generated by the histogram-invariant spatial aliasing sampling method we proposed. After the all-in-focus image is generated, we randomly select a point in the area more than 50 pixels from the boundary as the segmentation point to divide the image into four areas. Then, the four sub-regions are divided again. The difference from the previous step is that the restriction condition becomes an arbitrary selection of dividing points in the area more than 20 pixels from the boundary. The defocus amount is randomly assigned to a different patch shown in Fig. 4(c), and the image patch is convolved with a Gaussian kernel shown in Fig. 4(d). Finally, blurred images shown in Fig. 4(e) with different defocus amounts in different areas are generated. Considering the sharpness and numbers of samples, we chose COCO_shuffle dataset to train the CNN-based models according to the Sec. 3.2, and the comparative results are shown in Fig. 5.

Fig. 4. (a) Arbitrary RGB image to be processed. (b) All-in-focus image $S(x, y)$. (c) Defocus map. (d) Blur the image based on the amount of defocus. (e) Blurred image $B(x, y)$.

Download Full Size | PDF

Fig. 5. Comparative results of backbone. The first line is the input image, the second line is the ground truth and predicted value of the defocus amount in line 128 of the input image, and the third line is the error of the models.

Download Full Size | PDF

Consistent with the previous patch defocus estimation method, the all-in-focus image is convolved with a Gaussian kernel. The defocus image with a size of $256\times 256$ is sent to the CNN-based models named Higherhrnet (HRNet) and UNet, the output is the defocus map. Both HRNEt and UNet are qualified for the defocus map estimation task, but HRNet performs better relatively. Although the method of generating defocus images using Gaussian kernel convolution reduces the complexity of calculation, it isolates the diffusion of light between patches with different defocus amounts. As a result, the UNet model has a large error at the junction of patches. On the contrary, the HRNet has the characteristic of maintaining high-resolution feature maps, so its error fluctuation range is small. In subsequent pixel-level spatial variation defocus estimation experiments, we modify the decoder of the HRNet and design our HDME-Net.

3.4 Pixel-level spatial variation defocus map estimation

Pixel-level spatial variation defocus estimation is to simultaneously estimate the defocus amount at each pixel position based on a single defocus image. According to Eq. (4), when generating a defocus image, a Gaussian PSF is generated based on the defocus amount at each pixel position, and the pixel values of each object point are diffused to the affected pixel positions, along with salt-and-pepper noise and Poisson noise being added to the image. The defocus map shown in Fig. 6(c) records the defocus amount at each pixel position, determines the blur distribution of the defocused image shown in Fig. 6(d), and is also the ground truth compared with the output of CNN model. The distribution of the defocus map should be as close as possible to the depth distribution of the natural scene. However, the mask of semantic segmentation still has the problem of obvious boundaries. In order to simulate a scene with continuous changes in depth, 50% of the defocus map is set to a Gaussian distribution, and the remaining defocus maps are divided by COCO’s semantic segmentation mask. The variation range of all defocus amounts is [0,10]. According to the Eq. (3) and Eq. (4) the defocus images are generated with pixel-level defocus labels. After that, our HDME-Net is trained end-to-end, with the input being a single defocus image $D_\sigma$ and the output being the defocus map $M_\sigma$.

Fig. 6. Two ways to generate defocus image in COCO_shuffle dataset. The first line illustrates the mask division of different defocus areas based on coco semantic segmentation, the second row describes the Gaussian distribution defocus map with variances between 100 and 200. (a) Images collected from COCO. (b) All-in-focus image $S$ generated by the histogram-invariant spatial aliasing sampling method. (c) Defocus map $M_\sigma$. (d) Defocus image $D_\sigma$.

Download Full Size | PDF

We compared our proposed HDME-Net with state-of-the-art defocus map estimation methods from a single image qualitatively and quantitatively. To quantitatively assess, the amount of defocus is normalized. We calculated the mean squared error (MSE) and mean absolute error (MAE) of each method for 11758 test images from COCO_shuffle dataset. The quantitative evaluation results are shown in Table 3. It is observed that our method achieves the best performance in terms of MSE and MAE. With the benefit of the high-resolution feature map network and the histogram-invariant spatial aliasing sampling method, our HDME-Net performs favorably compared with edge-based and other CNN-based methods. This is because defocus map estimation is a pixel-level regression task, and maintaining high-resolution feature maps in the network is beneficial to reducing information loss. Furthermore, an unlimited number of all-in-focus images can be generated by the histogram-invariant spatial aliasing sampling method, which solves the problem of the lack of large-scale data sets to train neural networks.

Table 3. Results on COCO_shuffle dataset in terms of MSE and MAE.

View Table | View all tables in this article

We also report qualitative results compared with other defocus map estimation methods from a single image. The current common practice is that the value of the defocus map output by the model ranges from 0 to 1. When visualizing the defocus map, all defocus amounts are multiplied by 255 to obtain a single-channel grayscale image.

Figure 7 visually reports results generated by our model against previous methods. First, in scenes where depth continuously changes, such as img 0 and img 2, the defocus map generated by our model is continuous. Second, when the object distance of two adjacent object points is greatly different, the defocus amount changes stepwise, and our model still performs well as shown in the second and third rows. In particular, the boundaries of targets at different object distances are clear and distinct. Lastly, the defocus maps estimated by our model are less noisy and more robust. Our method predicts a wider range of defocus amounts and is suitable for a variety of scenes.

Fig. 7. Qualitative comparison between our HDME-Net and other methods. The GT is denoted as the ground truth of the defocus map. In the last row, we report the amount of defocus at line 128 of the input image.

Download Full Size | PDF

When it comes to the real defocused images, the DED [11], CUHK [30], RTF [16], SYNDOF [10] are the most commonly used datasets in the previous spatially varying defocus estimation methods. However, in the DED dataset, the label for the test set is unavailable. The CUHK dataset only has annotation information of focused areas and lacks the defocus amount of each pixel. As for SYNDOF, it consists of simulated pictures without real scenes. Therefore, we quantitatively evaluate the performance of our method in real scenarios through the test results of the model on the RTF dataset which are not used for training our model. The results are shown in Table 4

Table 4. Quantitative evaluation results on RTF dataset in terms of MSE and MAE.

View Table | View all tables in this article

On the RTF dataset, our method is second only to DMENet which performs additional domain adaptation training on defocus images of real scenes. However, when only trained on simulated datasets, our method surpasses most mainstream methods and is close to DMENet. On the one hand, our proposed defocus map estimation model and the histogram-invariant spatial aliasing sampling method have good generalization capabilities and can be applied to spatially varying defocus map estimation in natural scenes. On the other hand, the point spread function is affected by factors such as the shape of the aperture, lens assembly error, and lens material, causing the object points to spread into circles of confusion of different shapes and sizes instead of a strict Gaussian circle of confusion. Nonetheless, by using Fine-tuning we can adapt our model to other dataset sources. In deep learning, Fine-tuning is a kind of transfer learning in which the model is trained on the new dataset with pre-trained weight.

As shown in Table 5, after being fine-tuned using the DED dataset, the MSE and MAE of our model are only 0.0035665955 and 0.0402438257 respectively, which is far better than DMENet(MSE:0.012, MAE:0.088). It is particularly worth noting that the performance of the model without pre-training is slightly worse. Pre-training on the COCO_shuffle dataset first and then fine-tuning on other datasets can optimize the model in the right direction and converge faster, greatly reducing the number of labeled samples required. This laid the foundation for the widespread application of spatially varying defocus estimation technology.

Table 5. Ablation experiment results on RTF in terms of MSE and MAE.^a

View Table | View all tables in this article

A qualitative comparison between our HDME-Net and other methods is shown in Fig. 8. First of all, our method is more robust in homogeneous regions, the defocus map predicted by our proposed model has less noise, and the defocus amount of different objects is estimated to be smoother and the transition is natural. Secondly, as shown in columns 1, 2 and 4, the scene in the input image has areas where the object distance changes drastically, and the defocus amount also changes drastically in this area. Compared with the others, the defocus map predicted by our model more clearly shows the changes in different defocus amounts, and the dividing lines of areas with drastic changes are more obvious. Finally, in general, the defocus map predicted by our model is more delicate and can truly reflect the changes in the focus area without being affected by the image content.

Fig. 8. Qualitative comparison between our HDME-Net and other methods. The GT is denoted as the ground truth of the defocus map. The three columns on the left show images in DED, where the GT is not provided. The following three columns show images in RTF.

Download Full Size | PDF

We further demonstrate the effectiveness of the proposed spatially varying defocus map estimation model, HDME-Net, and the histogram-invariant spatial aliasing sampling method. As shown in Fig. 9, we have built an optical system that consists of an electronically controlled translation stage with a single-step accuracy of 0.001 mm, 3 3D printed letters, and a fixed focus lens. The images with different degrees of defocus were obtained by changing the position of the image plane at equal intervals (0.02mm).

Fig. 9. Experimental device, blur images taken at different image distances and corresponding defocus map calculated by our method.

Download Full Size | PDF

Although we cannot calculate the theoretical value of defocus without knowing the precise object distance and lens parameters, we can evaluate our method by observing the change in defocus amount at a certain point on the plastic parts "B", "I", and "T". As shown in the upper right subfigure of Fig. 9, the defocus amount curve predicted by our model has obvious unimodality, which means that our method can still be well applied to autofocus even if the lens parameters are not known. When the parameters of the lens are known, the image distance $v$ is calculated by reading the value of $s$ in the encoder according to Eq. (1), which means that autofocus can be achieved at any pixel position with a single image and a single adjustment of the lens. In addition, the defocus curve has an obvious offset in the x-axis direction, corresponding to the three plastic parts located at different object distances. Our method can also be applied to monocular depth estimation from a single image at a certain distance according to Eq. (2).

4. Conclusion

In this paper, we propose a robust and accurate defocus map estimation method from a single image based on the spatial aliasing sampling method. For the first time, we created a defocused dataset with more than 10,000 images and pixel-level annotations. Existing datasets primarily suffer from the problems of small sample size and lack of universality. The limited number of samples results from the difficulty of manually annotating defocus and the low accuracy of existing detectors. The lack of universality is because of the different shapes of apertures, which directly affect the shape of the PSF. In addition, the PSF of the system is superimposed with the effects of aberrations such as distortion, chromatic aberration and tilt error, resulting in the lack of universality of data produced using a single device. Through the spatial aliasing sampling method our proposed, high-definition all-focus images are reconstructed in large batches. This serves as a data foundation for training models that estimate spatially varying Gaussian-distributed defocused PSFs and acts as a pre-training dataset for models estimating other types of PSFs. Moreover, we introduce a high-resolution network for spatially varying defocus map estimation. Any image whose size is divisible by 16 can be transported to our model to obtain a pixel-by-pixel defocus map. Experiments conducted on the defocus map estimation have verified the effectiveness of the proposed method. The accuracy of our defocus map estimation model is significantly higher than that of state-of-the-art models.

Disclosures

The authors declare no conflicts of interest.

Data availability

Our code and dataset underlying the results presented in this paper are available from Github [42]. Other data come from benchmark datasets and do not raise any ethical issues. Use of images abides by the Flickr Terms of Use.

References

1. S. Zhuo and T. Sim, “Defocus map estimation from a single image,” Pattern Recognit. 44(9), 1852–1858 (2011). [CrossRef]

2. E. Lee, E. Chae, H. Cheong, et al., “Depth-based defocus map estimation using off-axis apertures,” Opt. Express 23(17), 21958–21971 (2015). [CrossRef]

3. Y. Cao, Z. Ye, Z. He, et al., “Multi-channel residual network model for accurate estimation of spatially-varying and depth-dependent defocus kernels,” Opt. Express 28(2), 2263–2275 (2020). [CrossRef]

4. A. Shajkofci and M. Liebling, “Spatially-variant cnn-based point spread function estimation for blind deconvolution and depth estimation in optical microscopy,” IEEE Trans. on Image Process. 29, 5848–5861 (2020). [CrossRef]

5. S. Xin, N. Wadhwa, T. Xue, et al., “Defocus map estimation and deblurring from a single dual-pixel image,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (2021), pp. 2228–2238.

6. A. Abuolaim, M. Afifi, and M. S. Brown, “Improving single-image defocus deblurring: How dual-pixel images help through multi-task learning,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (2022), pp. 1231–1239.

7. A. Zhang and J. Sun, “Joint depth and defocus estimation from a single image using physical consistency,” IEEE Trans. on Image Process. 30, 3419–3433 (2021). [CrossRef]

8. Q. Ye, M. Suganuma, and T. Okatani, “Accurate single-image defocus deblurring based on improved integration with defocus map estimation,” in International Conference on Image Processing (IEEE, 2023), pp. 750–754.

9. K. Xin, S. Jiang, X. Chen, et al., “Low-cost whole slide imaging system with single-shot autofocusing based on color-multiplexed illumination and deep learning,” Biomed. Opt. Express 12(9), 5644–5657 (2021). [CrossRef]

10. J. Lee, S. Lee, S. Cho, et al., “Deep defocus map estimation using domain adaptation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2019), pp. 12222–12230.

11. H. Ma, S. Liu, Q. Liao, et al., “Defocus image deblurring network with defocus map estimation as auxiliary task,” IEEE Trans. on Image Process. 31, 216–226 (2022). [CrossRef]

12. S. Liu, F. Zhou, and Q. Liao, “Defocus map estimation from a single image based on two-parameter defocus model,” IEEE Trans. on Image Process. 25(12), 5943–5956 (2016). [CrossRef]

13. A. Karaali, N. Harte, and C. R. Jung, “Deep multi-scale feature learning for defocus blur estimation,” IEEE Trans. on Image Process. 31, 1097–1106 (2022). [CrossRef]

14. Q. Ye, M. Suganuma, and T. Okatani, “Accurate single-image defocus deblurring based on improved integration with defocus map estimation,” in International Conference on Image Processing (IEEE, 2023), pp. 750–754.

15. J. Shi, L. Xu, and J. Jia, “Discriminative blur detection features,” in Conference on Computer Vision and Pattern Recognition (IEEE, 2014), pp. 2965–2972.

16. L. D’Andrès, J. Salvador, A. Kochale, et al., “Non-parametric blur map regression for depth of field extension,” IEEE Trans. on Image Process. 25(4), 1660–1673 (2016). [CrossRef]

17. J. Hoffman, E. Tzeng, T. Park, et al., “CyCADA: Cycle-consistent adversarial domain adaptation,” in Proceedings of the 35th International Conference on Machine Learning, vol. 80 of Proceedings of Machine Learning ResearchJ. DyA. Krause, eds. (PMLR, 2018), pp. 1989–1998.

18. Y. Lu and G. Lu, “Self-supervised single-image depth estimation from focus and defocus clues,” IEEE Robot. Autom. Lett. 6(4), 6281–6288 (2021). [CrossRef]

19. J. Park, Y.-W. Tai, D. Cho, et al., “A unified approach of multi-scale deep and hand-crafted features for defocus estimation,” in Conference on Computer Vision and Pattern Recognition (IEEE, 2017), pp. 1736–1745.

20. X. Yu, X. Zhao, Y. Sui, et al., “Handling noise in single image defocus map estimation by using directional filters,” Opt. Lett. 39(21), 6281–6284 (2014). [CrossRef]

21. C. Tang, C. Hou, and Z. Song, “Defocus map estimation from a single image via spectrum contrast,” Opt. Lett. 38(10), 1706–1708 (2013). [CrossRef]

22. T. Sakamoto, “Model for spherical aberration in a single radial gradient-rod lens,” Appl. Opt. 23(11), 1707–1710 (1984). [CrossRef]

23. Y.-W. Tai and M. S. Brown, “Single image defocus map estimation using local contrast prior,” in 16th International Conference on Image Processing (IEEE, 2009), pp. 1797–1800.

24. J. Zhang, B. Luo, Z. Xiang, et al., “Deep-learning-based adaptive camera calibration for various defocusing degrees,” Opt. Lett. 46(22), 5537–5540 (2021). [CrossRef]

25. J. Zhang, B. Luo, X. Su, et al., “Depth range enhancement of binary defocusing technique based on multi-frequency phase merging,” Opt. Express 27(25), 36717–36730 (2019). [CrossRef]

26. T.-C. Wei, Three Dimensional Machine Vision Using Image Defocus (State University of New York at Stony Brook, 1994).

27. J. L. Pech-Pacheco, G. Cristóbal, J. Chamorro-Martinez, et al., “Diatom autofocusing in brightfield microscopy: a comparative study,” in 15th International Conference on Pattern Recognition, vol. 3 (IEEE, 2000), pp. 314–317.

28. F. Galetto and G. Deng, “Single image defocus map estimation through patch blurriness classification and its applications,” Vis. Comput. 39(10), 4555–4571 (2023). [CrossRef]

29. T.-Y. Lin, M. Maire, S. Belongie, et al., “Microsoft coco: Common objects in context,” in Computer Vision: 13th European Conference, Part V 13 (Springer, 2014), pp. 740–755.

30. J. Shi, L. Xu, and J. Jia, “Discriminative blur detection features,” in Conference on Computer Vision and Pattern Recognition (IEEE, 2014), pp. 2965–2972.

31. K. He, X. Zhang, S. Ren, et al., “Deep residual learning for image recognition,” in Conference on computer vision and pattern recognition (IEEE, 2016), pp. 770–778.

32. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv, arXiv:1412.6980 (2014). [CrossRef]

33. S. Gur and L. Wolf, “Single image depth estimation trained via depth from defocus cues,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019), pp. 7683–7692.

34. H. Si, B. Zhao, D. Wang, et al., “Fully self-supervised depth estimation from defocus clue,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023), pp. 9140–9149.

35. B. Cheng, B. Xiao, J. Wang, et al., “Higherhrnet: Scale-aware representation learning for bottom-up human pose estimation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2020), pp. 5386–5395.

36. O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention: 18th International Conference, Proceedings, Part III 18 (Springer, 2015), pp. 234–241.

37. D.-J. Chen, H.-T. Chen, and L.-W. Chang, “Fast defocus map estimation,” in International Conference on Image Processing (IEEE, 2016), pp. 3962–3966.

38. B. Su, S. Lu, and C. L. Tan, “Blurred image region detection and classification,” in 19th ACM international conference on Multimedia (2011), pp. 1397–1400.

39. A. Karaali and C. R. Jung, “Edge-based defocus blur estimation with adaptive scale selection,” IEEE Trans. on Image Process. 27(3), 1126–1137 (2018). [CrossRef]

40. J. Park, Y.-W. Tai, D. Cho, et al., “A unified approach of multi-scale deep and hand-crafted features for defocus estimation,” in Conference on Computer Vision and Pattern Recognition (IEEE, 2017), pp. 2760–2769.

41. J. Shi, L. Xu, and L. Jia, “Just noticeable defocus blur detection and estimation,” in Conference on Computer Vision and Pattern Recognition (IEEE, 2015).

42. P. Yang, M. Liu, J. Dong, et al., “HDME-Net,” Github (2024), https://github.com/67689E4F/HDME-Net.git.

Datasets	Samples	Min	Max	Mean	Sth
Datasets	Samples	Origin / Shuffle	Origin / Shuffle	Origin / Shuffle	Origin / Shuffle
COCO [29]	118287	3.56 / 0.00	48154.60 / 206261.64	2771.59 / 47690.27	2964.26 / 22697.97
CUHK [30]	504	13.87 / 7412.92	9389.04 / 129600.15	1009.54 / 51349.12	1291.57 / 20910.01
DED_defocus [11]	1112	4.01 / 1919.44	5048.90 / 104556.91	462.10 / 36630.67	572.08 / 13164.71
DED_sharp [11]	1112	26.23 / 3769.09	7503.90 / 113594.68	1464.64 / 40475.24	1230.58 / 13376.81
Flickr [10]	2142	1.26 / 2418.79	17828.72 / 137386.76	526.54 / 42319.04	1051.89 / 21320.69
SYNDOF [10]	9231	2.00 / 2903.00	1577.24 / 112961.11	190.71 / 31125.82	150.71 / 19273.89

Datasets	AbsRel↓	SqRel↓	RMSE↓	RMSElog↓	$δ < 1.25$ ↑	$δ < {1.25}^{2}$ ↑	$δ < {1.25}^{3}$ ↑
COCO	0.146	0.047	1.047	0.198	0.802	0.975	0.993
COCO_shuffle	0.133	0.029	1.197	0.184	0.812	0.973	0.996
DED_defocus	0.800	2.298	2.750	0.689	0.283	0.511	0.707
DED_defocus_shuffle	0.605	1.045	2.402	0.610	0.348	0.663	0.804
DED_sharp	0.506	0.650	2.570	0.633	0.239	0.587	0.750
DED_sharp_shuffle	0.479	0.913	2.461	0.524	0.326	0.696	0.848

	DFE [1]	FDM [37]	SVD [38]	DID-ANet [11]	DMENet [10]	Ours
MSE	0.38958	0.14706	0.22008	0.18673	0.01538	0.00051
MAE	0.57893	0.32491	0.41792	0.33701	0.09674	0.00842

	[39]	[40]	[16]	[1]	[41]	[11]	[10]	HDME-Net
MSE	0.064	0.024	0.033	0.037	0.082	0.0985	0.012	0.0398
MAE	0.199	0.129	0.106	0.143	0.241	0.2392	0.088	0.1645

	DED L1	Fine-tuning L1	DED L2	Fine-tuning L2
MSE	0.0058466330	0.0040898231	0.0049015883	0.0035665955
MAE	0.0453767378	0.0408832504	0.0469117243	0.0402438257

Spatially varying defocus map estimation from a single image based on spatial aliasing sampling method

Abstract

1. Introduction

2. Theory and method

2.1 Optical model

2.2 Image degradation model

3. Experiment

3.1 Overview

3.2 Influence of sharpness

3.3 Single image defocus map estimation through patch blurriness regression

3.4 Pixel-level spatial variation defocus map estimation

4. Conclusion

Disclosures

Data availability

References

Data availability

Cited By

Figures (9)

Tables (5)

Equations (8)

Optics Express