Joint estimation of depth and motion from a monocular endoscopy image sequence using a multi-loss rebalancing network

Shiyuan Liu; Jingfan Fan; Jingfan Fan; Dengpan Song; Tianyu Fu; Yucong Lin; Deqiang Xiao; Hong Song; Yongtian Wang; Yongtian Wang; Jian Yang

doi:10.1364/BOE.457475

1. Introduction

Minimally invasive surgery (MIS) is extensively applied in clinical surgery, such as head and neck, cardiovascular, anorectal, and urological surgeries [1,2]. To avoid the lack of perception of surgical space in endoscopy imaging, 3D reconstruction technology has been applied to the MIS [3]. This application helps them avoid critical structures close to the path of surgical instruments [4]. however, the accuracy of the actual location information in 3D space is insufficient [5,6], such as the loss of depth perception and the difficulty in assessing the trajectory of endoscopy motion [7,8]. To provide doctors with a more realistic surgical information, estimating the depth of endoscopy images and reconstructing camera motion trajectory are necessary to provide a more accurate operating environment for clinical surgery [9].

Depth information is difficult to estimate from a single image in the case of high precision and low resource investment in clinical surgery [10]. To obtain accurate depth and pose information whilst observing with an endoscopy, the following two methods can be used. First, a binocular endoscopy can obtain accurate depth information by using the intrinsic epipolar geometric parameters of the equipment [11]. Second, photometric stereoscopes can capture minute details by using programmable light sources [12,13]. However, the hardware volume of these two methods is unsuitable for the small space of clinical surgery. Accordingly, positioning technology based on visual information and electromagnetic tracking has been extensively studied in the field of medical endoscopy. The development of vision based simultaneous localization and mapping (SLAM) systems can provide powerful support for endoscopy localization and 3D surface reconstruction, which is crucial to realize functions in augmented reality systems [14,15]. Approaches such as vision-based reconstruction techniques have been successfully applied to a variety of anatomical settings such as skull base surgery, abdominal surgery, and intestinal surgery [16,17]. In contrast, the dense multi-view stereo method attempt to reconstruct highly detailed 3D geometry in the images using known camera poses [18–20]. However, monocular endoscopy data sets cannot support effective visual feature extraction, and inadequate camera positioning leads to fuzzy estimation at the global scale [21]. The visual SLAM addresses this shortcoming by enabling camera localization and soft tissue mapping in the model dataset, where continuous frames of the endoscopy surgical video have been validated in robotic surgical systems, such as the Leonardo Da Vinci surgical robot [22]. A deformation field based on embedded deformation nodes is introduced to build 3D models of soft tissues with different degrees of deformation by gradually integrating new observations [23–25]. However, some inevitable technical challenges such as motion blur and resolution limitations difficultly meet real time surgical accuracy [26,27].

In recent studies, deep learning methods can achieve better results in complex scene estimation, which benefits from local variation between images and function fitting [28]. Accordingly, monocular endoscopy images are capable of intensive depth estimation through an end-to-end network model [29]. Unsupervised monocular depth estimation based on view synthesis, which performs well in the field of computer vision, is usually difficult to use in obtaining good depth estimation owing to the inherent unstable illumination conditions and similar textures of endoscopy images [30,31]. Several works based on feature descriptor learning that attempt to solve this challenge through the training branch network learns key point detectors and dense descriptors [32]. Moreover, the fusion of texture-less information usually has a large acceptance domain by using the matching point pairs projected from structure from motion (SFM) reconstruction results [33]. Therefore, the method based on feature descriptor learning can obtain more dense features matching by obtaining dense feature descriptors with a certain degree of stability to perform feature matching [34].

In single frame image depth estimation, the corresponding ground truth of a medical endoscopy image is difficult to obtain, thereby preventing effective supervision network training [35]. Some researchers have attempted to solve this problem through self-monitoring learning method. Mahmood et al. [31] monitored network training by using a colonic model dataset and applied it to real colonic images for depth estimation. However, this method requires that the anatomical structure model be similar to the real endoscopy image to obtain a more accurate estimate. Furthermore, Turan et al. [10] designed an unsupervised approach for a monocular endoscopy capsule robot. However, this can result in network training to get trapped in the gradient locality is lost, especially for texture-less regions. Liu et al. [36] also proposed a dense depth estimation method for monocular endoscopy. In this method, a sparse point cloud is reprojection to obtain the sparse depth map and sparse flow map to supervise network training. Considering the timing information between successive frames, Dusmanu et al. [37] used the estimated depth value of the previous frame to estimate the depth of the next frame. They achieved a better result of pose estimation and improved the accuracy of depth estimation to a limited extent [38]. The endoscopy motion and final refinement and global optimization also is usually performed off-line, in which technique, features were matched by extracting normalized cross correlation between images [2]. Those features were used to reconstruct the 3D structure [2,39]. Other methods for endoscopy localization also came up a method that took in raw image data instead of geometric data for motion estimation [29,40]. Methods for obtaining endoscopy motion from visual input in both monocular and stereo systems was proposed [41,42]. They are also capable of avoiding error prone feature extraction and matching techniques [43]. In addition, Deep Learning technologies have achieved significant results on localization related applications [44]. The visual odometry function was utilized after estimation of depth and velocity to predict the discretized changes of direction and velocity. However, the problem with the blurred and under-exposed images, which one of the biggest hindrance for applying endoscopy images [22]. To overcome this problem, another alternative method was proposed where the CNNs were provided with dense optical flow instead of RGB images [45].

The present study extends a preliminary version of a previous work. In the preliminary version, we proposed a “feature descriptor learning based on sparse feature matching” method. In this paper, the sparse feature matches are mapped to dense feature descriptors by learning network, and use dense feature descriptors to provide supervision for depth and motion estimation networks. The main contributions of this work are summarized as follows: First, we construct an end-to-end feature-learning network with sparse feature matching of images, which can achieve stable dense feature descriptor extraction. Second, by considering the epipolar constraints of sequence frame with dense feature descriptors to provide supervisory signals for depth and motion estimation, the accuracy of depth and motion estimation is improved. Third, we use a multi-loss rebalancing framework to estimate the depth and motion of monocular endoscope images. A series of evaluations is conducted on different types of endoscopy images. Our method can reconstruct accurate 3D surface information of operation spatial.

2. Method

Endoscopy depth and motion estimation are essential for constructing an internal 3D surface model, which can feedback information on the internal environment and thus ensure successful assisting surgeons during operations. In this section, we describe methods to train convolutional neural networks for depth estimation in monocular endoscopy using feature descriptor learning method based on video sequences frame. We explain how Retrieval of supervisory signal from dense feature descriptors are extracted, and introduce a multi-information joint learning framework and loss functions to enable self-supervised learning based on these signals. The overall training architecture is shown in Fig. 1, where all concepts are introduced in this section.

Fig. 1. Network architecture. Our network includes depth and motion estimation of images.

Download Full Size | PDF

2.1 Network architecture

The overall training architecture is shown in Fig. 1. We describe methods of training convolutional neural networks for depth estimation in monocular endoscopy through the feature descriptor learning method based on video sequence frames. We explain how the retrieval of supervisory signal from dense feature descriptors are extracted and then introduce a multi-information joint-learning framework and loss functions to enable self-supervised learning based on these signals.

We extract two images from the dataset as the source image and the target image into the network. The network architecture consists of dense descriptor extraction and depth pose estimation, as shown in Fig. 1. The purple rectangles are loss modules. In the sparse descriptor extraction step: (1). We perform visual feature matching on the input source-target image pairs to obtain sparse matching. (2). Descriptor map corresponding to input images are obtained by feature learning network. (3). The relative response loss established by sparse matching point pairs and descriptor map. In the depth and motion estimation step: (1). We first perform intensive feature matching on the descriptor map obtained in the sparse descriptor extraction step, and N pairs of dense matching are obtained. (2). Depth estimation network and motion estimation network obtai the depth and trajectory of the input image respectively. (3). to provide supervisory signals for depth and motion estimation networks, descriptor map and dense matching are used to construct feature consistency loss and epipolar consistency loss.

In the image depth estimation and endoscopy motion process, we perform intensive feature matching on the descriptor map obtained by the intensive descriptor extraction module and obtain N pairs of intensive matching point pairs. depth estimation network and motion estimation network obtain supervision signals from descriptor map and dense feature matching, respectively. The epipolar consistency loss and feature consistency loss encourages the geometry consistency between samples in a batch, which minimizes the geometric distance of predicted depths between each consecutive pair and enforces their scale-consistency. With training, the consistency can propagate to the entire video sequence. Finally, the depth estimation of image and the motion of camera are obtained.

2.2 Dense descriptor extraction

In this part, the previous work to extract dense descriptors from input source images and target images and obtain their corresponding descriptor subgraphs by end-to-end feature learning network. Our feature learning network comprises branches of residual blocks with shared weights. As shown in Fig. 2. The encoder comprises a convolution layer and 9 ResNET-v2 [38]. The input image is sampled twice under the average pooling layer to obtain the extracted feature map, and the image size changes from $C \times H \times W$ to $C \times \frac{1}{4}H \times \frac{1}{4}W$, where, the H and W are the height and width of the input image, and C is the number of channels to extract the feature image. The decoder network comprises three convolution layers, and a dense feature descriptor map with size $L \times H \times W$ is obtained after two bilinear up-sampling, where, L is the dimension of the dense feature descriptor. In this paper, our experiment set $\; L = 32$.

Fig. 2. The feature network architecture.

Download Full Size | PDF

It is worth noting that the filling step of image edge zeroing adopted by the convolution layer will produce checkerboard artifacts to the edge region of the dense feature descriptor graph. We use bilinear interpolation for each convolution operation to keep the image size constant and avoid checkerboard artifacts. Similarly, convolution layer and bilinear upsampling layer are used to restore image size in the decoder. Finally, to avoid the influence of brightness difference between images and improve the universality of descriptors, L2 normalization is performed for descriptors along the channel.

2.3 Depth and motion estimation

We predict depth map $({{D_s},{D_t}} )$ from the input source and target image pair $({{I_s},{I_t}} )$ by two branch networks with shared weights, where, each branch is composed of encoder and decoder, the encoder adopts ResNet34 [38], and the decoder is similar to. To limit the range of the predicted depth value, the activation function of the prediction layer is expressed as:$\; y = 1/({ax + b} )$. where, x is the output value of the $sigmoid$ activation function of the prediction layer. $a = ({1/{d_{min}} - 1/{d_{max}}} ),b = 1/{d_{max}}$. ${d_{min}}$ and ${d_{max}}$ represent the minimum and maximum depths of the endoscopy image respectively, y is the predicted depth of the current frame.

For endoscopy motion, we use motion network estimate the 6DoF relative pose ${T_{t \to s}} = [{R_{t \to s}}|{t_{t \to s}}] \in SE(3 )$ of input image to $({{I_s},{I_t}} )$. The motion network comprises seven convolution layers, which output value is multiplied by a proportional coefficient $\rho $ to limit the range of output values. For any pixel coordinate ${p_t}$ in the target image ${I_t}$, the corresponding pixel coordinate ${p_s}$ in the source image can be transformed by the following determination:

(1)$${{p_s}\sim {K_s}[{{R_{t \to s}}\textrm{|}{t_{t \to s}}} ]{D_t}({{p_t}} )K_t^{ - 1}{p_t}}$$

where, the ${\sim} $ represents equal homogeneous coordinates, the ${K_s}$ and ${K_t}$ are the internal parameter matrices of the camera corresponding to the source image and target image, respectively. ${D_t}({{p_t}} )$ is the depth at the coordinate point ${p_t}$ in the target image. Based on the above transformation method, the source descriptor map ${F_s}$ and target descriptor map ${F_t}$ can be obtained from the coordinate correspondence between the target image and source image. Then, a bilinear sampling method is used to generate the synthesize target descriptor map $F_t^s$ from the source descriptor map ${F_s}$. We minimise the reduction error between the original target descriptor map ${F_t}$ and the synthesised target descriptor map $F_t^s$ through the following method:

(2)$${{\mathrm{{\cal H}}_{fc}} = \frac{1}{{\sum M}}\mathop \sum \limits_{p \in M} {{({F_t^s(p )- {F_t}(p )} )}^2}}$$

where, M is the binary mask, which determines whether the coordinate points in the target image fall into the effective region in the source image through transformation, and $\sum M$ defines the number of points in $\; M$.

2.4 Loss function and implementation

We use endoscopy image feature matching method to obtain sparse matching sets $\{{{x_s} \leftrightarrow {x_t}} \}$ between adjacent frames, which are used to construct relative response loss monitoring network training. For the input pair of source image ${I_s}$ and target image ${I_t}$, the feature descriptor learning network generate the corresponding dense feature descriptor map ${F_s}$ and ${F_t}$. We define the relative response loss ${\mathrm{{\cal L}}_{rr}}$ as follows:

(3)$${{\mathrm{{\cal L}}_{rr}} ={-} \log \left( {\frac{{{e^{\sigma {R_t}({{x_t}} )}}}}{{\mathop \sum \nolimits_x {e^{\sigma {R_t}(x )}}}}} \right)}$$

where, ${R_t}$ is the response graph, we use the square of Euclidean distance as the similarity evaluation index for each position on ${F_s}({{x_s}} )$ and ${F_t}$, which can be simplified as: $dis{t^2} = ({2 - 2\cos ({{f_1},{f_2}} )} ),\; {f_1} \in {F_s},\; {f_2} \in {F_t}$. In the experiment, to avoid excessive distance range, we normalise it to $[{0,1} ]$ by $D = {e^{ - dis{t^2}}}$. To make the network pay more attention to the regions with large similarity and reduce the influence of the descriptors with large difference, the scale factor $\sigma $ is used to expand the range of ${R_t}$. To avoid the influence of photometric error on matching, we make a weighted error for relative response loss $\mathrm{{\cal L}}_{rr}^{\prime}$ as follows:

(4)$${\mathrm{{\cal L}}_{rr}^{\prime} = {\gamma _i}{\mathrm{{\cal L}}_{rr}}}$$

(5)$${{\gamma _i} = {e^{ - 2{{({D_{dt}^i - \overline {{D_{dt}}} } )}^2}}}}$$

(6)$${D_{dt}^i = \frac{{\|x_{det}^i - x{{_t^i}\|_2}}}{{\mathop {\max }\limits_k ({\|x_{det}^k - x{{_t^k}\|_2}} )}}}$$

where, $\overline {{D_{dt}}} $ is the average of $D_{dt}^i$. ${\gamma _i}$ is the weighted error for relative response loss, the relative response loss is weighted by the distance between the key points detected and the actual position. For the k group sparse matching pairs $x_s^i \leftrightarrow x_t^i,({i = 1,2, \ldots k} )$, we take the maximum of ${R_t}$ as the key point ${x_{det}}$ detected by the network. The network optimizes most of the correct matching point pairs, to effectively avoid the influence of the wrong matching point pairs and accelerate the network convergence. Then, we use the edge aware smoothness function ${\mathrm{{\cal L}}_s}$ to constrain the smoothness of the depth map:

(7)$${{\mathrm{{\cal L}}_s} = |{{\partial_x}d_t^\ast } |{e^{ - |{{\partial_x}{I_t}} |}} + |{{\partial_y}d_t^\ast } |{e^{ - |{{\partial_y}{I_t}} |}}}$$

where, ${\partial _x}$ and ${\partial _y}$ are the derivatives of x and y directions respectively. $d_t^\ast{=} {d_t}/\overline {{d_t}} $ is the average normalized inverse depth used to avoid deep shrinkage [28]. To ensure consistency of the depth estimation result of each frame image, we transform the target depth map to generate the synthesized depth map $D_s^t$ of the source image. Then, the depth map ${D_s}$ of the original source image is sampled to obtain the sampled source depth map $D_s^{\prime}$ by using the coordinate correspondence ${p_s} \leftrightarrow {p_t}$. Finally, we calculate the consistency between the $D_s^t$ and $D_s^{\prime}$. The feature consistency loss function ${\mathrm{{\cal L}}_{fc}}$ is defined as follows:

(8)$${{\mathrm{{\cal L}}_{fc}} = \frac{1}{{\sum M}}\mathop \sum \limits_{p \in M} \mathrm{\omega }(p )\cdot {\mathrm{{\cal H}}_{fc}}(p )}$$

(9)$${\omega = 1 - {\mathrm{{\cal M}}_d}}$$

(10)$${{\mathrm{{\cal L}}_c} = \frac{1}{{\sum M}}\mathop \sum \limits_{p \in M} \frac{{|{D_s^t(p )- D_s^{\prime}(p )} |}}{{D_s^t(p )+ D_s^{\prime}(p )}} = \frac{1}{{\sum M}}\mathop \sum \limits_{p \in M} {\mathrm{{\cal M}}_d}(p )}$$

where, ${\mathrm{{\cal M}}_d}$ is depth inconsistency diagram, $\mathrm{\omega }$ is the weight mask with a range of [46,1]. ${\mathrm{{\cal L}}_c}$ is the consistency loss. Additionally, to provide a more stable and effective monitoring signal for depth and pose estimation, we use the epipolar consistency loss and the reprojection loss to supervise the pose and depth estimation of the network. As show in Fig. 3. For the feature matching point set $S = \{{p \leftrightarrow p^{\prime}} \}$ between the source and target images, the epipolar consistency loss is established by using the epipolar consistency method combined with the relative pose ${T_{t \to s}} = [{R_{t \to s}}|{t_{t \to s}}] \in SE(3 )$ estimated from the network.

(11)$${{\mathrm{{\cal L}}_{epi}} = \mathop \sum \limits_{\forall p \leftrightarrow p^{\prime} \in S} dist({p^{\prime},Fp} )}$$

where, $dist({ \cdot , \cdot } )$ represents the distance from the point to the epipolar line. $Fp$ is the epipolar line. Then, the reprojection loss is obtained by calculating the error between S and the reprojection coordinates.

(12)$$\begin{aligned}{\mathrm{{\cal L}}_{rep}} &= \mathop \sum \limits_{\forall p \leftrightarrow p^{\prime} \in S} {\|p_s} - {p^{\prime}\|_2}\\ &= {\mathop \sum \limits_{\forall p \leftrightarrow p^{\prime} \in S} {\|K_s}[{{R_{t \to s}}\textrm{|}{t_{t \to s}}} ]{D_t}(p )K_t^{ - 1}p - {{p^{\prime}}\|_2}}\end{aligned}$$

Fig. 3. The schematic of epipolar consistency loss.

Download Full Size | PDF

We design the network structure of the depth and motion estimator based on multi-loss rebalancing. In this way, the stability and consistency of depth prediction in propagation between video sequences can be fully guaranteed. The multi-loss rebalancing function L is defined as follows:

(13)$${L = {k_1}\mathrm{{\cal L}}_{rr}^{\prime} + {k_2}{\mathrm{{\cal L}}_{fc}} + {k_3}{\mathrm{{\cal L}}_s} + {k_4}{\mathrm{{\cal L}}_c} + {k_5}{\mathrm{{\cal L}}_{epi}} + {k_6}{\mathrm{{\cal L}}_{rep}}}$$

where, we use weight parameters of different losses in this paper: ${k_1} = 1,{k_2} = 1,{k_3} = 0.1,{k_4} = 2,{k_5} = 0.001,{k_6} = 0.001$.

3. Experiments and results

3.1 Datasets and experiment

Our training data are generated from two unlabeled endoscopy videos. One is from the AFR dataset [47]. The other is from the EndoSLAM dataset [13]. In AFR, contains 7 different scenario datasets of surface, image resolution is 1280*1024. The ground truth of depth estimation for abdominal surface was obtained by structured light camera, and the ground truth of endoscopy motion was recorded with robotic arm. In EndoSLAM, contains 14 different scenario datasets of internal surface, image resolution is 320*320 and 640*480. The ground truth of depth estimation for cavity inside surface was obtained by 3D scanners and CT reconstruction, and the ground truth of endoscopy motion was recorded with robotic arm. As show in Table 1. we use four video data for training set and one video for verification set. To simulate different pose speeds of the camera, we sample the original video at intervals of 3 and 20 frames to obtain sub-video sequences. We perform gamma transform $({\mathrm{\gamma } = 0.5} )$ on each frame of the image to reduce the influence of the original image light and shade difference. Meanwhile, for training and testing efficiency, we sample the original monocular endoscopy image from $1024 \times 1280$ to $256 \times 320$ resolution. Moreover, we perform sparse feature matching for adjacent frames, and randomly select $k = 20$ pairs for sparse matching $x_s^i \leftrightarrow x_t^i,({i = 1,\; 2,\; \ldots k} )$ to provide supervisory signals for feature learning network. Furthermore, to improve the robustness of the model, we extend the color contrast and saturation of the training data.

Table 1. Dataset details of endoscopy image

View Table | View all tables in this article

All experiments are conducted on a workstation with NVIDIA GeForce RTX 3090 GPU, with 24 GB memory. Pytorch framework is used to build the learning network. The initial learning rate for feature learning is $1{e^{ - 4}}$, and the initial learning rate for depth and pose estimation is $5{e^{ - 4}}$ and $2{e^{ - 4}}$, respectively. We reduced the learning rate update method from 0.2 per 40 epochs. Meanwhile, Adam optimizer with parameter ${\beta _1} = 0.9,{\beta _2} = 0.999$ is selected in this paper.

Furthermore, for the learning of dense feature descriptors, we use two datasets in the experiment of dense feature descriptors to verify the effect of the learned descriptors in scenes with relatively similar textures. On the AFR dataset, we train 20 epochs for feature descriptors network, and the learning rate decreases from $1{e^{ - 4}}$ per 10 Epochs to 0.2. Other data processing and parameter settings are set the same as the joint training. On EndoSLAM dataset, we down-sampled the image to $256 \times 320$ resolution and other data processing and parameter settings are the same as in the AFR dataset.

3.2 Performance evaluation

The performance of the proposed depth estimation and the camera trajectory reconstruction method was evaluated using multiple indicators. The absolute relative error (AbsRel) was an indicator adopted to evaluate the accuracy of depth estimation. Other evaluation indicators included the relative error square (SqRel), the root mean square error (RMSE), the root mean square error log (logRMSE), depth and accuracy. For the accuracy of camera trajectory reconstruction, the absolute translational error (ATE) and relative posture error (RPE) to evaluate the accuracy of the results, using the following formula:

(14)$${AbsRel = \frac{1}{n}\mathop \sum \limits_{i = 1}^n \frac{{|{d_i^{pred} - d_i^{gt}} |}}{{d_i^{gt}}}}$$

(15)$${SqRel = \frac{1}{n}\mathop \sum \limits_{i = 1}^n \frac{{{{|{d_i^{pred} - d_i^{gt}} |}^2}}}{{d_i^{gt}}}}$$

(16)$${RMSE = \sqrt {\frac{1}{n}\mathop \sum \limits_{i = 1}^n {{|{d_i^{pred} - d_i^{gt}} |}^2}} }$$

(17)$${logRMSE = \sqrt {\frac{1}{n}\mathop \sum \limits_{i = 1}^n {{|{logd_i^{pred} - logd_i^{gt}} |}^2}} }$$

(18)$${Accuracy = \frac{{N(max\left\{ {\frac{{d_i^{pred}}}{{d_i^{gt}}},\frac{{d_i^{gt}}}{{d_i^{pred}}}} \right\} < \delta )}}{n}}$$

(19)$${ATE = \frac{1}{{l - 1}}\mathop \sum \limits_{j = 1}^{l - 1} \sqrt {{{({\widehat {{x_j}} - {x_j}} )}^2} + {{({\widehat {{y_j}} - {y_j}} )}^2} + {{({\widehat {{z_j}} - {z_j}} )}^2}} }$$

where n is the number of pixels, $d_i^{gt}$ and $d_i^{pred}$ the true depth and predicted depth of the $i$th pixel respectively; $N({\cdot} )$ is the number of pixels that $max\left\{ {\frac{{d_i^{pred}}}{{d_i^{gt}}},\frac{{d_i^{gt}}}{{d_i^{pred}}}} \right\} < \delta $ (the threshold are $\mathrm{\delta } = 1.25,{1.25^2}$); S is the transformation matrix from the predicted pose P to the real pose Q calculated by the least square method; m is the frames of an endoscopy video; the subscript j represents an image frame; $\varDelta $ is the time interval; and $trac({\cdot} )$ is the translation part of the RPE.

The performance of the depth and motion estimation was evaluated by comparing the proposed method to two mainstream algorithms: Mono-depth [26], and SC-SFM [9]. The Mono-depth method proposed two loss functions to robustly handle the camera motion assumptions. The SC-SFM method based on nearby views from the target used depth and pose information to computed loss function. Our method uses feature descriptors to estimate image depth and motion. Meanwhile, the checkerboard artifact is avoided using the convolution layer and up-sampling layer in the encoder and decoder networks.

3.3 Feature descriptor and matching

The construction of high-dimensional feature description is easy to produce uneven overlap, and the multiple convolution operations will produce serious checkerboard effect in the edge region of descriptor graph, which will have a serious impact on feature descriptor extraction. Therefore, we adopt bilinear up-sampling method to achieve the effect of mapping from high-dimensional feature descriptor map to RGB image. We compared the extraction results of the proposed algorithm with those of two state-of-the-art feature description algorithms in three scenes of the AFR dataset, as shown in Fig. 4. The first raw is the original images of AFR datasets. The second raw is the result of DenseNet algorithm [36], which using the SGD optimizer and cyclic learning rate, where the learning rate varies in the range [1e-4,1e-3], and the experimental results obtain a 128-dimensional feature descriptor map. The third raw is the result of SAND algorithm [47], which using sparse relative labels to represent the similarity between image positions. The fourth raw are the result of the algorithm in this paper. It can be seen from the results that the network output results of the comparison method will produce serious checkerboard artifacts. The method in this paper generates a 32-dimensional feature descriptor map to avoid checkerboard effect and its matching effect exceeds the former.

Fig. 4. Visualization of dense feature descriptor map in RGB color space.

Download Full Size | PDF

Compared with the feature matching of different feature descriptor extraction methods, we extract feature descriptors from video sequences sampled at intervals of 20 frames from three data of AFR and perform intensive feature matching. To be specific, the matching of different pixel thresholds is firstly obtained by the circular uniform standard, then, camera pose of each frame provided by the data set is used to calculate the epipolar geometric error for all the matching. Intensive feature matching is performed in three data sets, as shown in Fig. 5. The first column represents the matching precision curve at different pixel thresholds by cyclic consistency standard, and the second column represents the matching recall curve at different pixel thresholds by cyclic consistency standard [36]. In the figure, the blue and yellow solid lines are the results of our method. The green and red dotted lines are the comparative experimental results, where, the (SIFT-based) means that SIFT algorithm is used to detect 1000 feature points in the source image. Then, these feature points are matched with circular consistent criteria. The (SIFT-free) means that 1000 feature points are randomly selected in the source image, and the circular consistent criteria are used to match features. The experimental results show that compared with accuracy curves, the proposed method can maintain a high matching accuracy. Compared with recall rate curves, both methods have obvious inflection points when the pixel threshold is 2 and continue to rise when the pixel threshold exceeds 3. Compared with the comparison methods, our method achieves higher recall rate.

Fig. 5. Quantitative comparison of feature matching between different feature descriptor extraction methods.

Download Full Size | PDF

To conduct qualitative analysis of the matching results, As shown in Fig. 6. The green connectors represent positive group matches, whilst the red connectors represent false matches. The first row, SIFT + NN indicates that the scale-invariant feature transform (SIFT) algorithm is used to extract feature points and carry out nearest neighbor matching. The second row and the third row respectively compare DenseNet method and our method for feature matching based on SIFT. At the same time, we conduct quantitative experiments of feature matching results for 5 groups of images in two kinds of datasets. In the matching of AFR dataset, the epipolar geometric error threshold is 1e-3. In the matching of EndoSLAM dataset, the epipolar geometric error threshold is 5e-4. As shown in Table 2. It can be seen from the experimental results that the SIFT algorithm has smaller amounts of matches and higher amounts of false matches in images with high similarity and texture-less. Both the comparison method and the proposed method can obtain relatively dense feature matching results in the two datasets. Our method can extract more feature points and obtain the optimal accuracy and Recall in the case of texture-less image. Compared with the comparison method, DenseNet method is more prone to false matching in dark and blurred motion regions. Our method can distinguish similar textures well and has higher stability.

Fig. 6. Qualitative comparison of feature matching results.

Download Full Size | PDF

Table 2. Quantitative comparison of feature matching results.

View Table | View all tables in this article

3.4 Evaluation on depth estimation

The robustness of image depth estimation will be greatly affected under different spatial depth differences and different illumination conditions. We verify the image depth estimation results of two common methods and the proposed method in three scenes. We synthesize the image luminosity information with dense feature descriptors, and estimate the image depth by geometric constraints of feature matching, as shown in Fig. 7. The first column is the original image, the second column is the ground truth of the image, the third column is the depth estimation result of Mono-depth [26], the fourth column is the depth estimation result of SC-SFM [9], the fifth column is the method result of appearance flow to the rescue (AFR) [47], and the sixth column is the method result of this study. The experimental results show that the Mono-depth method is insensitive to high illumination areas and produces a large error in areas with wide ranging depth variations. The SC-SFM method can generate more accurate depth information, but the distinction of boundary is vague. The AFR method can produce correct depth estimates, but is insensitive to depth changes over a small region. The proposed method can predict the depth of the region with large depth and we predict sharper edges at object boundaries.

Fig. 7. Qualitative comparison of depth estimates.

Download Full Size | PDF

We showed more evaluation results on more scenes in this paper as shown in Fig. 8 In each scene, the upper and lower rows are the original image and depth estimation results of proposed method respectively. The experimental results show that the proposed method can accurately estimate depth changes in continuous frame images, and the depth changes are continuous and smooth, and the depth estimation result remains unchanged in the images of the same region in different directions. The estimation results are not affected by image scale and rotation. The proposed method is robust to scenarios with different ranges of variation.

Meanwhile, we evaluate the error and accuracy of depth prediction of three methods on three test videos. In the experimental data set, the ground truth of depth is the point cloud data of the same size as the original image, we extract the z coordinate in it to obtain the ground truth of depth map for each pixel. We use the absolute relative error ($AbsRel$), relative error square ($SqRel$), root mean square error ($RMSE$), root mean square error log ($logRMSE$), depth and accuracy to assess the accuracy of the estimation results. In the experiment, the threshold values were $\mathrm{\delta } = 1.25,{1.25^2}$. Since there were no valid values in some region of the ground truth depth map, we only carried out depth evaluation in areas with valid depth values. In addition, due to the inherent scale ambiguity of monocular depth estimation, we scaled the predicted depth of each frame in proportion to the ground truth depth: $scale = median({{D_{gt}}} )/median({{D_{pred}}} )$. ${D_{gt}}$ is the ground truth depth, ${D_{pred}}$ is the predicted depth, $median(. )$ is calculated for the median value. Moreover, we conduct an ablation study of weight parameters of different losses

The Only-${\mathrm{{\cal L}}_{fc}}$ is the result of only using feature consistency loss function, and the Only-${\mathrm{{\cal L}}_{epi}}$ is the result only using epipolar consistency loss function. From the Table 3 results, the error result of AFR method is the best in scene 1, and this method can get better depth estimation in images with smaller depth variation range. Our approach outperforms other methods by a large margin on the other indices. General results can be obtained in the results using ${\mathrm{{\cal L}}_{fc}}$ or ${\mathrm{{\cal L}}_{epi}}$ loss function alone, and the optimal results can be obtained by using the multi-loss rebalancing function. Also, the performance of our model already surpasses most of previous methods.

Table 3. Depth estimates performance evaluation

View Table | View all tables in this article

3.5 Evaluation of motion estimation

Motion Estimation performance is evaluated by comparing the proposed method with three state of the art algorithms, In the training data, to avoid the accumulation of continuous frames at the same position, we verified the pose estimation method with the sampled at intervals of 3 frames. In the experiment, we will estimate the relative posture of the cameras in the three scenes. To obtain the motion track of the camera, we take the first frame of the video sequence as the starting frame, convert the relative position of adjacent frames into the absolute position of each frame in turn, and then align the predicted position trajectory with the real trajectory origin to form the motion trajectory result of the camera. As shown in Fig. 9. The blue line is the ground truth of endoscopy trajectory, the yellow line shows the motion predicted by the comparison method, the first row is the result of mono-depth method, the second row is the result of SC-SfM method, the third row is the result of AFR method, and the fourth row is the result of our method. The experimental results show that in scene 1, the endoscopy motion is flat and gentle. The Mono-depth method loses part of the frames, and the trajectory estimation was drift. In scenario 2, a large gap exists between the Mono-depth, SC-SFM and AFR methods with the ground truth trajectory, and the results of this method still well agree with the ground truth. In scene 3, where the endoscopy motion is complex, Mono-depth and SC-SFM methods produce large position drift phenomenon, and the proposed method can better restore the original motion trajectory of the endoscopy.

Fig. 8. Qualitative comparison of depth estimates.

Download Full Size | PDF

In addition, we also made a quantitative evaluation of the endoscopy motion trajectory of the three scenes in Fig. 8. Absolute trajectory error (ATE) and relative posture error (RPE) are used to evaluate the predicted global pose trajectory and true trajectory. Moreover, we conduct an ablation study of weight parameters of different losses. The Only-${\mathrm{{\cal L}}_{fc}}$ is the result of only using feature consistency loss function, and the Only-${\mathrm{{\cal L}}_{epi}}$ is the result only using epipolar consistency loss function. Table 4 shows that the proposed method has obvious advantages in ATE and RPE. Compared with Mono-depth, SC-SFM and AFR methods, In ATE and RPE, there is little difference between our results and those of AFR algorithm. In RPE (degree), the three scenes decreased by 42.1%,53.6%, and 50.2%, respectively. The comprehensive analysis showed that the multi-loss rebalancing function proposed in this paper can greatly improve the accuracy of motion estimation, and our algorithm has good estimation of motion trajectory ability in three scenes.

Table 4. Comparison of ATE and RPE.

View Table | View all tables in this article

3.6 3D reconstruction results

The 3D displays of three scenes endoscopy image obtained by different methods and the corresponding ground truth are presented in Fig. 10. In the experiment, we only calculate the error distance in the pixel region where the ground truth has valid values. It can be seen from the results. The 3D display predicted by Mono-depth method is show a large pixel offset at the edge, the SC-SfM method have scale drift phenomenon in the dark part of the light, the AFR method produces large errors in a region of great depth variation, and our method can show the 3D effect that is closest to the gold standard. Also, we calculate the distance from ground truth for different methods to measure the mean error. As show in the Table 5. Where, the average error of the Mono-depth, SC-SfM and AFR method is 9.003 ± 2.199mm, 7.978 ± 2.239mm and 8.312 ± 2.581mm, respectively. The average error of our method is 6.456 ± 1.798mm. Experimental results show that the AFR method achieves better results in scene 1 where the object surface structure is simple and the endoscopy motion is smooth. Our method is not different from those of the AFR method in scene 1, and the results of our experiment are reliable and robust in other scenes. The 3D reconstruction intuitively shows that our method was superior to the competing methods and can construct high-quality 3D tissue structures.

Fig. 9. Motion trajectory of endoscopy in different scenes.

Download Full Size | PDF

Fig. 10. 3D reconstruction results of endoscopy image.

Download Full Size | PDF

Table 5. 3D reconstruction error of endoscopy image.

View Table | View all tables in this article

4. Conclusion and discussion

This work presents a feature descriptors learning-based image depth and motion estimation with monocular endoscopy. The depth estimation network and motion estimation network are used to predict the depth of an endoscopy image and reconstruction the motion of an endoscopy camera, respectively. To avoid checkerboard artifacts in the feature descriptor learning proposed in this work, each convolution operation of bilinear interpolation is used. Epipolar constraints are then used in the depth estimation network to improve the accuracy of depth estimation. The multi-information joint-learning framework is used in the motion estimation network to estimate a larger camera translation. Experiments with publish datasets show that the proposed method can improve the accuracy of depth estimation of monocular endoscopy image and reconstruct more accurate endoscopy motion at the same time.

This study strengths the existing research in the MIS. Our method directly obtains accurate depth information and camera trajectories from the monocular endoscopy video without external tracking equipment, and the error is controlled within millimeters. By integrating the depth and motion estimation into the endoscopy surgery navigation system, the 3D reconstruction of the surgical scene performed by the endoscopy is achievable and able to more accurately feedback intuitive 3D visual assist surgeons during operations.

Funding

National Natural Science Foundation of China (61901031, 62025104, 62171039); Beijing Nova Program (Z201100006820004); National Key R&D Program of Zhejiang Province (2019C03009); Beijing Institute of Technology Research Fund Program for Young Scholars (2020CX04075).

Disclosures

The authors declare that there are no conflicts of interest related to this article.

Data availability

Data availability. Data underlying the results presented in this paper are available in Ref. [13] and [47].

References

1. C. Li, X. Gu, X. Xiao, C. M. Lim, and H. Ren, “A robotic system with multichannel flexible parallel manipulators for single port access surgery,” IEEE Trans. Indust. Inform. 15, 1678–1687 (2019). [CrossRef]

2. N. Mahmoud, T. Collins, A. Hostettler, L. Soler, C. Doignon, and J. M. M. Montiel, “Live tracking and dense reconstruction for handheld monocular endoscopy,” IEEE Trans. Med. Imaging 38(1), 79–89 (2019). [CrossRef]

3. Y. Chu, X. Li, X. Yang, D. Ai, Y. Huang, H. Song, Y. Jiang, Y. Wang, X. Chen, and J. Yang, “Perception enhancement using importance-driven hybrid rendering for augmented reality based endoscopic surgical navigation,” Biomed. Opt. Express 9(11), 5205–5226 (2018). [CrossRef]

4. J. Kim, H. Al Faruque, S. Kim, E. Kim, and J. Y. Hwang, “Multimodal endoscopic system based on multispectral and photometric stereo imaging and analysis,” Biomed. Opt. Express 10(5), 2289–2302 (2019). [CrossRef]

5. R. Mur-Artal and J. D. Tardós, “Orb-slam2: an open-source slam system for monocular, stereo, and RGB-D cameras,” IEEE Trans. Robot. 33(5), 1255–1262 (2017). [CrossRef]

6. K. L. Lurie, R. Angst, D. V. Zlatev, J. C. Liao, and A. K. E. Bowden, “3D reconstruction of cystoscopy videos for comprehensive bladder records,” Biomed. Opt. Express 8(4), 2106–2123 (2017). [CrossRef]

7. W. Zhou, E. Zhou, G. Liu, L. Lin, and A. Lumsdaine, “Unsupervised monocular depth estimation from light field image,” IEEE Trans. on Image Process. 29, 1606–1617 (2020). [CrossRef]

8. S. Lee, S. Shim, H.-G. Ha, H. Lee, and J. Hong, “Simultaneous optimization of patient–image registration and hand–eye calibration for accurate augmented reality in surgery,” IEEE Trans. Biomed. Eng. 67(9), 2669–2682 (2020). [CrossRef]

9. J.-W. Bian, Z. Li, N. Wang, H. Zhan, C. Shen, M.-M. Cheng, and I. Reid, “Unsupervised scale-consistent depth and ego-motion learning from monocular video,” in Thirty-third Conference on Neural Information Processing Systems, (2019).

10. M. Turan, Y. Almalioglu, H. Araujo, E. Konukoglu, and M. Sitti, “Deep endovo: A recurrent convolutional neural network (rcnn) based visual odometry approach for endoscopic capsule robots,” Neurocomputing 275, 1861–1870 (2018). [CrossRef]

11. Y. Zheng and M. S. Asif, “Joint image and depth estimation with mask-based lensless cameras,” IEEE Trans. Comput. Imaging 6, 1167–1178 (2020). [CrossRef]

12. L. Chen, W. Tang, N. W. John, T. R. Wan, and J. J. Zhang, “SLAM-based dense surface reconstruction in monocular minimally invasive surgery and its application to augmented reality,” Comput Methods Programs Biomed 158, 135–146 (2018). [CrossRef]

13. K. B. Ozyoruk, G. I. Gokceler, T. L. Bobrow, G. Coskun, K. Incetan, Y. Almalioglu, F. Mahmood, E. Curto, L. Perdigoto, and M. Oliveira, “EndoSLAM dataset and an unsupervised monocular visual odometry and depth estimation approach for endoscopic videos,” Med. Image Anal. 71, 102058 (2021). [CrossRef]

14. L. Maier-Hein, A. Groch, A. Bartoli, S. Bodenstedt, G. Boissonnat, P.-L. Chang, N. Clancy, D. S. Elson, S. Haase, and E. Heim, “Comparative validation of single-shot optical techniques for laparoscopic 3-D surface reconstruction,” IEEE Trans. Med. Imaging 33(10), 1913–1930 (2014). [CrossRef]

15. J. Lin, N. T. Clancy, J. Qi, Y. Hu, T. Tatla, D. Stoyanov, L. Maier-Hein, and D. S. Elson, “Dual-modality endoscopic probe for tissue surface shape reconstruction and hyperspectral imaging enabled by deep neural networks,” Med. Image Anal. 48, 162–176 (2018). [CrossRef]

16. H. Zhan, R. Garg, C. S. Weerasekera, K. Li, H. Agarwal, and I. Reid, “Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018), 340–349.

17. H. Yao, R. W. Stidham, Z. Gao, J. Gryak, and K. Najarian, “Motion-based camera localization system in colonoscopy videos,” Med. Image Anal. 73, 102180 (2021). [CrossRef]

18. L. Xu, J. Li, Y. Hao, P. Zhang, G. Ciuti, P. Dario, and Q. Huang, “Depth estimation for local colon structure in monocular capsule endoscopy based on brightness and camera motion,” Robotica 39(2), 334–345 (2021). [CrossRef]

19. A. R. Widya, Y. Monno, M. Okutomi, S. Suzuki, T. Gotoda, and K. Miki, “Whole stomach 3D reconstruction and frame localization from monocular endoscope video,” IEEE J. Transl. Eng. Health Med. 7, 1–10 (2019). [CrossRef]

20. A. R. Widya, Y. Monno, K. Imahori, M. Okutomi, S. Suzuki, T. Gotoda, and K. Miki, “3D reconstruction of whole stomach from endoscope video using structure-from-motion,” in 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) (IEEE, 2019), 3900–3904.

21. F. Mahmood, R. Chen, and N. Durr, “Unsupervised reverse domain adaptation for synthetic medical images via adversarial training,” IEEE Trans. Med. Imaging 37(12), 2572–2581 (2018). [CrossRef]

22. S. Leonard, A. Sinha, A. Reiter, M. Ishii, G. L. Gallia, R. H. Taylor, and G. D. Hager, “Evaluation and stability analysis of video-based navigation system for functional endoscopic sinus surgery on in vivo clinical data,” IEEE Trans. Med. Imaging 37(10), 2185–2195 (2018). [CrossRef]

23. S. Wang, R. Clark, H. Wen, and N. Trigoni, “Deepvo: Towards end-to-end visual odometry with deep recurrent convolutional neural networks,” in 2017 IEEE International Conference on Robotics and Automation (ICRA), (IEEE, 2017), 2043–2050.

24. M. Turan, E. P. Ornek, N. Ibrahimli, C. Giracoglu, Y. Almalioglu, M. F. Yanik, and M. Sitti, “Unsupervised odometry and depth learning for endoscopic capsule robots,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE, 2018), 1801–1807.

25. M. Shen, Y. Gu, N. Liu, and G.-Z. Yang, “Context-aware depth and pose estimation for bronchoscopic navigation,” IEEE Robot. Autom. Lett. 4(2), 732–739 (2019). [CrossRef]

26. C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow, “Digging into self-supervised monocular depth estimation,” in (Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019), 3828–3838.

27. R. Ma, R. Wang, S. Pizer, J. Rosenman, and J. M. Frahm, Real-Time 3D Reconstruction of Colonoscopic Surfaces for Determining Missing Regions (Medical Image Computing and Computer Assisted Intervention – MICCAI 2019, 2019).

28. C. Wang, J. M. Buenaposada, R. Zhu, and S. Lucey, “Learning depth from monocular videos using direct methods,” in (Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018), 2022–2030.

29. L. Li, X. Li, S. Yang, S. Ding, A. Jolfaei, and X. Zheng, “Unsupervised-learning-based continuous depth and motion estimation with monocular endoscopy for virtual reality minimally invasive surgery,” IEEE Trans. Ind. Inf. 17(6), 3920–3928 (2021). [CrossRef]

30. G. A. Puerto-Souza, J. A. Cadeddu, and G. L. Mariottini, “Toward long-term and accurate augmented-reality for monocular endoscopic videos,” IEEE Trans. Biomed. Eng. 61(10), 2609–2620 (2014). [CrossRef]

31. F. Mahmood and N. J. Durr, “Deep learning and conditional random fields-based depth estimation and topographical reconstruction from conventional endoscopy,” Med. Image Anal. 48, 230–243 (2018). [CrossRef]

32. X. Liu, A. Sinha, M. Ishii, G. D. Hager, A. Reiter, R. H. Taylor, and M. Unberath, “Dense depth estimation in monocular endoscopy with self-supervised learning methods,” IEEE Trans. Med. Imaging 39(5), 1438–1447 (2020). [CrossRef]

33. D. Neumann, S. Grbic, M. John, N. Nav Ab, J. Hornegger, and R. Ionasec, “Probabilistic sparse matching for robust 3D/3D fusion in minimally invasive surgery,” IEEE Trans. Med. Imaging 34(1), 49–60 (2015). [CrossRef]

34. C. Sui, J. Wu, Z. Wang, G. Ma, and Y.-H. Liu, “A real-time 3D laparoscopic imaging system: design, method, and validation,” IEEE Trans. Biomed. Eng. 67(9), 2683–2695 (2020). [CrossRef]

35. J. Ma, J. Zhao, J. Tian, A. L. Yuille, and Z. Tu, “Robust point matching via vector field consensus,” IEEE Trans. on Image Process. 23(4), 1706–1721 (2014). [CrossRef]

36. X. Liu, Y. Zheng, B. Killeen, M. Ishii, G. D. Hager, R. H. Taylor, and M. Unberath, “Extremely dense point correspondences using a learned feature descriptor,” in (Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020), 4847–4856.

37. M. Dusmanu, I. Rocco, T. Pajdla, M. Pollefeys, J. Sivic, A. Torii, and T. Sattler, “D2-net: A trainable cnn for joint description and detection of local features,” in (Proceedings of the IEEE/CVF Conference On Computer Vision And Pattern Recognition, 2019), 8092–8101.

38. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in (Proceedings of the IEEE Conference On Computer Vision And Pattern Recognition, 2016), 770–778.

39. H. Luo, C. Wang, X. Duan, H. Liu, P. Wang, Q. Hu, and F. Jia, “Unsupervised learning of depth estimation from imperfect rectified stereo laparoscopic images,” Computers in biology medicine 140, 105109 (2022). [CrossRef]

40. K. İncetan, I. O. Celik, A. Obeid, G. I. Gokceler, K. B. Ozyoruk, Y. Almalioglu, R. J. Chen, F. Mahmood, H. Gilbert, and N. J. Durr, “VR-Caps: a virtual environment for capsule endoscopy,” Med. Image Anal. 70, 101990 (2021). [CrossRef]

41. I. N. Figueiredo, C. Leal, L. Pinto, P. N. Figueiredo, and R. Tsai, “Hybrid multiscale affine and elastic image registration approach towards wireless capsule endoscope localization,” Biomedical Signal Processing Control 39, 486–502 (2018). [CrossRef]

42. G. Dimas, E. Spyrou, D. K. Iakovidis, and A. Koulaouzidis, “Intelligent visual localization of wireless capsule endoscopes enhanced by color information,” Computers in biology medicine 89, 429–440 (2017). [CrossRef]

43. G. Bao, K. Pahlavan, and L. Mi, “Hybrid localization of microrobotic endoscopic capsule inside small intestine by data fusion of vision and RF sensors,” IEEE Sens. J. 15(5), 2669–2678 (2015). [CrossRef]

44. F. Mahmood and N. J. Durr, “Deep learning-based depth estimation from a synthetic endoscopy image training set,” in Medical Imaging 2018: Image Processing, (International Society for Optics and Photonics, 2018), 1057421.

45. A. Banach, F. King, F. Masaki, H. Tsukada, and N. Hata, “Visually navigated bronchoscopy using three cycle-consistent generative adversarial network for depth estimation,” Med. Image Anal. 73, 102164 (2021). [CrossRef]

46. X. Ban, H. Wang, T. Chen, Y. Wang, and Y. Xiao, “Monocular visual odometry based on depth and optical flow using deep learning,” IEEE Trans. Instrum. Meas. 70, 1–19 (2021). [CrossRef]

47. S. Shao, Z. Pei, W. Chen, W. Zhu, X. Wu, D. Sun, and B. Zhang, “Self-Supervised monocular depth and ego-Motion estimation in endoscopy: appearance flow to the rescue,” Medical image analysis, 102338 (2021).

Dataset	Category	Ground truth	Describe
AFR [47]	Abdominal	A structured light camera records image depth. A robotic arm records motion.	Contains 7 different scenario datasets of surface, image resolution is 1280*1024.
EndoSLAM [13]	Stomach, colon, small intestine	3D scanners and CT reconstruction models records image depth. A robotic arm records motion.	Contains 14 different scenario datasets of internal surface, image resolution is 320320 and 640480.

	AFR			EndoSLAM
		Scene1	Scene2	Scene3	Scene1	Scene2
SIFT + NN	Matches	486	488	433	353	349
	Inliers	397	416	316	166	162
	Precision	81.7%	85.2%	73.0%	47.0%	46.4%
	Recall	39.7%	41.6%	31.6%	16.6	16.2%
DenseNet (SIFT)	Matches	823	806	609	815	695
	Inliers	818	789	561	803	683
	Precision	99.4%	97.9%	92.1%	98.5%	98.3%
	Recall	81.8%	78.9%	56.1%	80.3%	68.3%
Ours (SIFT)	Matches	967	954	704	936	722
	Inliers	967	953	699	936	722
	Precision	100.0%	99.9%	99.3%	100.0%	100.0%
	Recall	96.7%	95.3%	69.9	93.6%	72.2%

Methods	Scene	Error				Accuracy
Methods	Scene	AbsRel	SqRel	RMSE	logRMSE	$δ = 1.25$	$δ = {1.25}^{2}$
Mono-depth	1	0.071	0.742	7.308	0.112	0.921	0.998
	2	0.106	1.769	13.165	0.137	0.856	1
	3	0.125	1.686	11.64	0.152	0.844	0.995
SC-SfM	1	0.037	0.222	4.401	0.054	0.996	1
	2	0.104	1.622	12.281	0.128	0.875	1
	3	0.123	1.586	11.143	0.148	0.875	0.995
AFR	1	0.032	0.143	3.502	0.043	0.998	1
	2	0.112	2.111	14.365	0.149	0.823	0.997
	3	0.122	1.641	11.57	0.149	0.856	0.998
Only- $L_{f c}$	1	0.030	0.161	3.779	0.045	0.997	1
	2	0.102	1.539	11.747	0.123	0.894	1
	3	0.139	1.939	12.066	0.168	0.799	0.991
Only- $L_{e p i}$	1	0.032	0.173	3.873	0.048	0.997	1
	2	0.093	1.352	10.878	0.115	0.916	1
	3	0.123	1.491	10.169	0.149	0.859	0.993
Ours	1	0.036	0.2	4.061	0.052	0.998	1
	2	0.08	1.074	9.91	0.103	0.94	1
	3	0.099	1.052	8.334	0.125	0.915	0.995

Methods	ATE(mm)			RPE(mm)			RPE(degree)
Methods	Scene1	Scene2	Scene3	Scene1	Scene2	Scene3	Scene1	Scene2	Scene3
Mono-depth	2.023	6.523	9.299	0.687	0.767	0.796	0.714	0.723	0.704
SC-SfM	1.123	3.521	7.064	0.239	0.299	0.447	0.278	0.269	0.307
AFR	0.398	2.024	2.562	0.167	0.301	0.261	0.435	0.577	0.608
Only- $L_{f c}$	0.830	4.12	7.103	0.237	0.349	0.466	0.305	0.302	0.318
Only- $L_{e p i}$	0.948	3.966	4.986	0.304	0.493	0.438	0.292	0.290	0.311
Ours	0.529	2.307	1.566	0.217	0.295	0.354	0.252	0.268	0.303

	Mono-depth	SC-SfM	AFR	Ours
Scene1	6.044 ± 0.333	3.798 ± 0.236	3.235 ± 0.243	3.645 ± 0.276
Scene2	11.312 ± 4.287	10.795 ± 4.537	12.039 ± 4.901	8.769 ± 3.808
Scene3	9.654 ± 1.977	9.340 ± 1.946	9.662 ± 2.600	6.953 ± 1.310

Joint estimation of depth and motion from a monocular endoscopy image sequence using a multi-loss rebalancing network

Abstract

1. Introduction

2. Method

2.1 Network architecture

2.2 Dense descriptor extraction

2.3 Depth and motion estimation

2.4 Loss function and implementation

3. Experiments and results

3.1 Datasets and experiment

3.2 Performance evaluation

3.3 Feature descriptor and matching

3.4 Evaluation on depth estimation

3.5 Evaluation of motion estimation

3.6 3D reconstruction results

4. Conclusion and discussion

Funding

Disclosures

Data availability

References

Data availability

Cited By

Figures (10)

Tables (5)

Equations (19)

Biomedical Optics Express