
Reconstruction method suitable for fast CT imaging

Open Access

Abstract

Reconstructing computed tomography (CT) images from an extremely limited set of projections is crucial in practical applications. As the available projections decrease significantly, traditional reconstruction and model-based iterative reconstruction methods become constrained. This work aims to seek a reconstruction method applicable to fast CT imaging when the available projections are highly sparse. To minimize the time and cost associated with projection acquisition, we propose a deep learning model, X-CTReNet, which parameterizes a nonlinear mapping function from orthogonal projections to CT volumes for 3D reconstruction. The proposed model demonstrates effective capability in inferring CT volumes from two-view projections compared to baseline methods, highlighting its significant potential for drastically reducing projection acquisition in fast CT imaging.

© 2024 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Computed tomography (CT) imaging is commonly formulated as an inverse problem of reconstructing images of an unknown object from projected observations. These projections are obtained at different views around the object. To reconstruct high-fidelity, high-quality images, dense sampling in the measurement space is required [1] to satisfy the Shannon-Nyquist theorem [2]. However, practical challenges arise in CT because acquiring projections is time-consuming. In several applications, acquiring numerous projections through the object may be undesirable or impossible. This is often the case in industrial application scenarios where only extremely limited views are possible. There is therefore a compelling need to reconstruct CT images from highly sparse sampled data to expedite the imaging process.

For hundreds of X-ray projections, standard reconstruction algorithms such as filtered back projection [3] or iterative reconstruction [4,5] can accurately reconstruct a CT volume. However, for ultra-sparse sampling, the inverse problem of CT reconstruction becomes severely ill-posed because the completeness of the measurement data is violated. Ultra-sparse X-ray projections may not offer sufficient information to precisely reconstruct images of the unknown object, posing significant challenges for fast CT imaging.

It seems inherently unsolvable if we attempt to look for universal solutions with traditional CT reconstruction algorithms. A widely adopted approach is to introduce compressed sensing (CS) theory [6] into sparse-view CT reconstruction [7–11], optimizing a combination of a data-fidelity term and a regularization term. The data-fidelity term constrains the estimation error between the reconstructed CT images and the measurements, for example via the penalized weighted least-squares (PWLS) [12] and maximum-likelihood (ML) [13] criteria. The regularization term captures prior information about the object, such as total variation (TV) [14] and low rank [15]. Compared to CT reconstruction with complete data, sparse-view reconstruction reduces the dependence on the quantity of input projections to a certain extent. In practice, however, the degree of sparse sampling remains constrained if imaging quality is to stay uncompromised. This motivates us to develop a more efficient and robust reconstruction method that can work with ultra-sparse data.

The task involves 2D-3D reconstruction, which is more challenging than 3D-3D reconstruction, as it necessitates the recovery of 3D information from extremely sparse 2D data. Data-driven deep learning methods [16] can represent features of 3D structures in a standardized manner and infer corresponding 3D volumes from 2D X-ray projections. Recently, end-to-end supervised deep neural networks based on single-view projections have been employed to generate structurally consistent 3D information in cases of irreversible imaging with incomplete information. Henzler et al [17] utilized an encoder-decoder framework to reconstruct 3D volumes of the skulls of different mammal species from a single X-ray. Wang et al [18] proposed DeepOrganNet, based on a trivariate tensor-product deformation technique, to generate high-fidelity 3D/4D organ manifold meshes from single-view medical images. Shen et al [19] showed that PatRecon, a deep-learning model based on an encoder-decoder and a transformation module, trained to map projection radiographs of a patient to the corresponding 3D anatomy, can subsequently generate 3D CT volumes of the patient from a single view. Song et al [20] proposed a GAN (Generative Adversarial Network) framework, named Oral-3D, to reconstruct the 3D oral cavity from a single panoramic X-ray view and prior information of the dental arch. Lei et al [21] proposed a novel GAN model integrated with perceptual supervision to derive instantaneous volumetric images from a single 2D projection. Tan et al [22] presented a deep-learning framework that introduces attention and multi-scale fusion, namely XctNet, to gain prior knowledge from a single 2D X-ray image and produce volumetric data. Shao et al [23] developed a deformation-driven, deep learning-based framework to reconstruct a 3D mesh for tracking liver motion and localizing liver tumors in 3D and in real time, combining the complementary information from optical surface imaging and a single on-board X-ray projection. These deep learning methods focus on learning the mapping from X-ray to CT from a large training set, enabling fast reconstruction of single-view CT under ultra-sparse sampling.

For structurally complex objects with dense internal distributions, a single projection is insufficient to separate all features along the projection direction for subsequent 3D volume reconstruction, leading to occlusion of projection information. To alleviate this problem, 3D reconstruction can be carried out using projections from two orthogonal views. This approach provides complementary information to further enhance the reconstruction of a higher-quality CT volume. Ying et al [24] proposed to reconstruct CT anatomy from two orthogonal synthesized X-rays using a GAN framework, X2CT-GAN, combined with a novel feature fusion method. Ratul et al [25] designed CCX-rayNet, a GAN-based architecture for bi-planar X-ray reconstruction, which applies prior semantic constraints in 3D CT reconstruction. Bayat et al [26] presented a vertebra-wise fully convolutional network architecture, termed TransVert, which fuses orthogonal 2D radiographs and corresponding annotation images and infers the spine's 3D posture, supervised by 3D vertebral masks. Ge et al [27] proposed X-CTRSNet, which simultaneously and accurately performs 3D cervical-vertebra CT reconstruction and segmentation directly from orthogonal anteroposterior- and lateral-view 2D X-ray images. Chen et al [28] effectively achieved 3D spine reconstruction from bi-planar X-ray images using BX2S-Net, a dimensionally consistent encoder-decoder architecture combined with a dimensionality enhancement method. Gao et al [29] designed a deep GAN model, named 3DSRNet, for converting 2D spinal X-ray images into 3D spinal CT images, and adopted a transformer [30] to enhance the learning ability of the generator network.

Inspired by the aforementioned work, we have developed a data-driven 3D reconstruction network for fast CT imaging under ultra-sparse sampling, termed X-CTReNet. The proposed model is based on a GAN architecture. The generator of X-CTReNet is composed of three elements: the Preliminary Feature Mapping Network (PFMNet), the Encoding-Decoding Network (EDNet), and the Sparse Attention Fusion Network (SAFNet). X-ray projections from two orthogonal views are fed into the PFMNet and EDNet working in parallel. During the decoding stage, the SAFNet is crafted to focus on the most contributive features, preventing irrelevant information from interfering with the fusion of the two-view features, in order to reconstruct a high-quality CT volume. X-CTReNet combines the PFMNet for coarse mapping of 2D X-rays to 3D CT, EDNet for refined mapping of 2D X-rays to 3D CT, and SAFNet for 3D reconstruction. Through end-to-end learning and optimization, a deep neural network is constructed for cross-dimensional feature mapping from 2D X-ray to 3D CT, simplifying X-ray-based volume reconstruction by conducting volume inference on a collection of 2D X-rays.

To summarize, we make the following contributions:

  • (1) A novel deep learning-based X-CTReNet is designed to perform 3D reconstruction using 2D orthogonal X-ray projections. The model demonstrates superior performance compared to baseline methods, showcasing significant potential for application in fast CT imaging scenarios.
  • (2) To address the semantic gap in feature fusion, an Encoding-Decoding Fusion Block (EDFB) is introduced. EDFB is employed to avoid unreasonable predictions caused by the direct fusion of low-level and high-level features, aiming to facilitate a more nuanced deep feature fusion effect.
  • (3) During the decoding phase of feature reconstruction, a customized Sparse Attention Fusion Block (SAFB) based on a top-k mechanism is designed. The incorporation of the top-k mechanism ensures that the network prioritizes and attends to the most meaningful features, contributing to an improvement in the overall reconstruction quality.

We give a detailed introduction to the 2D-3D reconstruction framework and the different components used during training and inference in Section 2. In Section 3, experiments are performed to validate the effectiveness of the proposed method. The discussion and conclusions are provided in Section 4.

2. Method

Although the proposed reconstruction task is an ill-posed problem, leveraging prior knowledge learned from a vast amount of data through deep learning allows us to establish a mapping relationship between 2D X-ray projections and 3D volumes. The proposed method targets a cross-dimensional reconstruction task for fast CT imaging between two-view X-ray projections and CT images. To address this CT reconstruction problem, a GAN-based model, X-CTReNet, is proposed. The generator and discriminator engage in a competitive process during the training phase, leading to a situation where the discriminator struggles to differentiate between real data samples and those generated by the generator. This adversarial training strategy is advantageous for achieving higher-quality reconstructed CT images.

2.1 Generator

The generator of the X-CTReNet model adopts an encoder-decoder network architecture, specifically designed to perform non-trivial inference from 2D inputs to generate 3D output. As shown in Fig. 1, the generator consists of three modules: PFMNet, EDNet, and SAFNet. First, two X-ray projections from orthogonal views concurrently enter the PFMNet, which performs a preliminary exploration of information in the third dimension. This facilitates a shallow cross-dimensional mapping from the 2D to the 3D feature space. Subsequently, two EDNets with identical architectures are responsible for two-view feature extraction and reconstruction, facilitating the learning of a deep cross-dimensional mapping. Finally, SAFNet is utilized to fuse the two-view information and reconstruct a more refined CT volume. The intuition behind transmitting information about the third dimension from 2D to 3D involves a coarse-to-refined process, i.e., exploring information in the third dimension by increasing the number of channels, then reducing the spatial dimension of the input to high-level coarse features using an encoder, and finally recovering the spatial dimension by refining the coarse features through a decoder to further reconstruct 3D information.

Fig. 1. The generator workflow structure of X-CTReNet. The section enclosed by the light blue box represents the Preliminary Feature Mapping Network (PFMNet), the portion enclosed by the pale yellow box represents the Encoding-Decoding Network (EDNet), and the area enclosed by the light pink box represents the Sparse Attention Fusion Network (SAFNet).

2.1.1 PFMNet

Typically, neural networks are designed to map from 2D to 2D or from 3D to 3D. However, in our case, the requirement is unique as it involves mapping from 2D to 3D. In the task of reconstructing a 3D CT, the crucial aspect lies in the conversion from 2D features to 3D features. Most approaches employ a transformation module, reshaping 2D feature maps into 3D feature maps, and utilizing 3D convolutions in subsequent decoding stages for feature reconstruction, as demonstrated in works such as [19,24]. However, this practice inevitably leads to a significant increase in parameters and GPU memory consumption, severely constraining resolution and speed in the reconstruction process.

In our work, we address this challenge by progressively increasing the number of feature map channels, embedding 3D information into different channels for the preliminary mapping from 2D projection domain features to 3D spatial domain features. This allows for a reduction in the dimensionality of convolutional operations without compromising image quality, leading to decreased computational efforts.

As shown in Fig. 1, PFMNet takes two-view projections as input and concurrently passes through three cascaded dense blocks to explore depth information. Throughout this process, the channel count of the feature maps is sequentially set to 32→64→128, with 128 channels being retained. The introduction of dense blocks aims to integrate feature information across different channels, enhancing the model's parameter efficiency and compactness to prevent the potential loss of information between different feature maps.
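To make the channel-as-depth idea concrete, the following is a minimal PyTorch sketch of a PFMNet-like module built from three cascaded dense blocks that expand the channel count to 32→64→128. The growth rate, number of layers per block, and activation choices are illustrative assumptions, not the exact configuration given in Table 2.

```python
# Minimal sketch of the PFMNet channel-expansion idea (growth rate, layer counts,
# and activations are illustrative assumptions, not the paper's exact Table 2 setup).
import torch
import torch.nn as nn

class DenseBlock2d(nn.Module):
    """Dense block: each layer sees all previous feature maps; a 1x1 conv fuses them."""
    def __init__(self, in_ch, out_ch, growth=32, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(ch, growth, kernel_size=3, padding=1),
                nn.ReLU(inplace=True)))
            ch += growth
        self.compress = nn.Conv2d(ch, out_ch, kernel_size=1)  # final 1x1 fusion conv

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return self.compress(torch.cat(feats, dim=1))

class PFMNet(nn.Module):
    """Three cascaded dense blocks expanding channels 1 -> 32 -> 64 -> 128."""
    def __init__(self):
        super().__init__()
        self.blocks = nn.Sequential(
            DenseBlock2d(1, 32), DenseBlock2d(32, 64), DenseBlock2d(64, 128))

    def forward(self, x):       # x: (B, 1, 128, 128) single-view projection
        return self.blocks(x)   # (B, 128, 128, 128): channels carry depth information
```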

2.1.2 EDNet

The two parallel EDNets share the same architecture, with the encoding network comprising four downsampling blocks and the decoding network comprising four upsampling blocks. Residual learning is introduced within the upsampling and downsampling blocks to expedite the training of the deep network and mitigate gradient vanishing. Throughout the encoding-decoding stages, the feature map channel count remains constant, facilitating the forwarding of depth features. As shown in Fig. 1, the downsampling block halves the feature map size to obtain a lower-dimensional representation, with the aim of extracting as many low- and high-level features as possible. The upsampling block recovers the feature map size and fuses the features extracted during the encoding process to complete the feature reconstruction while minimizing information loss.
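The following is a minimal sketch of one such downsampling/upsampling stage with residual learning, assuming 2D operations at a constant channel count of 128; the internal layer arrangement is an illustrative assumption rather than the exact Table 2 configuration.

```python
# Sketch of EDNet building blocks (assumed internals; channel count kept constant at 128).
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Two 3x3 convolutions with an identity shortcut to ease deep training."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return torch.relu(x + self.body(x))

class Down(nn.Module):
    """Halves the feature map size: 2x2 max pooling followed by a residual block."""
    def __init__(self, ch=128):
        super().__init__()
        self.pool, self.res = nn.MaxPool2d(2), ResBlock(ch)

    def forward(self, x):
        return self.res(self.pool(x))

class Up(nn.Module):
    """Doubles the feature map size with a 3x3, stride-2 deconvolution, then refines."""
    def __init__(self, ch=128):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(ch, ch, 3, stride=2,
                                         padding=1, output_padding=1)
        self.res = ResBlock(ch)

    def forward(self, x):
        return self.res(self.deconv(x))
```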

To enhance the integration of global and local features, skip connections are often employed to bridge the encoder and decoder. However, skip connections introduce a challenge due to the semantic gap between low-level and high-level features, as low-level feature maps may contain irrelevant and ambiguous details. Motivated by [31], we introduce the Encoding-Decoding Fusion Block (EDFB). EDFB utilizes the attention mechanism to promote the fusion of high-level semantic features and low-level fine-grained features while discarding irrelevant information, thereby achieving a more sophisticated deep feature fusion.

As illustrated in Fig. 2, low-level features are used to calculate the spatial attention, while high-level features are used to calculate the channel attention. Channel attention, being more task-relevant and containing richer semantic information [31], is crucial for guiding spatial attention. Therefore, we leverage channel attention based on high-level semantic information to aid in generating spatial attention. It helps select pertinent spatial information to some extent and further guides the fusion of high- and low-level feature maps.
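As a rough illustration of this guidance scheme, the sketch below reweights the low-level skip features with channel attention derived from the high-level decoder features, then applies spatial attention before a 1 × 1 fusion convolution. The exact wiring, reduction ratio, and kernel sizes are assumptions; only the overall channel-guides-spatial idea follows the text.

```python
# Hedged sketch of the EDFB fusion idea (reduction ratio and kernel sizes are assumptions).
import torch
import torch.nn as nn

class EDFB(nn.Module):
    def __init__(self, ch=128):
        super().__init__()
        # channel attention computed from high-level (decoder) features
        self.ca = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // 8, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // 8, ch, 1), nn.Sigmoid())
        # spatial attention computed from the (channel-reweighted) low-level features
        self.sa = nn.Sequential(nn.Conv2d(ch, 1, kernel_size=7, padding=3), nn.Sigmoid())
        self.fuse = nn.Conv2d(2 * ch, ch, kernel_size=1)

    def forward(self, low, high):      # low: encoder skip, high: decoder feature (same size)
        low = low * self.ca(high)      # channel attention from semantics guides the skip
        low = low * self.sa(low)       # spatial attention suppresses irrelevant details
        return self.fuse(torch.cat([low, high], dim=1))
```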

Fig. 2. The low-level features correspond to the spatial attention part and the high-level features correspond to the channel attention part.

2.1.3 SAFNet

After undergoing the coarse cross-dimensional mapping, the feature maps of the two orthogonal views are fed into the parallel EDNets. To flexibly integrate the feature information from the two views for optimized feature extraction, inspired by the application of top-k selection [32,33] in the vision Transformer domain, we propose the SAFNet based on top-k sparse attention. On one hand, the SAFB avoids irrational predictions caused by direct fusion. On the other hand, sparse attention directs the network to focus more on the fusion of useful 3D features from the two views, resulting in a more effective reconstruction of CT volumes.

The SAFB based on top-k selection is illustrated in Fig. 3. Specifically, features ${X_1}$ and ${X_2}$ from the two views, where ${X_1},{X_2} \in {{\mathbb R}^{H \times W \times C}}$, are concatenated along the channel dimension to form $Y \in {{\mathbb R}^{H \times W \times 2C}}$. We first encode channel-wise context by applying 1 × 1 convolutions followed by 3 × 3 depth-wise convolutions, resulting in ${I_1}$, ${I_2}$, ${I_3}$ (${I_1},{I_2},{I_3} \in {{\mathbb R}^{H \times W \times C}}$). Subsequently, the transposed-attention map $A \in {{\mathbb R}^{C \times C}}$ is obtained through reshape and matrix multiplication operations. Assuming that higher scores indicate higher relevance, the model evaluates the values of the score matrix $A$. A sparse attention masking operation $M({\cdot} )$ is applied to $A$, selecting the top-k contributing elements. Specifically, only the top-k maximum elements in each row of $A$ are retained, while the remaining elements are set to $-\infty$ so that their scores become 0 after the softmax computation. Here, $M({\cdot} )$ represents a learnable top-k selection operator:

$$M(A,k)_{ij} = \begin{cases} A_{ij}, & \text{if } A_{ij} \ge t_j \\ -\infty, & \text{if } A_{ij} < t_j \end{cases}, \qquad t_j:\ k\text{-th largest value of row } j,$$

Fig. 3. Sparse Attention Fusion Block (SAFB).

Ultimately, the output fused feature map is as follows:

$$\mathrm{SparseAtt}(I_1, I_2, I_3) = \mathrm{softmax}\big(M(A,k)\big)\,\mathrm{Reshape}(I_3).$$

This dynamic selection shifts the attention from dense to sparse, achieving adaptive selection of the top-k contributing scores in $A$. The objective is to preserve the most attention-worthy portions and to avoid interference from irrelevant information during the feature fusion process.
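A minimal PyTorch sketch of this top-k sparse transposed attention is given below; it assumes $I_1$ acts as the query, $I_2$ as the key, and $I_3$ as the value, and the choices of k and the scaling factor are illustrative, not values reported in the paper.

```python
# Hedged sketch of SAFB top-k sparse attention over the channel (transposed) dimension.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SAFB(nn.Module):
    def __init__(self, ch=128, k=64):
        super().__init__()
        self.k = k
        self.proj = nn.ModuleList(nn.Sequential(
            nn.Conv2d(2 * ch, ch, kernel_size=1),                     # channel-wise context
            nn.Conv2d(ch, ch, kernel_size=3, padding=1, groups=ch))   # depth-wise conv
            for _ in range(3))
        self.out = nn.Conv2d(ch, ch, kernel_size=1)

    def forward(self, x1, x2):                            # two-view features (B, C, H, W)
        y = torch.cat([x1, x2], dim=1)                    # (B, 2C, H, W)
        b, _, h, w = y.shape
        i1, i2, i3 = [p(y).flatten(2) for p in self.proj] # each (B, C, H*W)
        a = i1 @ i2.transpose(1, 2) / (h * w) ** 0.5      # transposed attention map (B, C, C)
        thresh = a.topk(self.k, dim=-1).values[..., -1:]  # k-th largest score per row
        a = a.masked_fill(a < thresh, float("-inf"))      # mask sub-top-k scores
        fused = F.softmax(a, dim=-1) @ i3                 # (B, C, H*W)
        return self.out(fused.view(b, -1, h, w))
```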

2.2 Discriminator

PatchGAN [34] has been used frequently in recent works [24,27] due to its good generalization properties. We also adopt the PatchGAN discriminator to distinguish real from fake 3D volumes. Each discriminator stage consists of three modules, i.e., a 3D convolution, an instance normalization layer, and an activation function. Three such stages with kernel size 3 are followed by a final 3D convolution layer with kernel size 1.
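The sketch below illustrates such a 3D PatchGAN discriminator; the strides and channel widths are assumptions, while the three (convolution, instance normalization, activation) stages with kernel size 3 and the final kernel-size-1 convolution follow the description above.

```python
# Hedged sketch of the 3D PatchGAN discriminator (strides and widths are assumptions).
import torch
import torch.nn as nn

def patchgan_3d(base_ch=32):
    layers, in_ch = [], 1
    for i in range(3):
        out_ch = base_ch * 2 ** i
        layers += [nn.Conv3d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
                   nn.InstanceNorm3d(out_ch),
                   nn.LeakyReLU(0.2, inplace=True)]
        in_ch = out_ch
    layers.append(nn.Conv3d(in_ch, 1, kernel_size=1))  # per-patch real/fake score map
    return nn.Sequential(*layers)

D = patchgan_3d()
scores = D(torch.randn(1, 1, 128, 128, 128))  # (1, 1, 16, 16, 16) patch scores
```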

2.3 Loss function

The objective of the proposed network is to ensure semantic consistency between the predicted CT volume and the X-ray projections. To efficiently and stably train the 3D CT reconstruction model, we utilize four loss functions to constrain X-CTReNet: adversarial loss, reconstruction loss, projection loss, and structure similarity index measure (SSIM) loss.

2.3.1 Adversarial loss

We adopt the loss function of the least-squares GAN (LSGAN) [35] as the adversarial loss of X-CTReNet. The classical GAN typically uses sigmoid cross-entropy as its objective function, which makes vanishing gradients a potential problem. LSGAN replaces this objective with a least-squares loss, offering advantages such as improved stability, better convergence properties, and higher-quality image synthesis. The LSGAN loss is described as follows:

$${\mathcal{L}_{LSGAN}}(D) = \frac{1}{2}\left[ \mathbb{E}_{y \sim p(CT)}\,{\big(D(y|x) - 1\big)}^2 + \mathbb{E}_{x \sim p(\mathrm{X\text{-}ray})}\,{\big(D(G(x)|x) - 0\big)}^2 \right],$$
$${\mathcal{L}_{LSGAN}}(G) = \frac{1}{2}\,\mathbb{E}_{x \sim p(\mathrm{X\text{-}ray})}\,{\big(D(G(x)|x) - 1\big)}^2,$$
where $x = \{x_1, x_2\}$ denotes the input projections from two orthogonal views, and $y$ is the ground truth CT volume reconstructed by the FDK algorithm. $G(x)$ is the 3D CT reconstructed by the generator. The discriminator $D$ and generator $G$ are trained alternately to compete with each other.
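A minimal sketch of these two objectives is shown below; for brevity the discriminator is written as an unconditional function of the volume, whereas Eqs. (3)-(4) condition it on the input projections x.

```python
# Minimal sketch of the LSGAN objectives in Eqs. (3)-(4); D returns a raw score map
# (PatchGAN style), and the conditioning on x is omitted here for brevity.
def lsgan_d_loss(D, real_ct, fake_ct):
    # push real scores toward 1 and fake scores toward 0
    return 0.5 * (((D(real_ct) - 1) ** 2).mean() + (D(fake_ct.detach()) ** 2).mean())

def lsgan_g_loss(D, fake_ct):
    # the generator tries to make the discriminator output 1 for its fakes
    return 0.5 * ((D(fake_ct) - 1) ** 2).mean()
```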

2.3.2 Reconstruction loss

Adversarial loss aims to align the distribution of the predicted CT volume with that of the ground truth image. However, relying solely on it is insufficient to ensure the generated data closely resembles real-world conditions. Therefore, an additional constraint is necessary to enforce voxel-level proximity between the reconstructed CT and the ground truth. Mean Squared Error (MSE) is employed as the reconstruction loss function to constrain the structural consistency of voxels. The reconstruction loss is defined as follows:

$${\mathrm{{\cal L}}_{re}} = \mathrm{\mathbb{E}}||{G(x )- y} ||_2^2,$$

2.3.3 Projection loss

Inspired by [24], we employ X-ray projections to constrain the predicted volume. The projection loss is defined by comparing the 2D projections of the three orthogonal planes of the CT volume predicted by the generator with the projections from the corresponding ground truth images, aiming to enhance the model's accuracy. The projection loss is defined as follows:

$${\mathcal{L}_{pl}} = \frac{1}{3}\left[ \mathbb{E}\,{\|P_1(G(x)) - P_1(y)\|}_1 + \mathbb{E}\,{\|P_2(G(x)) - P_2(y)\|}_1 + \mathbb{E}\,{\|P_3(G(x)) - P_3(y)\|}_1 \right],$$
where ${P_1}$, ${P_2}$, and ${P_3}$ denote the projection operations along the three orthogonal directions, respectively.
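A minimal sketch of this loss is given below, approximating $P_1$-$P_3$ by parallel projections, i.e., sums of the volume along its three axes; the actual projection operators used in the paper may differ.

```python
# Sketch of the projection loss in Eq. (6), with P1-P3 approximated by axis sums.
import torch

def projection_loss(pred, gt):
    # pred, gt: reconstructed and ground truth volumes of shape (B, 1, D, H, W)
    loss = 0.0
    for axis in (2, 3, 4):  # three orthogonal projection directions
        loss = loss + torch.abs(pred.sum(dim=axis) - gt.sum(dim=axis)).mean()
    return loss / 3.0
```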

2.3.4 SSIM loss

Both projection loss and reconstruction loss are biased toward pixel-level operation. Therefore, SSIM loss is introduced to constrain the similarity between the reconstructed CT and the ground truth. The description of this loss is as follows:

$${\mathcal{L}_{SSIM}} = 1 - \mathrm{SSIM}\big(G(x), y\big).$$

2.3.5 Total loss

The total objective, incorporating the four aforementioned losses, is defined as follows:

$$D = \arg \min {\lambda _1}{\mathrm{{\cal L}}_{LSGAN}}(D ),$$
$$G = \arg \min [{{\lambda_1}{\mathrm{{\cal L}}_{LSGAN}}(G )+ {\lambda_2}{\mathrm{{\cal L}}_{re}} + {\lambda_3}\mathrm{{\cal L}}_{pl}^{} + {\lambda_4}\mathrm{{\cal L}}_{SSIM}^{}} ],$$
where ${\lambda _1}$, ${\lambda _2}$, ${\lambda _3}$ and ${\lambda _4}$ are the weighting factors for the different loss terms. In this work, priority is given to the consistency of voxels; therefore, the weights of the reconstruction, projection, and SSIM losses are increased accordingly. The final weights are set as follows: ${\lambda _1} = 0.1$, ${\lambda _2} = 1$, ${\lambda _3} = {\lambda _4} = 0.5$.
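Assuming the four individual terms have already been computed (for example with the sketches above and any differentiable 3D SSIM implementation), the weighted combination of Eq. (9) reduces to:

```python
# Weighted generator objective of Eq. (9) with the weights stated above.
def generator_total_loss(adv, rec, proj, ssim_value):
    lam1, lam2, lam3, lam4 = 0.1, 1.0, 0.5, 0.5
    return lam1 * adv + lam2 * rec + lam3 * proj + lam4 * (1.0 - ssim_value)
```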

3. Experimental setup and results

3.1 Dataset

A large amount of paired CT and X-ray projection data is required to support the training and tuning of X-CTReNet. We acquired the projection data of six kinds of assembly fuzes. For each assembly case, full-view projections were acquired using the microfocus CT system YXLON FF 20 (Fig. 4), yielding 1800 projections. The acquisition parameters are shown in Table 1. The corresponding CT ground truth images were reconstructed by the FDK algorithm [3] from the full-view projections. We regard two projections from orthogonal views and the corresponding 3D CT image as a data triplet. In total, $6 \times 1800 = 10800$ data triplets were captured. Figure 5 presents a subset of the acquired X-ray projections and the CT ground truth images. Among the six groups of data, one group was used as the validation set and one group was used as the test set.

Fig. 4. The microfocus CT system YXLON FF 20.

Fig. 5. The dataset of fuzes with different assembly cases. The 1st and 2nd rows are X-ray projections from two orthogonal views. The 3rd row represents the ground truth images reconstructed using the FDK algorithm (the dynamic display of rotating 3D reconstructed images is shown in Visualization 1, Visualization 2, Visualization 3, Visualization 4, Visualization 5, and Visualization 6).


Table 1. Data acquisition parameters of the FF20 microfocus CT system

3.2 Experimental implementation details

The projections and CT volumes were resized to 128 × 128 pixels and 128 × 128 × 128 voxels, respectively. The network was trained on a device with an NVIDIA Tesla P100 card, and the training platform is PyTorch. The training process consisted of 200 epochs, with validation performed after each epoch. In the training phase, the Adam optimizer [36] is used to find the optimal solution for the loss functions in Eq. (8) and Eq. (9), with momentum parameters $\beta_1 = 0.5$ and $\beta_2 = 0.99$. The batch size is 4. The initial learning rate was set to 0.0001. After training for 30 epochs, we adopt a linear learning rate decay policy to decrease the learning rate. The parametric structures of all layers are shown in Table 2. In the Basic2d and ResBlock modules, all convolution kernels are of size 3. In the DenseBlock modules, the final convolution has kernel size 1. In the DownSampling modules, the pooling layer uses 2 × 2 max pooling. In the UpSampling modules, the deconvolution layer uses a 3 × 3 kernel with a stride of 2.
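One common way to realize this schedule in PyTorch is sketched below: Adam with $\beta_1 = 0.5$, $\beta_2 = 0.99$, an initial learning rate of 1e-4, and a learning-rate multiplier that stays at 1 for the first 30 epochs and then decays linearly toward zero by epoch 200. The placeholder generator and the exact decay formula are assumptions.

```python
# Sketch of the optimization schedule described above (the linear-decay formula is assumed).
import torch

G = torch.nn.Linear(8, 8)  # placeholder standing in for the X-CTReNet generator
opt = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.5, 0.99))

decay_start, n_epochs = 30, 200
sched = torch.optim.lr_scheduler.LambdaLR(
    opt, lambda e: 1.0 if e < decay_start
    else max(0.0, 1.0 - (e - decay_start) / (n_epochs - decay_start)))

for epoch in range(n_epochs):
    # ... forward/backward passes over training triplets with batch size 4 ...
    opt.step()
    sched.step()
```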


Table 2. Parametric structures of all layers in the network

3.3 Comparison models

To validate the performance of X-CTReNet, comparative experiments were conducted with well-established models, including PatRecon [19], X2CT-GAN [24], and 3DSRNet [29]. PatRecon, based on an encoder-decoder architecture, is trained to map projection radiographs of a patient to the corresponding 3D anatomy. It performs two-view reconstruction by stacking the 2D projections from the two views into one two-channel input and modifying its first convolution layer to match the number of input channels. X2CT-GAN is an emerging GAN-based method specifically designed to reconstruct high-quality images from ultra-sparse view data. Leveraging the advantages of the GAN framework, it enables the reconstruction of CT scans from two orthogonal biplanar X-rays; X2CT-GAN is the first method to explore biplanar X-ray CT reconstruction using deep learning. 3DSRNet converts 2D spinal X-ray images into 3D spinal CT images and adopts a transformer to enhance the learning ability of the generator network. Combining the transformer and attention mechanisms, 3DSRNet can enhance reconstruction results while preserving image details. The PatRecon, X2CT-GAN, and 3DSRNet models employ 3D deconvolution and convolution for CT reconstruction in the decoding stage. The compared methods perform the cross-dimensional feature transformation via channel concatenation and infer the 3D CT at a specific resolution, i.e., 128 × 128 × 128, via cascaded 3D deconvolutions.

3.4 Comparison of experimental results

The evaluation results of the qualitative analysis are displayed in Fig. 6. We selected representative slices of the reconstructed CT volumes in three dimensions for visualization. Compared with the PatRecon and X2CT-GAN networks, the results reconstructed by the proposed X-CTReNet show smaller differences from the ground truth images and have better visual quality. Table 3 enumerates the performance of the aforementioned methods on the fuzes dataset. We conduct a comprehensive comparison based on three metrics: root mean square error (RMSE), peak signal-to-noise ratio (PSNR), and SSIM. Our method achieves the best results on all three metrics. The RMSE results show that our method outperforms PatRecon by 0.005, X2CT-GAN by 0.0031, and 3DSRNet by 0.0009. For the PSNR metric, our method outperforms PatRecon by 3.1192, X2CT-GAN by 2.0703, and 3DSRNet by 0.6459. Meanwhile, our method outperforms PatRecon by 0.0261, X2CT-GAN by 0.0138, and 3DSRNet by 0.0083 in terms of the SSIM metric. Due to the ultra-sparse sampling of X-ray projections, the reconstructed CT cannot substitute for the ground truth CT in any scenario. However, in fast CT imaging, the CT volume reconstructed from two-view projections by the proposed X-CTReNet is more informative than the X-ray projections alone, addressing the overlap issue inherent in observing only 2D projections and better reflecting the spatial distribution of the reconstructed object.
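For reference, the RMSE and PSNR used in Table 3 can be computed as sketched below, assuming volumes normalized to [0, 1] so the PSNR peak value is 1; SSIM is typically taken from an existing implementation such as skimage.metrics.structural_similarity.

```python
# Sketch of the RMSE and PSNR metrics (assumes volumes normalized to [0, 1]).
import numpy as np

def rmse(pred, gt):
    return np.sqrt(np.mean((pred - gt) ** 2))

def psnr(pred, gt, peak=1.0):
    return 20.0 * np.log10(peak / rmse(pred, gt))
```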

Fig. 6. CT reconstruction volume slices in three directions. The 3rd, 5th, 7th, and 9th columns are the difference images between the reconstructed slices and the ground truth slices.


Table 3. Quantitative comparison results of PatRecon, X2CT-GAN, 3DSRNet, and X-CTReNet (Mean ± Standard Deviation)

3.5 High-resolution reconstruction experiment

Due to the memory complexity of 3D convolutional decoders and current hardware computational constraints, the PatRecon model is restricted to low-resolution volumes, with a voxel resolution of 128 × 128 × 128. The proposed X-CTReNet circumvents the costly memory consumption of cascaded 3D convolutions in existing deep learning-based volume reconstruction methods by encoding the channel dimension as the third dimension of the volume image. We constructed a dataset with X-ray projection dimensions of 256 × 256 and CT ground truth image dimensions of 256 × 256 × 256 to train X-CTReNet. Slices of the CT reconstruction results and the difference images with respect to the ground truth are shown in Fig. 7. Three metrics were computed: RMSE was 0.0102 ± 0.0011, SSIM was 0.9967 ± 0.0013, and PSNR was 39.7913 ± 0.0164. Slices of the CT volumes reconstructed at resolutions of 128 and 256, along with magnified views of localized regions, are depicted in Fig. 8. It is evident that the high-resolution CT images exhibit richer details, better reflecting the geometric features of the original scene. Thus, they provide more information for subsequent image processing and analysis tasks, facilitating observation and evaluation.

Fig. 7. CT reconstruction volume slices in three directions (the dynamic display of the rotating 3D reconstructed image is shown in Visualization 7). The 3rd row is the difference images between the reconstructed slices and the ground truth slices. The resolution of the images is 256 × 256 pixels.

Fig. 8. Reconstructed slices of CT volumes with resolutions of 128 and 256, along with magnified views of localized regions.

3.6 Robustness to noise

In practical applications, photon noise during the imaging process can significantly degrade the quality of reconstructed images, so robustness to photon noise is important. We evaluated the robustness of the proposed X-CTReNet to different levels of noise on the fuzes dataset at a resolution of 256. The deep model was not retrained but was tested directly on noisy projection data to better assess its robustness to noise. We introduced various levels of Poisson noise into the projections to evaluate the performance of the proposed 3D reconstruction method under different noise environments, with the photon intensity ${I_0}$ set to $5 \times {10^6},1 \times {10^6},5 \times {10^5},1 \times {10^5},5 \times {10^4}$. The variance $\sigma _e^2$ of the Gaussian electronic noise is fixed at 10.

$${N_i} = \mathrm{Poisson}\{{{I_0}{e^{ - {x_i}}}} \} + \mathrm{Gaussian}({0,\sigma_e^2} ),$$
where ${N_i}$ is the energy value of the $i$th X-ray reaching the detector, ${I_0}$ is the intensity of the incident X-ray, and ${x_i}$ is the line integral of the attenuation coefficient along the $i$th ray. Figure 9 illustrates slices of the 3D reconstruction results and the absolute difference images under different noise levels. X-CTReNet retains image details and structures while maintaining high clarity and accuracy in the reconstructed images. However, as the noise level increases, this performance gradually deteriorates. The experimental results demonstrate that X-CTReNet can handle a wide range of noise intensities, exhibiting good robustness.
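A minimal sketch of how such noisy projections can be simulated from Eq. (10) is given below; the conversion back to line integrals via the negative logarithm and the clipping of non-positive counts are common post-processing assumptions rather than steps stated in the paper.

```python
# Sketch of the Eq. (10) noise model: Poisson photon noise plus Gaussian electronic noise.
import numpy as np

def add_projection_noise(x, I0=5e5, sigma_e2=10.0, seed=0):
    # x: noise-free line integrals of one projection (numpy array)
    rng = np.random.default_rng(seed)
    transmitted = I0 * np.exp(-x)                       # expected photon counts
    noisy = rng.poisson(transmitted).astype(np.float64) \
            + rng.normal(0.0, np.sqrt(sigma_e2), x.shape)
    noisy = np.clip(noisy, 1.0, None)                   # avoid log of non-positive values
    return -np.log(noisy / I0)                          # back to noisy line integrals
```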

Fig. 9. Reconstructed slices of CT volumes with resolutions of 256 under different noise levels. The 2nd row presents the absolute difference images.

3.7 Ablation experiments

To assess the effectiveness of the proposed components, the following experiments were conducted: (a) removal of the EDFB and SAFB modules; (b) removal of the SAFB module, employing direct concatenation and a 1 × 1 convolution on the two-view features in the decoding stage; (c) exclusion of the top-k sparse mechanism from the SAFB module, utilizing only the front part of the attention mechanism, which is then renamed the Attention Fusion Network (AFNet), as shown in Fig. 10.

Fig. 10. Attention Fusion Network (AFNet).

As depicted in Table 4, the introduction of each component enhances the performance of the network, with a greater improvement observed when both components are combined. A comparison between the "EDFB + SAFB" combination and the "EDFB + AFNet" combination demonstrates an enhancement in reconstruction performance, which is primarily attributed to the guidance provided by the top-k sparse attention and yields the best reconstruction performance.


Table 4. Quantitative comparison results of ablation study with X-CTReNet (Mean ± Standard Deviation)

3.8 Model complexity

Reconstruction efficiency also poses a practical challenge in fast CT applications. Therefore, the complexity of the models for two-view projection reconstruction, which is related to the reconstruction efficiency, was evaluated. More details on the parameters and floating point operations (FLOPs) are provided in Table 5. All models were run on an Ubuntu 16.04 system with an NVIDIA P100 GPU, an Intel Xeon E5-2620 v4 CPU @ 2.10 GHz, and 32 GB RAM. All experiments were conducted on the same hardware to ensure consistency and eliminate variations introduced by hardware differences. The proposed X-CTReNet explores the information in the third dimension of CT volumes by increasing the number of feature channels instead of employing memory-intensive 3D convolutions, and thus avoids creating an overly bloated model. During the testing phase, as the discriminator is not involved in inference, we individually calculated the parameters and FLOPs for the generators of the proposed model and X2CT-GAN. The number of parameters of X-CTReNet's generator is only $1.61 \times 10^7$, whereas that of X2CT-GAN's generator is $6.17 \times 10^7$. The FLOPs of the generators of X-CTReNet and X2CT-GAN are $207.64 \times 10^9$ and $311.16 \times 10^9$, respectively. X-CTReNet achieves comparable reconstruction performance while maintaining minimal model complexity.


Table 5. Parameters and FLOPs comparison results of the models

3.9 Walnut reconstruction results

We trained the proposed X-CTReNet on the publicly available 3D walnut CT dataset [37] to evaluate its robustness in various scenarios. The carefully selected dataset comprises 10 groups, each containing X-ray projections acquired with one set of scanning parameters and the corresponding CT ground truth images. In total, $10 \times 1200 = 12000$ data instances were included, with a split ratio of 8:1:1 for the training, testing, and validation sets, respectively. All data were adjusted to a resolution of 256, and pixel values were normalized to the range [0,1].

As illustrated in Fig. 11, the reconstructed walnut CT volumes demonstrate satisfactory outcomes, affirming the effectiveness of the proposed method. The fidelity exhibited in the reconstructions highlights the robustness of X-CTReNet across diverse imaging scenarios. Subtle details are captured within the reconstructed volume, vividly showcasing X-CTReNet's capability to faithfully reproduce the 3D structural features of the walnut samples. These pleasing results not only validate the reliability of our method but also bolster confidence in its potential applications across various real-world settings.

Fig. 11. CT reconstruction volume slices of the walnut data (the dynamic display of the rotating 3D reconstructed image is shown in Visualization 8). The 3rd row is the difference images between the reconstructed slices and the ground truth slices. The resolution of the images is 256.

4. Conclusion and discussion

To address the challenge of fast CT reconstruction when only extremely sparse X-ray projections are available in practical scenarios, we have developed a method that delivers high-quality CT reconstructions from two-view X-ray projections. An end-to-end network model, X-CTReNet, was devised, which learns a cross-dimensional inverse mapping from 2D to 3D using X-ray and CT triplets. Specifically, the specially designed PFMNet and EDNet are exploited to raise the data dimension from 2D X-rays to 3D CT. A novel SAFNet is proposed to amalgamate valuable information from the two views. Projection, reconstruction, SSIM, and adversarial losses are combined to train X-CTReNet, resulting in a CT volume of high quality both visually and quantitatively. Moreover, the proposed X-CTReNet infers the CT voxel depth from the feature channel dimension, thereby alleviating the memory burden imposed by cascaded 3D convolutions.

We conducted extensive experiments on 3D CT image reconstructions using diverse datasets, affirming the effectiveness and robustness of the proposed X-CTReNet. The advancements in reconstruction accuracy and efficiency make X-CTReNet a promising solution for enhancing the capabilities of CT imaging, particularly in scenarios where fast and high-quality reconstructions are crucial. In summary, the notable characteristics of this framework lay a solid foundation for its potential applications in rapid CT imaging.

In the task of reconstructing 3D CT from 2D X-ray projections, we utilize projections captured from two orthogonal directions, whose complementary information aids the model in achieving more accurate results. In practical applications, a dual-source, dual-detector scanning mode can be employed to capture two orthogonal projections simultaneously. Alternatively, two orthogonal X-ray projections can be captured by a single-source, single-detector setup through device rotation. In the future, we will focus on extending the proposed method to dynamic CT imaging applications and on introducing the concept of implicit neural learning into the task of ultra-sparse-view CT reconstruction.

Funding

National Natural Science Foundation of China (62001429, 62122070, 62201520, 62301508); Natural Science Foundation of Shanxi Province (20210302124190, 20210302124191, 202203021212123, 202203021212455, 202303021211149, 202303021222096); Foundation of State Key Laboratory of Dynamic Measurement Technology, North University of China (2022-SYSJJ-08).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper can be obtained from the authors upon reasonable request.

References

1. L. Shen, J. Pauly, and L. Xing, “NeRP: implicit neural representation learning with prior embedding for sparsely sampled image reconstruction,” IEEE Trans. Neural Networks Learn. Syst. (2022).

2. A. J. Jerri, “The Shannon sampling theorem—Its various extensions and applications: A tutorial review,” Proc. IEEE 65(11), 1565–1596 (1977). [CrossRef]  

3. L. A. Feldkamp, L. C. Davis, and J. W. Kress, “Practical cone-beam algorithm,” J. Opt. Soc. Am. A 1(6), 612–619 (1984). [CrossRef]  

4. R. Gordon, R. Bender, and G. T. Herman, “Algebraic reconstruction techniques (ART) for three-dimensional electron microscopy and X-ray photography,” J. Theor. Biol. 29(3), 471–481 (1970). [CrossRef]  

5. A. H. Andersen and A. C. Kak, “Simultaneous algebraic reconstruction technique (SART): A superior implementation of the ART algorithm,” Ultrason. Imaging 6(1), 81–94 (1984). [CrossRef]  

6. D. L. Donoho, “Compressed sensing,” IEEE Trans. Inf. Theory 52(4), 1289–1306 (2006). [CrossRef]  

7. C. Xu, B. Yang, F. Guo, et al., “Sparse-view CBCT reconstruction via weighted Schatten p-norm minimization,” Opt. Express 28(24), 35469–35482 (2020). [CrossRef]  

8. W. Wu, D. Hu, C. Niu, et al., “DRONE: dual-domain residual-based optimization NEtwork for sparse-view CT reconstruction,” IEEE Trans. Med. Imaging 40(11), 3002–3014 (2021). [CrossRef]  

9. H. Zhang, B. Liu, H. Yu, et al., “MetaInv-Net: meta inversion network for sparse view CT image reconstruction,” IEEE Trans. Med. Imaging 40(2), 621–634 (2021). [CrossRef]  

10. T. Su, Z. Cui, J. Yang, et al., “Generalized deep iterative reconstruction for sparse-view CT imaging,” Phys. Med. Biol. 67(2), 025005 (2022). [CrossRef]  

11. Y. Li, X. Q. Sun, S. K. Wang, et al., “MDST: multi-domain sparse-view CT reconstruction based on convolution and swin transformer,” Phys. Med. Biol. 68(9), 095019 (2023). [CrossRef]  

12. J. Harms, T. Wang, M. Petrongolo, et al., “Noise suppression for dual-energy CT via penalized weighted least-square optimization with similarity-based regularization,” Med. Phys. 43(5), 2676–2686 (2016). [CrossRef]  

13. S. Tilley, M. Jacobson, Q. Cao, et al., “Penalized-likelihood reconstruction with high-fidelity measurement models for high-resolution cone-beam imaging,” IEEE Trans. Med. Imaging 37(4), 988–999 (2018). [CrossRef]  

14. E. Y. Sidky and X. Pan, “Image reconstruction in circular cone-beam computed tomography by constrained, total-variation minimization,” Phys. Med. Biol. 53(17), 4777–4807 (2008). [CrossRef]  

15. T. Pan, J. Duan, J. Wang, et al., “Iterative self-consistent parallel magnetic resonance imaging reconstruction based on nonlocal low-rank regularization,” Magn. Reson. Imaging 88, 62–75 (2022). [CrossRef]  

16. J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks 61, 85–117 (2015). [CrossRef]  

17. P. Henzler, V. Rasche, T. Ropinski, et al., “Single-image tomography: 3D volumes from 2D cranial X-Rays,” Comput. Graph. Forum 37(2), 377–388 (2018). [CrossRef]  

18. Y. Wang, Z. Zhong, and J. Hua, “DeepOrganNet: on-the-fly reconstruction and visualization of 3D / 4D lung models from single-view projections by deep deformation network,” IEEE Trans. Visual. Comput. Graphics 26(1), 1 (2019). [CrossRef]  

19. L. Shen, W. Zhao, and L. Xing, “Patient-specific reconstruction of volumetric computed tomography images from a single projection view via deep learning,” Nat. Biomed. Eng. 3(11), 880–888 (2019). [CrossRef]  

20. W. Song, Y. Liang, J. Yang, et al., “Oral-3D: reconstructing the 3D structure of oral cavity from panoramic X-ray,” in 35th AAAI Conference on Artificial Intelligence, AAAI2021 (2021), 1.

21. Y. Lei, Z. Tian, T. Wang, et al., “Deep learning-based real-time volumetric imaging for lung stereotactic body radiation therapy: A proof of concept study,” Phys. Med. Biol. 65(23), 235003 (2020). [CrossRef]  

22. Z. Tan, J. Li, H. Tao, et al., “XctNet: Reconstruction network of volumetric images from a single X-ray image,” Comput. Med. Imaging Graph. 98, 102067 (2022). [CrossRef]  

23. H. C. Shao, Y. Li, J. Wang, et al., “Real-time liver tumor localization via combined surface imaging and a single x-ray projection,” Phys. Med. Biol. 68(6), 065002 (2023). [CrossRef]  

24. X. Ying, H. Guo, K. Ma, et al., “X2CT-GAN: Reconstructing CT from biplanar x-rays with generative adversarial networks,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019), pp. 10611–10620.

25. M. A. R. Ratul, K. Yuan, and W. Lee, “CCX-rayNet: a class conditioned convolutional neural network for biplanar x-rays to CT volume,” in 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI) (IEEE, 2021), pp. 1655–1659.

26. A. Bayat, A. Sekuboyina, J. C. Paetzold, et al., “Inferring the 3D standing spine posture from 2D radiographs,” in Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2020), 12266 LNCS.

27. R. Ge, Y. He, C. Xia, et al., “X-CTRSNet: 3D cervical vertebra CT reconstruction and segmentation directly from 2D X-ray images,” Knowledge-Based Syst. 236, 107680 (2022). [CrossRef]  

28. Z. Chen, L. Guo, R. Zhang, et al., “BX2S-Net: Learning to reconstruct 3D spinal structures from bi-planar X-ray images,” Comput. Biol. Med. 154, 106615 (2023). [CrossRef]  

29. Y. Gao, H. Tang, R. Ge, et al., “3DSRNet: 3-D spine reconstruction network using 2-D orthogonal x-ray images based on deep learning,” IEEE Trans. Instrum. Meas. 72, 1–14 (2023). [CrossRef]  

30. Z. Liu, Y. Lin, Y. Cao, et al., “Swin transformer: hierarchical vision transformer using shifted windows,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV) (2021), pp. 9992–10002.

31. T. Zhao and X. Wu, “Pyramid feature attention network for saliency detection,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019), pp. 3080–3089.

32. G. Zhao, J. Lin, Z. Zhang, et al., “Explicit sparse transformer: concentrated attention through explicit selection,” arXiv, ArXiv abs/1912.1 (2019). [CrossRef]  

33. X. Chen, H. Li, M. Li, et al., “Learning a sparse transformer network for effective image deraining,” in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023), pp. 5896–5905.

34. T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, et al., “High-resolution image synthesis and semantic manipulation with conditional GANs,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2018), pp. 8798–8807.

35. X. Mao, Q. Li, H. Xie, et al., “Least squares generative adversarial networks,” in 2017 IEEE International Conference on Computer Vision (ICCV) (2017), pp. 2813–2821.

36. D. P. Kingma and J. Ba, “Adam: a method for stochastic optimization,” in 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings (2014), pp. 1–15.

37. H. Der Sarkissian, F. Lucka, M. van Eijnatten, et al., “A cone-beam X-ray computed tomography data collection designed for machine learning,” Sci. Data 6(1), 215 (2019). [CrossRef]  

Supplementary Material (8)

Visualization 1       The dynamic display of rotating 3D reconstructed image (1st column of the 3rd row of Fig. 1 in the manuscript) is shown in Visualization 1
Visualization 2       The dynamic display of rotating 3D reconstructed image (2nd column of the 3rd row of Fig. 1 in the manuscript) is shown in Visualization 2
Visualization 3       The dynamic display of rotating 3D reconstructed image (3rd column of the 3rd row of Fig. 1 in the manuscript) is shown in Visualization 3
Visualization 4       The dynamic display of rotating 3D reconstructed image (4th column of the 3rd row of Fig. 1 in the manuscript) is shown in Visualization 4
Visualization 5       The dynamic display of rotating 3D reconstructed image (5th column of the 3rd row of Fig. 1 in the manuscript) is shown in Visualization 5
Visualization 6       The dynamic display of rotating 3D reconstructed image (6th column of the 3rd row of Fig. 1 in the manuscript) is shown in Visualization 6
Visualization 7       The dynamic display of rotating 3D reconstructed image (Fig. 7 in the manuscript) is shown in Visualization 7
Visualization 8       The dynamic display of rotating 3D reconstructed image (Fig. 11 in the manuscript) is shown in Visualization 8
