TCGAN: a transformer-enhanced GAN for PET synthetic CT

Jitao Li; Jitao Li; Jitao Li; Zongjin Qu; Zongjin Qu; Zongjin Qu; Yue Yang; Fuchun Zhang; Meng Li; Shunbo Hu

doi:10.1364/BOE.467683

1. Introduction

Multimodal medical synthesis is an important subtask in medical image processing. Image synthesis techniques can help synthesize images with missing modalities for multimodal image fusion analysis. In our previous study [1], we proposed a generative adversarial network (GAN)-based network [2] that can synthesize computed tomography (CT) data from existing positron emission tomography (PET) data. PET includes less structural information than CT as it is a functional imaging, whereas CT is a structural imaging technique. However, in our experiments, the final synthetic CT images have structural information similar to the real CT images.

Completing image synthesis tasks using traditional approaches is challenging. Recently, deep learning advancements have enabled the generation of high-quality medical images. In this sense, GANs have been successful in recent years, particularly in the field of image synthesis. GAN-based adversarial loss can promote the network to capture image details, synthesize images with better clarity, and greatly reduce the blurring effect of images. Transformers [3] have recently made a splash in computer vision, and their long-range properties allow them to synthesize images globally.

In this study, we present a transformer-improved GAN-based synthesis network. The proposed network can outperform either the GAN or transformer by combining the advantages of the transformer and CNN through series connection. To prevent information loss, a residual link between the two networks is added, which is adopted from ResNet [4]. The main contributions of this study in the field of medical image synthesis can be summarized as follows: 1. We collected datasets captured using the advanced PET equipment (Shandong Madic Technology Co., Ltd.) and conducted a study on PET synthetic CT for attenuation correction using our network. 2. Our innovative use of the transformer combined with the CNN can effectively combine the advantages of both. 3. The effect of $L_1$, $L_2$, and $Smooth\ L_1$ losses on image quality was studied, and image gradient loss was used to constrain image details. 4. We verified the generalization of our model on other datasets.

The remainder of the paper is organized as follows. In Section 2, we present the related works on medical image synthesis and how our work advances the field in relation to previous studies. Section 3 introduces the proposed method in detail. The analysis and presentation of the experimental results are presented in Section 4. Section 5 discusses the proposed method and results. Finally, Section 6 concludes the paper.

2. Related works

Deep learning has become one of the most indispensable tools in medical imaging analysis. Currently, deep learning methods, such as CNN- and GAN-based methods, have rapidly become dominant in medical image synthesis [5]. Most deep learning methods rely heavily on datasets, especially large-scale datasets. Hence, various deep-learning-based concepts cannot be realized due to the paucity of medical image datasets. Medical image synthesis tasks can learn the features of existing datasets and synthesize fake data to assist medical imaging analysis.

Recently, cross-domain synthesis of medical images has gained attention in the field of medical imaging. In most situations, cross-domain synthesis of medical images is employed as an intermediary link in deep learning rather than for direct diagnosis. A summary of medical image synthesis research and its clinical citations was provided by Wang et al. [6]. Medical image synthesis can assist in medical image segmentation [7–9], registration [10,11], classification [12,13], disease detection [14–16], and image super-resolution [17]. Image synthesis techniques are often used to perform high-resolution recovery tasks. In general, the higher the resolution of medical images, the better they are for physicians to diagnose patients.

Several studies have summarized GAN-based methods, showing the great potential of GANs in the field of image synthesis [18,19]. Although GAN networks have made great progress, their blurring effect on synthesized images cannot be ignored. In Ea-GANs [20], an attempt was made to integrate edge information in the GAN generator and discriminator of GAN to reflect the texture structure of the image content and delineate the boundaries of different objects in the image. MedGAN [21] cascades the U-Net [22] structure to deepen the network and feature extractor to constrain the style and content information of the image and promote the generation of higher quality images, achieving in three different medical imaging tasks. Nie et al. [23] proposed a loss function based on image gradient differences to alleviate the ambiguity of the generated CT images and further applied an automatic context model to implement context-aware GANs. Shen et al. [24] used contextual information to synthesize massive images of breast X-rays. Their proposed synthesis method includes contextual edge features, which can generate images with richer texture and edge information. In a previous study [25], MRI images were synthesized by preserving mid- and high-frequency details through an adversarial loss function; enhanced synthesis performances were obtained using pixel-level and perceptual loss functions for registered multi-contrast images and cycle-consistency loss functions for unregistered images. Klaser et al. [26] used a magnetic resonance (MR) image synthesize a CT image, then used the CT image to perform CT-based attenuation correction (CTAC) for PET. However, the images obtained in this way still suffer from artifacts; nonetheless, due to the rich structural information of MR, the synthesis task becomes much simpler. Still, in many cases, MR images are difficult to obtain due to equipment and cost constraints. Based on MedGAN, Upadhyay et al. [27] combined recent work and proposed uncertainty-guided progressive GANs for PET to CT image synthesis and other tasks.

In the field of computer vision, CNN have excellent effects and are widely used in various tasks. The convolution method greatly reduces the number of network parameters. Moreover, it has translation invariance and automatically extracts image features. However, the receptive field of CNNs is usually small, which is not suitable in capturing global features. At present, using a transformer-based network is a better alternative. A transformer has the capacity for long-range dependencies that can capture high-level features. It has gained increased attention in the field of computer vision, and its variants [28–30] have achieved remarkable results.

Owing to the superior performance of the transformer, it has also been applied to the field of medical image analysis, including disease classification [31,32] and medical image segmentation [33–36], registration [37], denoising [38], and synthesis [39].

In our previous work [1], we found that the simple CNN$\mbox {-}$based generator can achieve a good results for PET to CT image synthesis. In order to further improve it, and mitigating the adverse effects of hallucination problem. Our study proposes a medical image synthesis method combining the transformer with CNNs, effectively combining their advantages. The images we use are aligned, while the hallucination is a particularly big problem in misaligned images. Aligned images give a possibility to solve the problem [40]. The proposed method includes CNN-based-generator and transformer, which can not only utilize the information of local regions, but also consider from the long-range information. Through the corresponding learning of this spatial structure, the problem of hallucination can be solved to a certain extent. We use the image gradient loss based on the Prewitt operator to constrain the edge information of the synthesized image and promote the generation of better details and counteract the blurring effect caused by the GAN network. Simultaneously, we study $L_1$, $Smooth L_ 1$, and $L_2$ regularization for medical image synthesis.

3. Methods

3.1 Overall network structure

The basic structure of our network is adopted from the idea of GANs. The network consists of three main components: transformer generator, CNN$\mbox {-}$based generator, and discriminator. The two generator networks are connected in series to enhance the synthesis capability. Figure 1 shows the architecture of our network. We created a transformer-based generator, with the output result serving as crude synthesis information for the following synthesis of the generator. The synthesizer of the network is based on ResNet, which adds short linkages between layers, whereas we add short links between the two generator networks. Therefore, we combined the effects of multiple synthesis networks. The discriminator scores each patch and outputs the overall scoring result to determine the difference between the synthesis and target images.

Fig. 1. TCGAN uses a dual generator architecture, transformer generator, and CNN generator linked in series. The generated and real images will perform the following three operations: (1) Input into the discriminator, and score the synthesis quality of the them; (2) calculate the pixel-wise difference between the two; (3) perform feature extraction on them and calculate the image gradient loss (GDL) for both.

Download Full Size | PDF

3.1.1 Transformer generator

The transformer concept came from the attention mechanism [41] and has been continuously applied in the field of natural language processing (NLP). The transformer architecture has no convolution block, but it has time series and global advantages. The vision transformer (VIT) [28] without down-sampling allows for finer details and enables global perceptual fields for better global consistency [42]. Another characteristic of VIT is that it has a strong dependency on big data. For image synthesis, GPU memory limitations prevents us from directly feeding data into the transformer network. To solve this problem, we added CNN blocks to our transformer generator. The network combines the advantages of the transformer and CNN, and it can learn more information from the original and synthesize images with better quality. We confirmed that the transformer generator contributes to the overall network in our ablation experiments.

The transformer encoder module in the transformer generator borrows the idea of VIT and TransGAN [30]. Simultaneously, we up-sampled and down-sampled the input and output, respectively, to reduce the number of parameters of the model and combine the advantages of CNNs. We also added jump connections to the link information to prevented information loss. The linear flattened vector is fed into the transformer backbone through a multi-layer perceptron (MLP). To scale up to higher-quality images, the upsampling module was inserted after each transformer encoder. Each transformer encoder receives a 1D token embedding sequence as input. After layer norm, it is input into the multi-head self-attention layer, and finally output through layer norm and MLP layer. Figure 2 presents the transformer encoder architecture. In reality, we computed the attention function concurrently on a series of queries, which are grouped into matrix Q. In matrices K and V, the keys and values were also grouped together. The output matrix can be calculated as [3]:

(1)$$Attention(Q,K,V) = softmax(\frac{QK_{}^{T}}{\sqrt{d_{k}^{}}})V,$$

where $Q, K, and \ V$ represent the query, key, and value matrices, respectively, $\sqrt {d_k}$ is the number of columns of the $Q, K$ matrix, that is, the vector dimension.

Fig. 2. Transformer generator concept. To reduce the amount of model parameters, the transformer generator uses up-sampling and down-sampling modules. The detailed transformer encoder structure is shown on the right side of the figure.

Download Full Size | PDF

The transformer generator, inspired by CNNs, chooses to iterate in stages to increase the resolution, with the input flowing through three transformer encoders and up scaling between each transformer encoders.

3.1.2 CNN-based generator

In the CNN-based generator, our network adopts a classic U-Net [22] architecture. As the synthesis task is a relatively complex and difficult task, we deepened the depth of the encoder and decoder to improve the synthesis ability of the synthesizer network. Specifically, we repeated the sampling of feature block with 512 layers four times. Except for these repeated modules, the design of other areas is basically the same as U-Net, including its basic architecture and skip connections. Figure 3 shows the detailed structure. In a previous study, we directly used the CNN-based generator for CT synthesis form PET. Our original study referred to the design of pix2pix and modified the classic U-Net network.

Fig. 3. The overview of our CNN generator architecture.

Download Full Size | PDF

3.1.3 Discriminator

The discriminator can evaluate whether the image generated by the generator is close to the real image, and the generator will strive to improve the ability of synthetic images to fool the discriminator. The synthesis and discrimination ability of the generator and the discriminator can be greatly improved by this continuous confrontation. In our investigations, we used the PatchGAN [40] discriminator (patch discriminator) at the patch level and the pixel discriminator at the pixel level. In the end we choose the former, and its detailed structure is shown in Fig. 4.

Fig. 4. The overview of PatchGAN discriminator.

Download Full Size | PDF

3.2 Loss functions

3.2.1 Adversarial loss

The adversarial loss is an important loss in our network. For the purpose of competition, the generator is designed to fool the discriminator, whereas the discriminator is designed to distinguish between real and synthesis fake images. The adversarial loss embodies the process of the generator and the discriminator competing against each other, constraining each other, and improving the capabilities of each other. We define the image of the input modality as $x(x_1, \ldots, x_m)$, the image of the target modality as $y(y_1, \ldots, y_m)$, and the fake target image synthesized by the network as $\hat y(\hat y_1, \ldots, \hat y_m)$. Therefore, our adversarial loss is given by

(2)$$\begin{aligned} \min_{G}\max_{D}L(D,G)=E_{y \sim p_{data}(y)}logD(y|x) +E_{\hat y \sim p_{data}(\hat y)}[log(1-D(\hat y|x))]. \end{aligned}$$

3.2.2 Image gradient difference loss

Although using pixel loss can be used to achieve our objective, sometimes the network misinterprets our purpose. The overall visual quality of the image is biased when using only the pixel loss, although good results are achieved in some metrics. The edge information of an image is essential, and we can confine the edge information of an image via edge extraction algorithms, such as Sobel and Canny operators. Here, we used the Prewitt-operator-based gradient difference loss (GDL), which was introduced by previous studies [43,44]. The GDL loss is helpful on the task of medical image-to-image synthesis. The Prewitt operator is expressed as

(3)$$\bigtriangledown I_{h} = \begin{bmatrix} +1 & 0 & -1\\ +1 & 0 & -1\\ +1 & 0 & -1\\ \end{bmatrix} * I \ \ \ and \ \ \ \ \bigtriangledown I_{v} = \begin{bmatrix} +1 & +1 & +1\\ 0 & 0 & 0\\ -1 & -1 & -1\\ \end{bmatrix} * I,$$

where $I$ is the target image whose edge features need to be extracted, $\bigtriangledown I_h$ and $\bigtriangledown I_v$ are the gradients in the horizontal and vertical directions, respectively, and $*$ is the convolution process.

We calculated the gradient images in the horizontal and vertical directions and the gradient gap between the real and synthesis images. Our calculation formula is given by

(4)$$\mathcal{L}_{GDL} = |\bigtriangledown I_h -{\bigtriangledown} \hat I_h| + |\bigtriangledown I_v -{\bigtriangledown} \hat I_v|.$$

We selected an image from each test dataset and drew its gradient map, as is shown in Fig. 5. In the figure, the convolution operator can properly extract the images’ edge feature.

Fig. 5. Gradient image of horizontal and vertical directions for three datasets. The Prewitt operator was used to extract image gradient information and compute the image gradient loss.

Download Full Size | PDF

3.2.3 Pixel-wise loss

In the image translation task, the most important objective is to make the synthesis image closer to the real image. Therefore, the pixel-level difference between images is our main concern. Usually, we can use $L_1$, $Smooth\ L_1$ and $L_2$ to calculate the pixel-level differences between images. These three losses can be expressed as

(5)$$L_1 = \frac{\sum_{i=1}^{n}|y_{i}-\hat y_{i}|} {n},$$

(6)$$\begin{aligned} Smooth\ L_1 = \frac{\sum_{i=1}^{n} \{(0.5 \times (y_i-\hat y_i)_{}^{2})|_{|y_i-\hat y_i|<1} +(|y_i-\hat y_i|-0.5)|_{|y_i-\hat y_i|>{=}1}\}} {n}, \end{aligned}$$

(7)$$L_2 = \frac{ \sum_{i=1}^{n} (y_i-\hat y_i)_{}^{2}}{n},$$

where n is the total number of image pixels. $L_1$ loss is more robust in handling outlier points because it does not amplify the loss, $L_2$ loss is more stable with small fluctuations because it is derivable everywhere and has a smaller gradient value around the zero value, and $Smooth \ L_1$ loss combines the advantages of $L_1$ and $L_2$ losses. Based on the focus of the loss function and the detailed information features of our dataset, we chose the $L_2$ loss for the PET synthetic CT task and the $L_1$ loss for the MR modality conversion task. Table 1 lists the three loss’ characteristics.

Table 1. Characteristics of $L_1$, $Smooth L_ 1$, and $L_2$ loss.

View Table | View all tables in this article

3.2.4 Total loss

Therefore, the total loss can be expressed as

(8)$$\mathcal{L}_{total} = \lambda_1 \max_{G}\min_{D}L(D,G) + \lambda_2 L_{pixel} + \lambda_3 L_{GDL},$$

where $L_{pixel}$ is one of $L_1$ and $L_2$, $\lambda _1$, $\lambda _2$, and $\lambda _3$ are the hyperparameters of the three losses with values of 1, 10, and 0.1, respectively.

4. Experiments and results

All experiments were trained with the same training code, except for the model architecture used. In different model training, we tried different hyperparameters but did not obtained improved results; therefore, we settled on a fixed set of parameters. We set the learning rate to 1e-4 (the learning rate of all epochs is attenuated by 0.996), the batch size to 12, and the total epoch to 500. In the experiments, we used two types of graphics cards, 2080(10G) and 2080TI(11G).

We tested our network using three datasets, namely, the Madic small animal dataset (PET to CT images synthesis), IXI dataset [45] (human healthy brain T1 to T2 synthesis), and BraTS 2020 dataset [46–48] (human brain with tumor T1 to T2 synthesis). Four metrics used to evaluate the quality of the synthesized images: structural similarity (SSIM), peak signal-to-noise ratio (PSNR), visual information fidelity (VIF) [49], mean square error (MSE) and frechet inception distance(FID) [50]. The source supplemental Code is as we show in Code 1 (Ref. [51]).

4.1 PET to CT synthesis

Originally, the technology proposed in here was intended to synthesize CT images from PET images. Our dataset was obtained using a high-performance PET/CT device which was produced by Shandong Madic Technology Co., Ltd. with fluorodeoxyglucose (FDG) as the radiopharmaceutical for PET. We aligned and normalized the dataset and sliced two-dimensional images with three different cross-sections. The attenuation correction on PET images is required during the PET reconstruction process. The common method is the CT-based attenuation correction (CTAC) [52], which uses CT images to constrain the reconstructed images. We proposed a deep learning method to perform PET self-based attenuation correction directly which can reduce the cost of the equipment and radiation to the participant.

Using our method, the synthesized CT images are nearly identical the real images using our methods. We also used other image translation methods, which were compared them with a CNN$\mbox {-}$based model pix2pix and a transformer-based model TransGAN. The TransGAN is not suitable for image translation tasks. Therefore, we changed its input and output blocks.

The information of PET images is different from CT images due to the difference in the imaging methods and principles. PET is a functional imaging, which shows more functional information about the target. The collected dataset focuses more on the degree of cellular metabolism of the target being photographed. CT is a structural imaging, which focuses on the structural information of the tissue of the target being photographed. Synthesizing CT from PET, which is a process of synthesizing images with copious information from images with little information is difficult. Our experimental results demonstrate that images from PET that are close to real CT images can be synthesized. An interesting phenomenon is that the CT captures the experimental bed and the mouth guard used for anesthesia, whereas the PET does not contain this information. Using our model, we were able to synthesize this information. We used several advanced methods for the implementation of our task. Based on the results, our method outperforms the other methods, as shown in Table 2.

Table 2. Performance comparison between the proposed TCGAN and different state-of-the-art methods on all three datasets.

View Table | View all tables in this article

The images from the test dataset were also synthesized individually. Some of the results are presented in Fig. 6. Based on the error maps, the images synthesized using our approach are more accurate and has the most detailed information. All error maps were reduced to the same range to accentuate the visual effect of synthesis. The display range of the error map is $0-20$, while the maximum error map is 256. The display range of the error map is $0-20$, while the maximum error map is 256. Our method exhibits the lowest synthetic effect error and a stable error curve. The gray histogram of the error map is also more uniform and more concentrated near the origin. The error curve subgraph is a curve drawn by superimposing the error graph in the vertical direction. Based on its fluctuation, it can be seen that our method exhibits a very obvious advantage. The synthesis impact of TransGAN is the worst among the three methods, and the image brightness and darkness differ significantly. The transformer can provide long-range information; however, using transformers alone without adding more strategies may lead to poor results.

Fig. 6. Synthesis images for different methods on the Madic dataset. To better visualize the error, all error maps are scaled to 0–20, while the maximum error map value is 256. We selected the same area for different methods and zoomed in. Below the magnification of the error map, the gray histogram of the selected area is plotted. Moreover, a horizontal error curve for the box-selected area is presented.

Download Full Size | PDF

4.2 Human brain with tumors T1 to T2 synthesis

The BraTS 2020 dataset, which is a medical image segmentation dataset widely used in brain tumor segmentation, was also used in this study. In this study, we also employed T1- and T2- weighted images. We sliced the axial section of this dataset for two-dimensional images. The training set of the original dataset was utilized as our training set, while the validation set was separated into two, one of which was used as the validation set and the other as the test set. This dataset was also used to evaluate state-of-the-art methods and for the partial ablation studies.

Tumors can induce significant changes in the anatomy of the human brain, affecting T1 and T2 correspondence. Based on the experimental results, the tumor region synthesis may worsen the image more than that of the other regions. As a result, we conducted our study using the BraTS 2020 dataset, in which our approach produces the best results. The final experimental results are shown in Table 2 and Fig. 8. The best results were obtained using our method. BraTS 2020 is a medical image segmentation dataset. We believe that better registration and preprocessing operations on the BtaTS 2020 dataset may be more suitable for our image translation task.

4.3 Human health brain T1 to T2 synthesis

We also tested the model using images of a healthy human brain dataset to demonstrate its strong generalization ability. The data was first gathered from the IXI dataset. T1- and T2-weighted images of 40 participants were evaluated. After registering the images, nearly 90 axial section containing brain tissue without apparent artifacts were selected for each participant [25]. The final numbers of training, validation, and test sets are 2275, 455, and 910, respectively.

We performed experiments on T1 and T2 modality conversion using the IXI dataset to demonstrate the good generalization performance of the model. The synthesis ability of the model for the IXI dataset is substantially superior to that of BraTS 2020 dataset with a tumor area, as shown in Table 2 and Fig. 7 show the final experimental results. Our method also achieves the best results on the IXI dataset, but the advantage is lower than that on Madic dataset.

Fig. 7. Synthesis images of different methods on the BraTS 2020 dataset. Based on the image synthesis details, TCGAN outperforms the other methods. The synthetic effect of the three methods in the lesion area of this human brain tumor dataset is significantly lower than that in the healthy area.

Download Full Size | PDF

Fig. 8. Synthesis images of different methods on the IXI dataset.

Download Full Size | PDF

4.4 Ablation studies

We conducted a series of ablation experiments to verify the superiority of our model. Experiments with and without a transformer generator were carried out to ablate the generator blocks. Defining the transformer generator as $G_t$ and CNN$\mbox {-}$based generator as $G_c$, we conducted experiments using $Gt, Gc, GtGc, GtGcGc, GtGtGc$, and $GtGcGtGc$, respectively. Figure 9 presentes an illustration of our experiment with different generator architectures. We also conducted experiments using the CNN$\mbox {-}$based patch discriminator and pixel-wise discriminator to verify that the patch discriminator exhibited the best performance. Both discriminators were designed based on CNNs. The patch discriminator separately outputs a discriminant for each image patch, while the pixel-wise discriminator scores each pixel. A series of loss function studies were performed on the Madic dataset. For pixel-wise loss, we used $L_1, Smooth\ L_1$, and $L_2$, and the GDL was also studied.

Fig. 9. Exploration of the generator structure. We used serveral architectures:$Gt$, $Gc$, $GtGc$, $GtGcGc$, $GtGcGc$, $GtGcGtGc$, $GtGcGtGcGtGc$ several architectures. $Gt$ only uses a single transformer generator, $Gc$ only uses a single CNN generator. (c) $GtGcGc$ architecture. (d) $GtGtGc$ architecture. (e) Design idea of heavy TCGAN. Multiple TCGANs can be connected in parallel, and each module directly informs each other to reduce the gradient disappearance.

Download Full Size | PDF

4.4.1 Generator study

In our ablation experiments, it is necessary to explore the superiority of CNN$\mbox {-}$based networks and transformer ensembles. Therefore, we studied the effect of using the CNN and transformer generators on the Madic dataset. Simultaneously, we combined the CNN and transformer generators in several manners, including deepening our network. Based on Fig. 10, the effect of using the transformer synthesizer alone is poor, whereas using a CNN generator alone can achieve a good effect. Thus, combining the two using the proposed method achieves the best results.

Fig. 10. Histogram of results using different generator structures, GtGc achieves the best performance.

Download Full Size | PDF

We used the architectures of $Gt, Gc, GtGc, GtGcGc, GtGtGc$, and $GtGcGtGc$ architectures to conduct research on the Madic dataset. Then, we set the batch size of all experiments to unity to ensure the experimental fairness of the experiment and take into account the operation of the large model. The bar graphs of our experimental results using different architectures are plotted in Fig. 10. In this figure, the best results are achieved using the $GtGc$ architecture. Deepening the network did not achieve the desired effect even though we adopted a residual connection strategy. Although using the transformer alone will not achieve good results, it can provide some global information to the CNN network.

4.4.2 Discriminator study

We studied the two discriminators using three datasets, respectively. The output of the first discriminator is based on each image patch, and the second discriminator is a pixel-level discriminator that scores each corresponding pixel of the image. Table 3 lists the results of the two discriminators, and it can be seen that the effect of the patch discriminator is significantly better than that of the pixel discriminator. Figure 11 shows the effect of TCGAN synthesizing images after using the two discriminators. As seen in the error map, using the patch discriminator can achieve better synthesis results. We also plotted the corresponding discriminator output of the discriminator, which can reflect the degree to which the discriminator is deceived. However, instead of pursuing this, we aim to achieve a balance for the generator and discriminator.

Fig. 11. Two discriminators are used on the three datasets. The Dis map is the visual display of the output of the two generators. As the output of different discriminators are not comparable, we did not scale the output graphs of the discriminators to the same range. Patch Discr. and Pixel Discr. indicate the PatchGAN and pixel discriminators, respectively.

Download Full Size | PDF

Table 3. Results using the pixel and patch discriminators.

View Table | View all tables in this article

4.4.3 Loss studies

We studied $L_1$, $Smooth\ L_1$, and $L_2$ losses on the three datasets, as shown in Table 4. On the Madic dataset, $L_2$ loss achieves the best results. Through the study of loss functions, both $L_1$ and $L_2$ losses are found to be suitable for image synthesis tasks. However, their adaptability to different datasets is not the same. The experimental results demonstrate that $L_2$ loss works best for PET to CT conversion, and $L_1$ loss is better for MR T1 to T2 conversion. By analyzing the dataset, it is apparent that the CT data is smoother and contains relatively less details than MR data, while MR data has higher resolution and more detailed information. Therefore, $L_2$ has a better effect on the synthesis of the smooth data with less information, while the $L_1$ loss is more suitable for MR and other image synthesis tasks that contain more details. In many cases, for $Smooth\ L_1$, an average result can be obtained, and the training is relatively stable. In addition, we attempted to design a loss function that selectively uses $L_1$ and $L_2$ according to the image gradient, but the final result is slightly worse than $Smooth\ L_1$. This may be related to the inaccuracy of edge extraction and the chosen hyperparameters to use $L_1$ and $L_2$. We used the models trained by different losses to generate corresponding result images on the Madic dataset, which are shown in Fig. 12. Then, we plotted the measure matrix performance using different loss functions, as shown in Fig. 13. Based on the figure, the $L_2$ loss generally achieves the best results according to various indicators. The experimental results show that the results are significantly improved after using GDL loss. Simultaneously, from the effect of the synthetic image, the detail gap between the synthetic and real images is improved, and it can have better edge and detail information. The effect is poor if only using $L_2$ loss without adding GDL loss. This is related to the characteristics of $L_2$ loss. $L_2$ can generate smoother and better visual effects, but it can lead to the loss of image details, which can be compensated by adding GDL loss.

Fig. 12. Synthesis images of different loss.

Download Full Size | PDF

Fig. 13. Plot of the performance on the Madic dataset using different loss functions. Based on the curve, the $L_2$ loss achieves the best results on various indicators.

Download Full Size | PDF

Table 4. Results of the loss study.

View Table | View all tables in this article

5. Discussion

In this paper, we propose a transformer-enhanced GAN for PET to CT synthesis in this paper. We also investigated whether or not to employ transformer enhancing. The results show that using a transformer generator can increase the image synthesis capacity of the GAN. The transformer overcomes the weakness of the convolutional technique, which only has a local field of view, by focusing on a larger range of regions and hence, discovers the correlation between other regions and the target region. We attempted various combinations of the transformer generator and CNN$\mbox {-}$based synthesizer separately to investigate the impact of the transformer on the synthesis task. For the discriminator part, we used the patch and pixel discriminators. Finally, we studied the performance of the three loss functions, $L1$, $Smooth L1$, and $L2$, on the image synthesis task. Simultaneously, we also investigated the effect of the GDL loss on image synthesis.

We were unable to test the performance of our model on large-scale datasets due to the dependency of the transformer on large datasets and the scarcity of medical image multimodal datasets. The transformer has better performance for large-scale datasets. In our future studies, we will look for multimodal image data from non-medical image datasets to validate the performance of our model.

6. Conclusions

In this study, we proposed a multimodal medical image synthesis method named TCGAN. The addition of the transformer structure can resolve the limitations of CNNs and obtain more contextual information. TCGAN was tested on three datasets (i.e., small animal PET to CT dataset, T1 to T2 modality synthesis of human healthy brain, and T1 to T2 dataset of human brain with tumor) and compared with existing state-of-the-art methods. The results show that our method outperforms other methods, indicating that the augmentation of GANs using the transformer is practical and effective. Furthermore, it provides a new concept for addressing the limitations of CNNs. Our method has a significant effect on PET synthetic CT, however, its superiority on other datasets is not outstanding. Furthermore, we only investigated the possibility of synthesizing CT from PET, but not the problems encountered in the specific application to PET attenuation correction. In future work, we will conduct experiments on other image translation tasks and study the effect of our method on synthetic CT for PET attenuation correction.

Funding

Major Scientific and Technological Innovation Project of Shandong Province (2019JZZY021003); National Natural Science Foundation of China (61771230).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are available in Code 1 [51].

References

1. J. Li, Y. Wang, Y. Yang, X. Zhang, Z. Qu, and S. Hu, “Small animal PET to CT image synthesis based on conditional generation network,” in 2021 14th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), (2021), pp. 1–6.

2. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial Networks,” Adv. neural information processing systems 27 (2014).

3. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Adv. neural information processing systems 30, 1 (2017). [CrossRef]

4. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2016), pp. 770–778.

5. B. Yu, Y. Wang, L. Wang, D. Shen, and L. Zhou, Medical Image Synthesis via Deep Learning (Springer International Publishing, 2020), pp. 23–44.

6. T. Wang, Y. Lei, Y. Fu, J. F. Wynne, W. J. Curran, T. Liu, and X. Yang, “A review on medical imaging synthesis using deep learning and its clinical applications,” J. Appl. Clin. Medical Phys. 22(1), 11–36 (2021). [CrossRef]

7. Y. Huo, Z. Xu, H. Moon, S. Bao, A. Assad, T. K. Moyo, M. R. Savona, R. G. Abramson, and B. A. Landman, “Synseg-net: Synthetic segmentation without target modality ground truth,” IEEE Trans. Med. Imaging 38(4), 1016–1025 (2019). [CrossRef]

8. A. Chartsias, T. Joyce, R. Dharmakumar, and S. A. Tsaftaris, “Adversarial image synthesis for unpaired multi-modal cardiac data,” in International workshop on simulation and synthesis in medical imaging, (Springer, 2017), pp. 3–13.

9. D. Romo-Bucheli, P. Seeböck, J. I. Orlando, B. S. Gerendas, S. M. Waldstein, U. Schmidt-Erfurth, and H. Bogunović, “Reducing image variability across OCT devices with unsupervised unpaired learning for improved segmentation of retina,” Biomed. Opt. Express 11(1), 346–363 (2020). [CrossRef]

10. S. Roy, A. Carass, A. Jog, J. L. Prince, and J. Lee, “MR to CT registration of brains using image synthesis,” in Medical Imaging 2014: Image Processing, vol. 9034 (International Society for Optics and Photonics, 2014), p. 903419.

11. G. Xie, J. Wang, Y. Huang, Y. Zheng, F. Zheng, and Y. Jin, “FedMed-ATL: Misaligned unpaired brain image synthesis via affine transform loss,” arXiv preprint arXiv:2201.12589 (2022).

12. Z. Qin, Z. Liu, P. Zhu, and Y. Xue, “A GAN-based image synthesis method for skin lesion classification,” Comput. Meth. Prog. Bio. 195, 105568 (2020). [CrossRef]

13. M. Frid-Adar, I. Diamant, E. Klang, M. Amitai, J. Goldberger, and H. Greenspan, “GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification,” Neurocomputing 321, 321–331 (2018). [CrossRef]

14. L. Sun, J. Wang, Y. Huang, X. Ding, H. Greenspan, and J. Paisley, “An adversarial learning approach to medical image synthesis for lesion detection,” IEEE J. Biomed. Health Inform. 24(8), 2303–2314 (2020). [CrossRef]

15. J. Zhang, X. He, L. Qing, F. Gao, and B. Wang, “Bpgan: Brain pet synthesis from mri using generative adversarial network for multi-modal alzheimer’s disease diagnosis,” Comput. Meth. Prog. Bio. 217, 106676 (2022). [CrossRef]

16. Y. He, J. Li, S. Shen, K. Liu, K. K. Wong, T. He, and S. T. C. Wong, “Image-to-image translation of label-free molecular vibrational images for a histopathological review using the UNet+/seg-cGAN model,” Biomed. Opt. Express 13(4), 1924–1938 (2022). [CrossRef]

17. Y. Luo, L. Zhou, B. Zhan, Y. Fei, J. Zhou, Y. Wang, and D. Shen, “Adaptive rectification based adversarial network with spectrum constraint for high-quality PET image synthesis,” Med. Image Anal. 77, 102335 (2022). [CrossRef]

18. M. K. alias Anbu Devi and K. Suganthi, “Review of medical image synthesis using GAN techniques,” in ITM Web of Conferences, vol. 37 (EDP Sciences, 2021), p. 01005.

19. M. S. Meharban, M. K. Sabu, and S. krishnan, “Introduction to medical image synthesis using deep learning:a review,” in 2021 7th International Conference on Advanced Computing and Communication Systems (ICACCS), vol. 1 (2021), pp. 414–419.

20. B. Yu, L. Zhou, L. Wang, Y. Shi, J. Fripp, and P. Bourgeat, “Ea-GANs: edge-aware generative adversarial networks for cross-modality MR image synthesis,” IEEE Trans. Med. Imaging 38(7), 1750–1762 (2019). [CrossRef]

21. K. Armanious, C. Jiang, M. Fischer, T. Küstner, T. Hepp, K. Nikolaou, S. Gatidis, and B. Yang, “MedGAN: Medical image translation using GANs,” Comput. Med. Imaging Graph. 79, 101684 (2020). [CrossRef]

22. O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention, (Springer, 2015), pp. 234–241.

23. D. Nie, R. Trullo, J. Lian, C. Petitjean, S. Ruan, Q. Wang, and D. Shen, “Medical image synthesis with context-aware generative adversarial networks,” in International conference on medical image computing and computer-assisted intervention, (Springer, 2017), pp. 417–425.

24. T. Shen, K. Hao, C. Gou, and F.-Y. Wang, “Mass image synthesis in mammogram with contextual information based on GANs,” Comput. Meth. Prog. Bio. 202, 106019 (2021). [CrossRef]

25. S. U. Dar, M. Yurt, L. Karacan, A. Erdem, E. Erdem, and T. Çukur, “Image synthesis in multi-contrast MRI with conditional generative adversarial networks,” IEEE Trans. Med. Imaging 38(10), 2375–2388 (2019). [CrossRef]

26. K. Kläser, T. Varsavsky, P. Markiewicz, T. Vercauteren, D. Atkinson, K. Thielemans, B. Hutton, M. J. Cardoso, and S. Ourselin, “Improved MR to CT synthesis for PET/MR attenuation correction using imitation learning,” in International Workshop on Simulation and Synthesis in Medical Imaging, (Springer, 2019), pp. 13–21.

27. U. Upadhyay, Y. Chen, T. Hepp, S. Gatidis, and Z. Akata, “Uncertainty-guided progressive GANs for medical image translation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, (Springer, 2021), pp. 614–624.

28. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” CoRR abs/2010.11929 (2020).

29. Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2021), pp. 10012–10022.

30. Y. Jiang, S. Chang, and Z. Wang, “TransGAN: Two transformers can make one strong GAN,” CoRR abs/2102.07074 (2021).

31. Y. Dai, Y. Gao, and F. Liu, “Transmed: Transformers advance multi-modal medical image classification,” Diagnostics 11(8), 1384 (2021). [CrossRef]

32. S. A. Kamran, K. F. Hossain, A. Tavakkoli, S. L. Zuckerbrod, and S. A. Baker, “Vtgan: Semi-supervised retinal image synthesis and disease prediction using vision transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2021), pp. 3235–3245.

33. J. M. J. Valanarasu, P. Oza, I. Hacihaliloglu, and V. M. Patel, “Medical transformer: Gated axial-attention for medical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, (Springer, 2021), pp. 36–46.

34. Y. Zhang, H. Liu, and Q. Hu, “Transfuse: Fusing transformers and cnns for medical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, (Springer, 2021), pp. 14–24.

35. J. Chen, Y. Lu, Q. Yu, X. Luo, E. Adeli, Y. Wang, L. Lu, A. L. Yuille, and Y. Zhou, “Transunet: Transformers make strong encoders for medical image segmentation,” arXiv preprint arXiv:2102.04306 (2021).

36. D. Karimi, S. D. Vasylechko, and A. Gholipour, “Convolution-free medical image segmentation using transformers,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, (Springer, 2021), pp. 78–88.

37. T. C. W. Mok and A. C. S. Chung, “Affine medical image registration with coarse-to-fine vision transformer,” (2022).

38. A. Luthra, H. Sulakhe, T. Mittal, A. Iyer, and S. Yadav, “Eformer: Edge enhancement based transformer for medical image denoising,” arXiv preprint arXiv:2109.08044 (2021).

39. O. Dalmaz, M. Yurt, and T. Çukur, “Resvit: Residual vision transformers for multi-modal medical image synthesis,” arXiv preprint arXiv:2106.16031 (2021).

40. P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2017), pp. 5967–5976.

41. D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473 (2014).

42. R. Ranftl, A. Bochkovskiy, and V. Koltun, “Vision transformers for dense prediction,” CoRR abs/2103.13413 (2021).

43. C. Hognon, F. Tixier, D. Visvikis, and V. Jaouen, “Influence of gradient difference loss on MR to PET brain image synthesis using GANs,” SNMMI Annual Meeting 2020 (2020). Poster.

44. D. Nie, R. Trullo, J. Lian, L. Wang, C. Petitjean, S. Ruan, Q. Wang, and D. Shen, “Medical image synthesis with deep convolutional adversarial networks,” IEEE. Trans. Biomed. Eng. 65(12), 2720–2730 (2018). [CrossRef]

45. BrainDevelopment.Org, “IXI dataset,” Imperial College, London, 2015, https://brain-development.org/ixi-dataset/.

46. S. Bakas, M. Reyes, A. Jakab, et al., “Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the BRATS challenge,” CoRR abs/1811.02629 (2018).

47. S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J. S. Kirby, J. B. Freymann, K. Farahani, and C. Davatzikos, “Advancing the cancer genome atlas glioma MRI collections with expert segmentation labels and radiomic features,” Sci. Data 4(1), 170117 (2017). [CrossRef]

48. B. H. Menze, A. Jakab, S. Bauer, et al., “The multimodal brain tumor image segmentation benchmark (BRATS),” IEEE Trans. Med. Imaging 34(10), 1993–2024 (2015). [CrossRef]

49. H. Sheikh and A. Bovik, “Image information and visual quality,” IEEE Trans. Image Process. 15(2), 430–444 (2006). [CrossRef]

50. M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “GANs trained by a two time-scale update rule converge to a local nash equilibrium,” in Advances in Neural Information Processing Systems, vol. 30 (Curran Associates, Inc., 2017).

51. J. Li, “Source code for tcgan,” GitHub (2022). https://github.com/jinxiqinghuan/TCGAN.

52. C. Bai, C.-H. Tung, J. Kolthammer, L. Shao, K. Brown, Z. Zhao, A. Da Silva, J. Ye, D. Gagnon, M. Parma, and E. Walsh, “CT-based attenuation correction in PET image reconstruction for the Gemini system,” in 2003 IEEE Nuclear Science Symposium. Conference Record (IEEE Cat. No.03CH37515), vol. 5 (2003), pp. 3082–3086

$L_{1}$	$L_{2}$	Smooth $L_{1}$
Robust	Not robust	Robust
Unstable solution	Stable solution	Stable solution
Multiple solutions	One solution	One solution

	Madic dataset					BraTS 2020 dataset					IXI dataset
	SSIM	PSNR	VIF	MSE	FID	SSIM	PSNR	VIF	MSE	FID	SSIM	PSNR	VIF	MSE	FID
pix2pix	0.965	43.54	0.779	0.000235	74.01	0.924	28.31	0.296	0.00575	35.95	0.825	26.96	0.318	0.00212	80.86
TransGAN	0.960	41.56	0.725	0.000257	92.56	0.911	25.96	0.260	0.00592	98.69	0.844	25.98	0.316	0.00266	124.49
Ours	0.966	45.66	0.836	0.000203	69.94	0.930	31.29	0.320	0.00557	26.11	0.867	27.37	0.342	0.00197	55.23

	Madic dataset					BraTS 2020 dataset					IXI dataset
	SSIM	PSNR	VIF	MSE	FID	SSIM	PSNR	VIF	MSE	FID	SSIM	PSNR	VIF	MSE	FID
Pixel Discriminator	0.966	45.39	0.830	0.000204	75.15	0.928	31.18	0.311	0.005636	37.63	0.853	27.27	0.340	0.001987	61.41
Patch Discriminator	0.966	45.66	0.836	0.000203	69.94	0.930	31.29	0.320	0.005573	26.11	0.867	27.37	0.342	0.001968	55.23

	Madic dataset					BraTS 2020 dataset					IXI dataset
	SSIM	PSNR	VIF	MSE	FID	SSIM	PSNR	VIF	MSE	FID	SSIM	PSNR	VIF	MSE	FID
L1 + GDL	0.968	45.40	0.812	0.000221	74.88	0.930	31.29	0.320	0.00557	26.11	0.867	27.37	0.342	0.001968	55.23
Smooth L1+GDL	0.967	45.46	0.832	0.000204	80.68	0.924	28.31	0.296	0.00575	37.45	0.840	27.09	0.326	0.002091	64.76
L2 + GDL	0.966	45.66	0.832	0.000203	69.94	0.924	28.28	0.293	0.00574	30.33	0.843	27.17	0.328	0.002042	74.74
L2 without GDL	0.965	45.42	0.802	0.000204	95.53	0.923	27.71	0.288	0.00588	29.85	0.833	26.88	0.314	0.002185	76.96

$L_{1}$	$L_{2}$	Smooth $L_{1}$
Robust	Not robust	Robust
Unstable solution	Stable solution	Stable solution
Multiple solutions	One solution	One solution

TCGAN: a transformer-enhanced GAN for PET synthetic CT

Abstract

Corrections

1. Introduction

2. Related works

3. Methods

3.1 Overall network structure

3.1.1 Transformer generator

3.1.2 CNN-based generator

3.1.3 Discriminator

3.2 Loss functions

3.2.1 Adversarial loss

3.2.2 Image gradient difference loss

3.2.3 Pixel-wise loss

3.2.4 Total loss

4. Experiments and results

4.1 PET to CT synthesis

4.2 Human brain with tumors T1 to T2 synthesis

4.3 Human health brain T1 to T2 synthesis

4.4 Ablation studies

4.4.1 Generator study

4.4.2 Discriminator study

4.4.3 Loss studies

5. Discussion

6. Conclusions

Funding

Disclosures

Data availability

References

Supplementary Material (1)

Data availability

Cited By

Figures (13)

Tables (4)

Equations (8)

Biomedical Optics Express