
MC-GAT: multi-layer collaborative generative adversarial transformer for cholangiocarcinoma classification from hyperspectral pathological images


Abstract

Accurate histopathological analysis is the core step in the early diagnosis of cholangiocarcinoma (CCA). Compared with color pathological images, hyperspectral pathological images have the advantage of providing rich band information. Existing hyperspectral image (HSI) classification algorithms are dominated by convolutional neural networks (CNNs), which have the deficiency of distorting the spectral sequence information of HSI data. Although the vision transformer (ViT) alleviates this problem to a certain extent, the expressive power of the transformer encoder gradually decreases as the number of layers increases, which still degrades classification performance. In addition, labeled HSI samples are limited in practical applications, which restricts the performance of these methods. To address these issues, this paper proposes a multi-layer collaborative generative adversarial transformer termed MC-GAT for CCA classification from hyperspectral pathological images. MC-GAT consists of two pure transformer-based neural networks: a generator and a discriminator. The generator learns the implicit probability distribution of real samples and transforms noise sequences into band sequences, which produces fake samples. These fake samples and the corresponding real samples are mixed together as input to confuse the discriminator, which increases model generalization. In the discriminator, a multi-layer collaborative transformer encoder is designed to integrate output features from different layers into collaborative features, which adaptively mines the progressive relations from shallow to deep encoders and enhances the discriminating power of the discriminator. Experimental results on the Multidimensional Choledoch database demonstrate that the proposed MC-GAT achieves better classification results than many state-of-the-art methods. This confirms the potential of the proposed method to aid pathologists in CCA histopathological analysis from hyperspectral imagery.

© 2022 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

Cholangiocarcinoma (CCA) is a malignant tumor with a high risk of recurrence. It is reported that CCA accounts for 3% of gastrointestinal cancers and has a yearly incidence of 5000 new cases [1], whereas accurate diagnosis at an early stage can markedly improve the 5-year survival rate of CCA. Among all diagnostic methods, histopathological analysis is the gold standard [2]. However, the diagnostic procedure is usually time-consuming and relies heavily on the expertise of pathologists. Therefore, automatic classification of CCA pathological images provides pathologists with valuable support.

The computer-aided diagnosis (CADx) system assists pathologists in relieving heavy workloads by analyzing pathological images quantitatively. Traditional CADx typically adopts a large number of handcrafted features, such as cell size, shape, texture, and the distribution of pixel intensity in cells and nuclei [3–5]. These handcrafted features are then exploited to build diagnostic models [6,7]. However, the limited representational capacity of handcrafted features makes it difficult to meet the demands of high-accuracy pathological image analysis.

Deep learning (DL) algorithms for CADx systems have gained interest over the past several years [8–11]. Deep learning methods construct multiple layers of networks that extract hierarchical feature representations. Villa-Pulgarin et al. [12] proposed three convolutional neural network models to distinguish different skin lesions. Iizuka et al. [13] used CNNs and recurrent neural networks (RNNs) to classify histopathology whole-slide images (WSIs) into three subtypes: adenocarcinoma, adenoma, and non-neoplastic. Kiani et al. [14] proposed a deep learning-based algorithm to aid pathologists in distinguishing between two subtypes of primary liver cancer. However, these DL methods are based on color imaging, which is unable to portray the intricate biological structures of pathological images and leads to constrained analysis performance [15,16].

Hyperspectral imaging, as a new imaging modality, has great potential for pathological image analysis [17,18]. It comprises both spatial and spectral information, allowing it to extract rich information from histopathological tissue. Li et al. [19] exploited a deep CNN for tumor/non-tumor binary prediction from microscopic pathological HSIs. Hu et al. [20] proposed a DL analysis method for the pathological diagnosis of gastric cancer tissue based on the microscopic hyperspectral technique. Wang et al. [21] designed a 1-D convolutional neural network to classify hyperspectral data of hepatocellular carcinoma pathological slices. However, the CNNs in these models only have local receptive fields, which can hardly capture long-range relations in sequential information. In addition, an excessive focus on extracting spatial information causes CNNs to distort spectral information in the learned features [22]. These deficiencies result in a bottleneck for further improving HSI classification performance.

In recent years, the vision transformer (ViT) has made significant progress in the realm of computer vision. The transformer is based on the self-attention mechanism [23], and it captures global dependencies of the input sequence. Zhou et al. [24] proposed a swin-spectral transformer network, which designs a spectral multi-head self-attention (Spectral-MSA) in the spectral dimension to extract spectral features. Li et al. [25] proposed a spectral context-aware transformer (SCAT) segmentation method for cholangiocarcinoma hyperspectral images, which mines spatial geometric structure information and obtains the relevant features of the region of interest. Yun et al. [26] proposed a spectral transformer, which formulates contextual feature learning across spectral bands. However, owing to the lack of inductive biases such as translation equivariance and locality, these transformer-based models have limited generalization ability when trained on insufficient amounts of data [27]. Furthermore, the size of pathological images is very large, and the shortage of pathologists makes it difficult to realize large-scale annotation. This restricts the use of transformer-based approaches that require large-scale labeled data.

Generative adversarial networks (GANs) have gained a lot of attention in medical applications due to their capability for data augmentation [28], which effectively alleviates the problem of insufficient samples. Hu et al. [29] developed a unified GAN architecture with a novel loss formulation to achieve robust cell-level visual representation learning in an unsupervised setting. Madani et al. [30] validated that deep GANs are able to learn the visual structure in medical imaging domains (particularly in chest X-rays). Lahiri et al. [31] proposed a semi-supervised GAN that mitigates labeling effort for retinal vessel classification. Frid-Adar et al. [32] used a GAN for data augmentation, which improved CNN performance in medical image classification. However, transformers with skip connections and multi-layer perceptrons lose expressive power doubly exponentially with respect to network depth [33], which restricts the prospect of combining transformers and GANs, and there is no effective mechanism of information exchange between network layers to alleviate this problem. Although some related works [22,34] have been proposed to tackle this problem, they ignore the sequence relations from shallow to deep features in transformer encoders, which leads to limited classification performance.

To overcome the above drawbacks, a new multi-layer collaborative generative adversarial transformer termed MC-GAT is proposed for CCA classification from hyperspectral imagery. MC-GAT consists of two pure transformer-based networks: a generative network and a discriminative network. The key contributions of this paper are listed as follows.

(1) To increase model generalization, the generative network adopts a three-stage up-scale strategy and transforms noise into fake samples to confuse the discriminative network.

(2) In the discriminative network, the multi-layer collaborative transformer encoder adaptively integrates output features from different layers into collaborative features, which enhances information interaction of different layers and improves the classification performance.

This paper is organized as follows. Section 2 briefly reviews some related works and details our proposed method. In Section 3, the experimental results on a CCA hyperspectral pathological image dataset are presented to indicate the capability of the MC-GAT method. Section 4 provides some concluding remarks and suggestions for future work.

2. Materials and methods

2.1 Related works

2.1.1 Generative Adversarial Networks (GAN)

In the field of machine learning, there are two types of supervised learning approaches: generative approaches and discriminative approaches. Generative approaches synthesize new samples by learning the distribution of the explicit or implicit variables of the real data. GAN is a kind of implicit generative model, which is made up of two neural networks: a generator and a discriminator. The generator accepts random noise $\boldsymbol {z}$ from a prior distribution $p_{noise}(\boldsymbol {z})$ as input and learns a generator function $G(\boldsymbol {z})$ that transforms $\boldsymbol {z}$ into real-looking samples. The discriminator takes samples $\boldsymbol {y}$ as input and estimates the probability $D(\boldsymbol {y})$ that $\boldsymbol {y}$ is drawn from the real data distribution $p_{data}(\boldsymbol {y})$. In the optimization procedure, the discriminator is trained to maximize $D(\boldsymbol {y})$ while minimizing $D(G(\boldsymbol {z}))$, whereas the generator is trained to maximize $D(G(\boldsymbol {z}))$. Therefore, the ultimate aim of the optimization is to solve the minimax problem:

$$\min _{G} \max _{D} V(D, G)={E}_{\boldsymbol{y} \sim p_{data }(\boldsymbol{y})}[\log D(\boldsymbol{y})]+{E}_{\boldsymbol{z} \sim p_{noise}(\boldsymbol{z})}[\log (1-D(G(\boldsymbol{z})))]$$
where $E$ is the expectation operator. After training, the distribution of samples synthesized by the generator from $p_{noise}(\boldsymbol {z})$ should approximate the real data distribution $p_{data}(\boldsymbol {y})$ as closely as possible.
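To make this objective concrete, the following is a minimal PyTorch sketch of the vanilla GAN game in Eq. (1) with the common non-saturating generator update; the toy network sizes (a 60-band spectrum generated from 64-dimensional noise) are illustrative assumptions rather than the MC-GAT architecture.

import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 60))   # noise -> fake band sequence
D = nn.Sequential(nn.Linear(60, 128), nn.ReLU(), nn.Linear(128, 1))    # sample -> real/fake logit

bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

real = torch.randn(32, 60)   # stand-in for a batch of real band sequences y ~ p_data(y)
z = torch.randn(32, 64)      # noise z ~ p_noise(z)

# Discriminator step: maximize log D(y) + log(1 - D(G(z)))
fake = G(z).detach()
loss_d = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: maximize log D(G(z)) (non-saturating form of the same minimax game)
loss_g = bce(D(G(z)), torch.ones(32, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()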

2.1.2 Vision transformer encoder

The transformer is a type of deep neural network that is built mainly on the self-attention mechanism, which excels at modeling the degree of correlation within the input sequence. The flowchart of the overall transformer encoder is presented in Fig. 1. Denote the sequence data as $\boldsymbol x = \left [ \boldsymbol x_1, \boldsymbol x_2, \ldots,\boldsymbol x_n\right ] \in {\Re ^{dim \times n}}$, where $dim$ and $n$ are the dimension and number of tokens, respectively. For classification tasks, a learnable classification token $\boldsymbol x_{n+1}$ is usually concatenated to $\boldsymbol x$. Because the transformer operates on all tokens simultaneously and identically, the order of the sequence is otherwise neglected. To utilize the sequential information, a common solution is to add learnable positional encodings $\boldsymbol {o} = \left [\boldsymbol {o}_1, \boldsymbol {o}_2, \ldots,\boldsymbol {o}_{n+1} \right ] \in {\Re ^{dim \times (n+1)}}$, which forms new sequence data $\boldsymbol x^{'}$. These procedures can be written as

$$\boldsymbol x^{'} = [\boldsymbol x_{n+1}, \boldsymbol x] + \boldsymbol{o}.$$
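As a minimal sketch of Eq. (2), the snippet below prepends a learnable classification token and adds learnable positional encodings in PyTorch; the batch-first layout and the sizes (n = 81 tokens of dimension 60) are illustrative assumptions.

import torch
import torch.nn as nn

dim, n = 60, 81
x = torch.randn(1, n, dim)                            # token sequence [x_1, ..., x_n]
cls_token = nn.Parameter(torch.zeros(1, 1, dim))      # learnable classification token x_{n+1}
pos_embed = nn.Parameter(torch.zeros(1, n + 1, dim))  # learnable positional encodings o

x_prime = torch.cat([cls_token.expand(x.shape[0], -1, -1), x], dim=1) + pos_embed
print(x_prime.shape)  # torch.Size([1, 82, 60])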

When $\boldsymbol x^{'}$ is passed into a transformer model, attention weights are calculated between every pair of tokens simultaneously. As shown in Fig. 2, for every token $\boldsymbol x_i, i\in \left \lbrace 1,2,\ldots,n+1\right \rbrace$, the attention unit produces embeddings that contain relational information about the token itself along with the other tokens via three attention weight matrices: the query weights $\boldsymbol W^q \in {\Re ^{dim_{model} \times dim}}$, the key weights $\boldsymbol W^k \in {\Re ^{dim_{model} \times dim}}$, and the value weights $\boldsymbol W^v \in {\Re ^{dim_{model} \times dim}}$, where ${dim_{model}}$ is the mapping dimension of the attention weights. These matrices linearly transform the original tokens and increase the diversity of the model's feature sampling. Then, a query vector $\boldsymbol q_i \in {\Re ^{dim_{model} \times 1}}$, a key vector $\boldsymbol k_i \in {\Re ^{dim_{model} \times 1}}$, and a value vector $\boldsymbol v_i \in {\Re ^{dim_{model} \times 1}}$ can be calculated as follows:

$$\boldsymbol q_i = \boldsymbol W^{q}\boldsymbol x_i$$
$$\boldsymbol k_i = \boldsymbol W^{k}\boldsymbol x_i$$
$$\boldsymbol v_i = \boldsymbol W^{v} \boldsymbol x_i.$$

Then, for any token $\boldsymbol x_j$, $j\in \left \lbrace 1,2,\ldots,n+1\right \rbrace$, its weight corresponding to $\boldsymbol x_m$, $m\in \left \lbrace 1,2,\ldots,n+1\right \rbrace$ can be obtained according to the $\boldsymbol q_j$ and $\boldsymbol k_m$. This is modeled as

$$\hat{a}_{j}^{m}=\frac{\exp \left(\frac{\boldsymbol k_m^T\boldsymbol q_j}{\sqrt{dim_{model}}}\right)}{\sum_{t=1}^{n+1} \exp \left(\frac{\boldsymbol k_t^T \boldsymbol q_j }{\sqrt{dim_{model}}}\right)}$$
in which $\sqrt {dim_{model}}$ is designed for normalization. Finally, the changed token $\boldsymbol c_j$ can be calculated by a weighted average operation:
$$\boldsymbol c_{j}=\sum_{m=1}^{n+1} \hat{a}_{j}^{m} \boldsymbol v_m.$$

Using the softmax function, the above attention computation can be written as a single matrix operation for efficient computation:

$$\operatorname{Attention}(\boldsymbol Q, \boldsymbol K, \boldsymbol V)=\operatorname{softmax}\left(\frac{\boldsymbol Q^{\mathrm{T}} \boldsymbol K}{\sqrt{dim_{model}}}\right) \boldsymbol V^\mathrm{T}$$
where $\boldsymbol Q=[ \boldsymbol q_1, \boldsymbol q_2, \ldots,\boldsymbol q_{n+1} ] \in {\Re ^{dim_{model} \times (n+1)}}$, $\boldsymbol K=[ \boldsymbol k_1, \boldsymbol k_2, \ldots,\boldsymbol k_{n+1} ] \in {\Re ^{dim_{model} \times (n+1)}}$, $\boldsymbol V=[ \boldsymbol v_1, \boldsymbol v_2, \ldots,\boldsymbol v_{n+1} ] \in {\Re ^{dim_{model} \times (n+1)}}$
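The following is a minimal sketch of the scaled dot-product self-attention in Eqs. (3)–(8), written with a batch-first layout in PyTorch; dim_model = 64 and the token count are illustrative assumptions.

import math
import torch
import torch.nn as nn

dim, dim_model, n_tok = 60, 64, 82
Wq = nn.Linear(dim, dim_model, bias=False)   # query weights W^q
Wk = nn.Linear(dim, dim_model, bias=False)   # key weights W^k
Wv = nn.Linear(dim, dim_model, bias=False)   # value weights W^v

x = torch.randn(1, n_tok, dim)               # x' with the classification token already attached
Q, K, V = Wq(x), Wk(x), Wv(x)                # each of shape (1, n_tok, dim_model)

attn = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(dim_model), dim=-1)  # weights a_j^m
out = attn @ V                               # changed tokens c_j, shape (1, n_tok, dim_model)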

Fig. 1. The flowchart of the overall transformer encoder.

Fig. 2. The detailed process of the self-attention mechanism.

To obtain greater power to encode multiple relations and nuances of each token, multi-head self-attention (MHSA) is introduced to capture richer interpretations of the sequence. It repeats the computation multiple times with different projections in parallel and then concatenates all the results together. The multi-head attention is defined as

$$\operatorname {Multi-Head}(\boldsymbol Q, \boldsymbol K, \boldsymbol V) = \operatorname {concat}( head_1, head_2, \ldots, head_i, \ldots, head_{h})\boldsymbol W^{O}$$
where $head_i = \operatorname {Attention}(\boldsymbol W_i^{Q} \boldsymbol x^{'}, \boldsymbol W_i^K \boldsymbol x^{'}, \boldsymbol W_i^V \boldsymbol x^{'})$, $i \in {\left \lbrace 1,2, \dots, h\right \rbrace }$, $h$ is the number of attention heads, $\boldsymbol W_i^Q\in {\Re ^{dim_{model} \times dim}}$, $\boldsymbol W_i^K\in {\Re ^{dim_{model} \times dim}}$, and $\boldsymbol W_i^V\in {\Re ^{dim_{model} \times dim}}$.

At last, a feedforward network (FFN) consisting of two successive affine transformations with Gaussian error linear unit (GELU) activation [35] is applied, that is,

$$\operatorname{FFN}(\boldsymbol t)= \boldsymbol w_{2}[\operatorname{GELU}\left(\boldsymbol w_{1} \boldsymbol t+\boldsymbol b_{1}\right)] + \boldsymbol b_{2}$$
where $\boldsymbol w_{1}$ and $\boldsymbol w_{2}$ are the weights of the affine transformations, $\boldsymbol b_{1}$ and $\boldsymbol b_{2}$ are the corresponding biases, $\boldsymbol t$ is the output of the previous layer, and GELU is the Gaussian error linear unit activation. The FFN is applied to each token position identically, with different parameters between layers, which performs a similar role to a point-wise convolution. To enhance the scalability of the transformer encoder, all of the sub-layers adopt a residual connection [36] and layer normalization [37]. The output $\boldsymbol s_{out}$ of each sub-layer can be expressed by
$$\boldsymbol s_{out} = \operatorname{LayerNorm}(\boldsymbol s + \operatorname{Sublayer}(\boldsymbol s))$$
where $\boldsymbol s$ is the input of the sub-layer, $\operatorname{Sublayer}(\bullet)$ denotes the MHSA or FFN operation, and LayerNorm is layer normalization.
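Putting these pieces together, here is a minimal sketch of one encoder layer (MHSA, the FFN with GELU, residual connections, and layer normalization) built from standard PyTorch modules; the hyperparameters are assumptions and this is not the exact MC-GAT configuration.

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, dim_model=64, heads=4, dim_ffn=256):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(dim_model, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim_model, dim_ffn), nn.GELU(),
                                 nn.Linear(dim_ffn, dim_model))
        self.norm1 = nn.LayerNorm(dim_model)
        self.norm2 = nn.LayerNorm(dim_model)

    def forward(self, s):
        a, _ = self.mhsa(s, s, s)   # multi-head self-attention sub-layer
        s = self.norm1(s + a)       # residual connection + layer normalization
        f = self.ffn(s)             # position-wise feed-forward sub-layer
        return self.norm2(s + f)    # second residual connection + layer normalization

tokens = torch.randn(1, 82, 64)
print(EncoderBlock()(tokens).shape)  # torch.Size([1, 82, 64])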

2.2 Proposed methods

The architecture of the proposed MC-GAT is demonstrated in Fig. 3. MC-GAT consists of two parts: a generator and a discriminator. At first, a CCA hyperspectral pathological image is split into fixed-size cubes. Denote these real samples as $\boldsymbol U^{real} = {\left \lbrace \boldsymbol u^{real}_1, \ldots, \boldsymbol u^{real}_i, \ldots, \boldsymbol u^{real}_n\right \rbrace } \in {\Re ^{ b\times w \times h \times n}}$ and the corresponding one-hot label vectors as $\boldsymbol l_i \in \{[0, 1], [1, 0]\}$, where $b$, $w$, $h$, and $n$ indicate the number of bands, the width, the height, and the number of HSI cubes, respectively. In the training phase, for a real sample $\boldsymbol u^{real}_i$, the corresponding label vector $\boldsymbol l_i$ together with a noise vector $\boldsymbol z$ is input to the generator. Through a series of upsampling and transformer encoder operations, the implicit probability distribution of the generator produces fake data $\boldsymbol u^{fake}_{i}$. $\boldsymbol u^{real}_i$ and $\boldsymbol u^{fake}_i$ are mixed and sent to the discriminator. In the discriminator, a multi-layer collaborative transformer encoder is designed to optimize the deep features by adaptively fusing output features from different layers. At the end, two fully connected layers are utilized to classify the authenticity and malignancy of the samples. The structures of the generator and discriminator are described in detail in Subsections 2.2.1 and 2.2.2, respectively.

Fig. 3. The flowgraph of the designed MC-GAT.

2.2.1 Generator

Because the computational complexity grows quadratically with the spectral resolution [38], it is infeasible to apply transformers directly at full spectral resolution to generate fake samples. Therefore, a strategy of iteratively up-scaling the spectral resolution is adopted to reduce the computational overhead. As shown in Fig. 4, a transformer-based generator is proposed, which contains an initialization stage and three up-scale stages. Denote $\boldsymbol u_{i}^{init}$, $\boldsymbol u_{i}^{stage1}$, $\boldsymbol u_{i}^{stage2}$, and $\boldsymbol u_{i}^{stage3}$ as the outputs of the initialization stage, the first up-scale stage, the second up-scale stage, and the third up-scale stage, respectively. At the initialization stage, to ensure that the generated samples and the real samples have the same labels, a label vector $\boldsymbol l_i \in {\Re ^{l}}$ and a vector $\boldsymbol {z} \in {\Re ^{z}}$ of Gaussian noise are input to the model and then concatenated into $\boldsymbol {z}^{c} \in {\Re ^{l+z}}$. Next, a multi-layer perceptron (MLP) is applied to map $\boldsymbol {z}^{c}$ into a high-dimensional vector $\boldsymbol {z}^{h} \in {\Re ^{b_0 m_0}}$. $\boldsymbol {z}^{h}$ is unflattened into an initial representation $\boldsymbol u_{i}^{init} \in {\Re ^{b_0 \times m_0}}$, where $b_0$ is the number of initial bands, and $m_0$ is the number of pixels in each band. Then, $\boldsymbol u_{i}^{init}$ is input to three sequential up-scale stages, which increase the spectral resolution. The first stage consists of an upsampling module and a transformer encoder. The upsampling module reshapes $\boldsymbol u_{i}^{init}\in {\Re ^{b_0 \times m_0}}$ to $\boldsymbol u_{i}^{stage1} \in {\Re ^{v_{1}b_0 \times \frac {m_0}{v_{1}}}}$, where $v_{1}$ is the magnification of the spectral resolution. After that, a transformer encoder is introduced to calculate the correspondence between every two bands, and it transforms disordered sequence vectors into ordered sequence vectors. This operation enhances the authenticity of $\boldsymbol u_{i}^{stage1}$. Analogously, the next two up-scale stages increase the spectral resolution by $v_{2}$ and $v_{3}$ times, which forms $\boldsymbol u_{i}^{stage2} \in {\Re ^{v_{1}v_{2}b_0 \times \frac {m_0}{v_{1}v_{2}}}}$ and $\boldsymbol u_{i}^{stage3} \in {\Re ^{v_{1}v_{2}v_{3}b_0 \times \frac {m_0}{v_{1}v_{2}v_{3}}}}$, respectively. Finally, we unflatten $\boldsymbol u_{i}^{stage3}$ to $\boldsymbol u_{i}^{fake} \in {\Re ^{ b\times w \times h}}$, where $b=v_{1}v_{2}v_{3}b_0$, and $wh = \frac {m_0}{v_{1}v_{2}v_{3}}.$
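The following is a minimal sketch of this iterative up-scale idea, assuming toy sizes (b0 = 5 initial bands, m0 = 972 pixels, magnifications 2, 2, and 3, so that the output matches 60-band cubes of 9x9 pixels); it is not the exact MC-GAT generator.

import torch
import torch.nn as nn

class UpScaleStage(nn.Module):
    """Reshape (batch, bands, pixels) -> (batch, v*bands, pixels/v), then refine
    the up-scaled band sequence with a transformer encoder layer."""
    def __init__(self, pixels_in, v):
        super().__init__()
        assert pixels_in % v == 0
        self.v = v
        self.encoder = nn.TransformerEncoderLayer(d_model=pixels_in // v, nhead=1,
                                                  activation="gelu", batch_first=True)

    def forward(self, u):
        batch, bands, px = u.shape
        u = u.reshape(batch, bands * self.v, px // self.v)  # up-scale the spectral resolution
        return self.encoder(u)

# Assumed toy sizes: b0=5 initial bands, m0=972 pixels, magnifications (2, 2, 3),
# giving 60 bands of 9x9 = 81 pixels, matching the dataset's 60 spectral channels.
b0, m0, label_dim, noise_dim = 5, 972, 2, 64
mlp = nn.Sequential(nn.Linear(label_dim + noise_dim, 256), nn.ReLU(), nn.Linear(256, b0 * m0))
stages = nn.ModuleList([UpScaleStage(m0, 2), UpScaleStage(m0 // 2, 2), UpScaleStage(m0 // 4, 3)])

l_i = torch.tensor([[0., 1.]])     # one-hot label l_i
z = torch.randn(1, noise_dim)      # Gaussian noise z
u = mlp(torch.cat([l_i, z], dim=1)).reshape(1, b0, m0)   # initialization stage
for stage in stages:               # three sequential up-scale stages
    u = stage(u)
fake = u.reshape(1, 60, 9, 9)      # fake HSI cube u_i^fake
print(fake.shape)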

Fig. 4. Overview illustration of the proposed generator.

2.2.2 Discriminator

Figure 5 gives an overview illustration of the proposed discriminator. The discriminator takes real HSI cubes and the corresponding generated cubes as input. For a sample $\boldsymbol u_{i} \in \left \lbrace \boldsymbol u_{i}^{fake}, \boldsymbol u_{i}^{real}\right \rbrace$, the linear flatten layer first reshapes $\boldsymbol u_{i} \in {\Re ^{ b\times w \times h}}$ to $\boldsymbol u_{i} \in {\Re ^{ b\times (wh)}}$, which makes $\boldsymbol u_{i}$ match the input shape of the subsequent multi-layer collaborative transformer encoder. In the next phase of forward propagation, several encoders extract features one by one. Denote $\boldsymbol {f}_{i} \in {\Re ^{ (b+1)\times (wh)}, i \in \left \lbrace 1,2,\ldots,n\right \rbrace }$ as the output features of the $i$-th encoder, where $n$ is the number of encoders and $b+1$ indicates that a classification token is concatenated. The ability of feature representation becomes limited as the network layers deepen [33]. However, $\boldsymbol f_{i}$ contains more and more high-level semantic information as the encoder index $i$ grows. These low-level to high-level features contain potential progressive relations, which can be regarded as a sequence. This reveals potentially effective features to excavate. Therefore, the multi-layer collaborative transformer encoder is designed to capture these relations and enhance information exchange between encoders. The multi-layer collaborative transformer encoder employs a bypass connection, which regards the features from different layers as an input sequence. Then, an MHSA module is applied to adaptively obtain the collaborative features $\boldsymbol {f}^{col}$. These procedures can be expressed as

$$\boldsymbol {f}_{1}^{'}, \boldsymbol {f}_{2}^{'},\ldots, \boldsymbol {f}_{n-1}^{'},\boldsymbol {f}_{n}^{'} = \operatorname{transformer}( \boldsymbol {f}_{1}, \boldsymbol {f}_{2},\ldots, \boldsymbol {f}_{n-1},\boldsymbol {f}_{n} )$$
$$\boldsymbol f^{col} = [ \boldsymbol {f}_{1}^{'}, \boldsymbol {f}_{2}^{'},\ldots, \boldsymbol {f}_{n-1}^{'}, \boldsymbol {f}_{n}^{'}]$$
where $\operatorname {transformer}(\bullet )$ denotes a transformer encoder operation, $[\bullet ]$ is a concatenation operation, and $\boldsymbol f^{col} \in {\Re ^{n\times [(b+1)(wh)]}}$ denotes the collaborative features. Next, a 1D convolution operation is performed to reduce the number of channels of $\boldsymbol f^{col}$. $\boldsymbol f^{col}$ is added to $\boldsymbol f_{n}$, and the fused features are output. Finally, the classification token is coupled to two fully connected layers for two classification tasks: real or fake, and cancer or normal. The precise construction of MC-GAT is shown in Table 1.
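As a minimal sketch of Eqs. (12) and (13) together with the 1D-convolution fusion described above, the snippet below treats the per-layer features as a length-n sequence; the feature size (b+1)(wh) = 61 x 81 and n = 5 encoders are illustrative assumptions.

import torch
import torch.nn as nn

n_layers, feat_dim = 5, 61 * 81
layer_feats = [torch.randn(1, feat_dim) for _ in range(n_layers)]   # f_1, ..., f_n

# Treat the per-layer features as a length-n sequence and let self-attention mine
# the progressive relations among shallow-to-deep encoders.
collab_encoder = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=1, batch_first=True)
seq = torch.stack(layer_feats, dim=1)     # shape (1, n, feat_dim)
f_prime = collab_encoder(seq)             # f'_1, ..., f'_n

# A 1D convolution reduces the n layer "channels" back to one before fusing with f_n.
reduce = nn.Conv1d(in_channels=n_layers, out_channels=1, kernel_size=1)
f_col = reduce(f_prime).squeeze(1)        # collaborative features, shape (1, feat_dim)
fused = f_col + layer_feats[-1]           # add to the deepest features f_n
print(fused.shape)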

Fig. 5. Overview illustration of the multi-layer collaborative discriminator.


Table 1. The detailed construction of MC-GAT

2.2.3 Optimization objective

In the proposed MC-GAT, the generator takes label $\boldsymbol l_i$ and Gaussian noise $\boldsymbol z$ as input, and it generates fake sample $\boldsymbol u^{fake}_i$ by the implicit probability $p_G(\boldsymbol l_i, \boldsymbol z)$. This can be expressed mathematically by

$$p_G(\boldsymbol l_i, \boldsymbol z) = p(G(\boldsymbol l_i, \boldsymbol z)=\boldsymbol u^{fake}_i)$$
in which $G(\bullet )$ is the implicit generator function. For each input sample $\boldsymbol u_i$, the discriminator calculates a probability over sources and a probability over the class labels by implicit estimating function $D(\bullet )$. This is formulated by
$$p_D(src_i|\boldsymbol u_i), p_D(cls_i|\boldsymbol u_i) = D(\boldsymbol u_i)$$
where $src_i$ and $cls_i$ are the source and label of sample $\boldsymbol u_i$, respectively.

The discriminator is trained to maximize the probability of correctly differentiating samples (in both authenticity and malignancy), which can be written as the following optimization function:

$$\begin{aligned} \max _{D} L(D) & =\sum_{i=1}^{n} p_D(src_i=real|\boldsymbol u_i^{real}) + \sum_{i=1}^{n}p_D(cls_i=\boldsymbol l_i|\boldsymbol u_i^{real}) \end{aligned}.$$

The generator is trained for two purposes: (1) confuse the discriminator by reducing its ability to identify the authenticity of samples; (2) improve its ability to identify the malignancy of samples. The optimization function can be written as

$$\min _{G} L(G) = \sum_{i=1}^{n} p_G(\boldsymbol l_i, \boldsymbol z) p_D(src_i=fake|\boldsymbol u_i^{fake}) - \sum_{i=1}^{n}p_G(\boldsymbol l_i, \boldsymbol z)p_D(cls_i=\boldsymbol l_i|\boldsymbol u_i^{fake}).$$

Therefore, the following two-player minimax game with objective function $L(D, G)$ is played by the generator and the discriminator:

$$\begin{aligned} \min _{G} \max _{D} L(D, G) & =\sum_{i=1}^{n} p_D(src_i=real|\boldsymbol u_i^{real}) + \sum_{i=1}^{n}p_D(cls_i=\boldsymbol l_i|\boldsymbol u_i^{real}) +\\ & \quad\sum_{i=1}^{n} p_G(\boldsymbol l_i, \boldsymbol z) p_D(src_i=fake|\boldsymbol u_i^{fake}) - \sum_{i=1}^{n}p_G(\boldsymbol l_i, \boldsymbol z)p_D(cls_i=\boldsymbol l_i|\boldsymbol u_i^{fake}) \end{aligned}.$$
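As one common surrogate for Eqs. (16)–(18), the sketch below scores a two-head discriminator (a source head and a class head) with cross-entropy losses, which matches the cross-entropy training mentioned in Section 3.1; the logit shapes, the inclusion of the fake-source term in the discriminator loss, and the non-saturating generator term are assumptions rather than the exact MC-GAT formulation.

import torch
import torch.nn as nn

ce = nn.CrossEntropyLoss()

def discriminator_loss(src_real_logits, cls_real_logits, src_fake_logits, labels):
    real_src = torch.zeros(len(src_real_logits), dtype=torch.long)   # source id 0 = real
    fake_src = torch.ones(len(src_fake_logits), dtype=torch.long)    # source id 1 = fake
    # correct source on real and fake samples + correct malignancy on real samples
    return ce(src_real_logits, real_src) + ce(cls_real_logits, labels) + ce(src_fake_logits, fake_src)

def generator_loss(src_fake_logits, cls_fake_logits, labels):
    real_src = torch.zeros(len(src_fake_logits), dtype=torch.long)
    # fool the source head while keeping the fake sample consistent with its conditioning label
    return ce(src_fake_logits, real_src) + ce(cls_fake_logits, labels)

src_r, cls_r = torch.randn(8, 2), torch.randn(8, 2)   # placeholder discriminator outputs
src_f, cls_f = torch.randn(8, 2), torch.randn(8, 2)
labels = torch.randint(0, 2, (8,))
print(discriminator_loss(src_r, cls_r, src_f, labels), generator_loss(src_f, cls_f, labels))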

2.3 Data description

The Multidimensional Choledoch database was constructed by the Shanghai Key Laboratory of Multidimensional Information Processing, China [39]. The choledoch tissues were collected by Changhai Hospital, Shanghai, China with the approval of the ethics committee. Every choledoch tissue is stained with hematoxylin and eosin, and the slide thickness is 10 microns. The database contains 880 scenes of HSIs and corresponding RGB images. Among these multidimensional scenes, 690 scenes from 125 patients contain partial cancer areas, 48 scenes from 14 patients are filled with cancer areas, and 142 scenes from 35 patients contain no cancer areas. Each HSI has a spatial size of 1280 $\times$ 1024 pixels with 60 spectral channels in the wavelength range from 550 to 1000 nm. The magnification of the objective lens is $\times 20$.

3. Experimental results and analysis

3.1 Experimental setup

1) Evaluation Metrics: The classification performance of each model is evaluated quantitatively in terms of four commonly used indices, i.e., overall accuracy (OA), area under curve (AUC), precision, and recall. In addition, the classification maps generated by different models are visualized to allow for a qualitative comparison.
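For reference, the four indices can be computed with the minimal scikit-learn sketch below; y_true and y_score are placeholders for the test labels and the predicted cancer probabilities.

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, precision_score, recall_score

y_true = np.array([0, 1, 1, 0, 1])              # 0 = normal, 1 = cancer (placeholder labels)
y_score = np.array([0.2, 0.9, 0.6, 0.4, 0.3])   # placeholder predicted probabilities
y_pred = (y_score >= 0.5).astype(int)

print("OA       ", accuracy_score(y_true, y_pred))
print("AUC      ", roc_auc_score(y_true, y_score))
print("precision", precision_score(y_true, y_pred))
print("recall   ", recall_score(y_true, y_pred))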

2) Comparison With Several State-of-the-Art Algorithms: The following comparison experiments exploit a variety of deep neural networks: 2DCNN [40], 3DCNN [41], Contextual Deep CNN (CDCNN) [42], Vision Transformer (ViT) [27], SpectralFormer (SpecT) [22], Dense Discriminator (Dense-D), Multi-layer Collaborative Discriminator (MCD), and the proposed MC-GAT. To prove the efficacy of the proposed multi-layer collaborative transformer encoder, Dense-D is implemented, in which the output features from the transformer encoders are densely connected as in Fig. 6. MCD is the discriminator of MC-GAT, which uses the multi-layer collaborative transformer encoder; only one of the two classification heads is used in the training and prediction processes. For MC-GAT, the spatial size of input cubes, the generator learning rate, the discriminator learning rate, and the encoder depth are important parameters, and their optimal values are determined experimentally. For a fair comparison, the optimal hyperparameter values of all methods are optimized by 5-fold cross-validation to obtain the best performance.

Fig. 6. The detailed encoder architecture in the Dense Discriminator.

3) Implementation Details: In all experiments, the scenes of CCA pathological HSIs that contain partial cancer areas are chosen. Training and test data are strictly separated with respect to patients. Taking a training set including one patient and a test set including all remaining patients as an example, the experimental workflow is described in Fig. 7. At first, the $i$-th ($i \in \left \lbrace 1,2,\ldots,125\right \rbrace$) patient is randomly selected, and a CCA pathological HSI is randomly selected from this patient. From the selected HSI, training pixels and validation pixels from each class are randomly selected to construct cube samples, which form a training set and a validation set, respectively. After the training process is finished, all CCA pathological HSIs from the remaining patients are used for testing. For each HSI in the test set, the labels of all pixels are predicted, and a classification map is obtained. In the training procedure, the cross-entropy loss function is used for each model, and the number of epochs and the batch size are set as 10,000 and 512, respectively. For optimization, the Adam [43] optimizer is used in all experiments. All models are built using Python 3.7. Networks are trained with the Python-based PyTorch [44] library on a desktop workstation with two Intel Xeon 3.6 GHz CPUs, 256 GB of memory, and six NVIDIA RTX TITAN GPUs.
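The strict patient-wise separation can be sketched as follows; the scene-to-patient mapping here is a random placeholder, since the real assignment comes from the database metadata.

import numpy as np

scene_ids = np.arange(690)                          # scenes containing partial cancer areas
patient_ids = np.random.randint(0, 125, size=690)   # placeholder scene-to-patient assignment

rng = np.random.default_rng(0)
train_patient = rng.choice(np.unique(patient_ids), size=1)      # one randomly selected patient
train_idx = scene_ids[np.isin(patient_ids, train_patient)]      # scenes used for training
test_idx = scene_ids[~np.isin(patient_ids, train_patient)]      # all scenes of remaining patients
assert set(patient_ids[train_idx]).isdisjoint(patient_ids[test_idx])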

Fig. 7. The workflow of all the experiments.

3.2 Parameter analysis

In order to conduct a comprehensive investigation of the proposed MC-GAT, we analyze some key factors including the spatial size of input cubes, the learning rate, and the depth of the multi-layer collaborative transformer encoder. The spatial size of input cubes not only influences the amount of spatial information but also affects the complexity of the model. As for the learning rate, a learning rate that is too large can cause the model to converge too quickly to a suboptimal solution, whereas one that is too small can cause the convergence process to get stuck. As the depth of the multi-layer collaborative transformer encoder increases, the model becomes prone to overfitting. Thus, these parameters are investigated to achieve better classification accuracies. In all experiments, the data splitting strategies are shown in Table 2.


Table 2. The data splitting strategies of all experiments in Subsection 3.2

The spatial size of input cubes is selected from the set $\left \lbrace 3, 5, 7, 9, 11\right \rbrace$. As for the learning rate, we search for the optimum from $\left \lbrace 1 \times 10^{-5}, 1 \times 10^{-4}, 1 \times 10^{-3}, 1 \times 10^{-2} \right \rbrace$ for both the generator and the discriminator. The depth of the multi-layer collaborative transformer encoder is selected from the set $\left \lbrace 3,4,5,6,7 \right \rbrace$. The results of these experiments are demonstrated in Fig. 8.
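The search itself amounts to a simple grid evaluation, sketched below; train_and_eval is a hypothetical helper standing in for one full training run that returns the validation OA.

from itertools import product

spatial_sizes = [3, 5, 7, 9, 11]
learning_rates = [1e-5, 1e-4, 1e-3, 1e-2]
depths = [3, 4, 5, 6, 7]

def train_and_eval(size, lr_g, lr_d, depth):   # hypothetical placeholder for one training run
    return 0.0                                 # would return the validation OA

best = max(product(spatial_sizes, learning_rates, learning_rates, depths),
           key=lambda cfg: train_and_eval(*cfg))
print(best)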

Fig. 8. Parameter analysis on the Multidimensional Choledoch database. (a) OAs of MC-GAT under different spatial sizes of cubes. (b) OAs with different learning rates. (c) OAs with different depths of the multi-layer collaborative transformer encoder.

As illustrated in Fig. 8(a), it is apparent that increasing the spatial size of cubes improves classification performance significantly. This is because a bigger spatial size contains more spatial information, which strengthens the discriminative ability of the cubes. Considering memory footprint and classification performance, the spatial size is set to 9.

As shown in Fig. 8(b), the discriminator learning rate and the generator learning rate are both set to $1 \times 10^{-4}$. For the generator, a learning rate that is too small leads to inadequate learning of the implicit probability distribution $P_G(\boldsymbol l_i, \boldsymbol z)$, and training instability occurs when the learning rate is too large. For the discriminator, a small learning rate makes it harder for the model to acquire high-level features, while a learning rate that is too large results in model divergence as well as vanishing and exploding gradient phenomena [45].

As shown in Fig. 8(c), it is obvious that the best depth of the multi-layer collaborative transformer encoder is 5. This is because too shallow an encoder limits the representational capacity, while too deep an encoder is prone to overfitting [46] and its parameters are difficult to train.

3.3 Analysis of generated samples

As described in Eq. (18), the proposed MC-GAT plays a two-player minimax game. The better the generator learns the implicit probability distribution $P_G(\boldsymbol l_i, \boldsymbol z)$, the closer the synthetic fake samples are to real samples. Thus, some experiments are performed to analyze the generative ability. At first, the proposed MC-GAT is trained with the data splitting strategy in Table 3. Next, the trained generator produces the same number of fake samples as training samples. Since the spatial size (9$\times$9) of the samples is too small to distinguish visually, the averaged spectral radiances of these fake samples at different epochs are illustrated in Fig. 9.

Fig. 9. Visualization of synthetic fake samples with different epochs. Comparisons between generated cancer samples and real cancer samples are shown on the top, while comparisons between generated normal samples and real normal samples are shown on the bottom.


Table 3. The data splitting strategy of the experiment in Subsection 3.3

As shown in Fig. 9, at the initialization of the generator, the generated samples are irregular. As training continues, the generated samples become more and more similar to real samples. It can be concluded that, in the game with the discriminator, the generator gradually learns the implicit probability $P_G(\boldsymbol l_i, \boldsymbol z)$ that transforms the unstructured noise $\boldsymbol {z}$ and label $\boldsymbol l_i$ into real-looking samples. Finally, after the generator converges, it produces varied samples from $P_G(\boldsymbol l_i, \boldsymbol z)$ and makes the discriminator more robust.

3.4 Analysis of the proposed multi-layer collaborative discriminator

The Grad-CAM [47] technique is exploited to provide visual explanations of the proposed MCD. The data splitting strategy is shown in Table 4. After training is finished, we sum the gradient-weighted feature maps by channel to calculate the contribution weights of different bands and layers in the multi-layer collaborative transformer encoder. The average band contribution weights and layer contribution weights over all test samples are shown in Fig. 10(a) and Fig. 10(b).
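A gradient-weighted contribution score in the spirit of Grad-CAM can be sketched as below; the toy encoder and head, the feature shape (bands x tokens), and the per-band aggregation are assumptions for illustration, not the exact procedure behind Fig. 10.

import torch
import torch.nn as nn

# Toy stand-in for a trained discriminator: an encoder producing (batch, bands, tokens)
# features followed by a linear classification head.
encoder = nn.Linear(81, 81)
head = nn.Linear(60 * 81, 2)

sample = torch.randn(1, 60, 81, requires_grad=True)   # one flattened HSI cube
feats = encoder(sample)                                # features to be explained
logits = head(feats.flatten(1))

# Backpropagate the predicted class score, weight activations by their gradients,
# and sum over the token axis to obtain a per-band contribution weight.
score = logits[0, logits.argmax()]
grads = torch.autograd.grad(score, feats, retain_graph=True)[0]
band_weights = torch.relu((feats * grads).sum(dim=-1))[0]   # length-60 contribution vector
print(band_weights.shape)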

Fig. 10. Visual explanation of the proposed MCD. (a) Contribution weights of different bands. (b) Contribution weights of different layers.


Table 4. The data splitting strategy of the experiment in Subsection 3.4

As shown in Fig. 10(a), the contribution weights differ across bands, and there are clear responses at most wavebands. The reason is that the proposed MCD makes full use of the information at different bands of CCA hyperspectral pathological images. In Fig. 10(b), the contribution weights show stronger responses from Layer 2 to Layer 5. It can be inferred that the multi-layer collaborative transformer encoder excavates potential sequence information from shallow to deep encoders and comprehensively utilizes the output features from different layers.

3.5 Two-dimension embedding analysis

In this subsection, we illustrate the contributions of the multi-layer collaborative transformer encoder and the generator to the classification performance of MC-GAT. The t-distributed stochastic neighbor embedding (t-SNE) [48] technique is used to visualize the deep features of ViT, MCD, and MC-GAT. The data splitting strategy is indicated in Table 5. After training, the deep features of all test samples are visualized in Fig. 11.
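A minimal t-SNE sketch is given below; the features and labels are random placeholders standing in for the discriminator's deep features of the test samples and their classes.

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

features = np.random.randn(500, 64)           # placeholder deep features
labels = np.random.randint(0, 2, size=500)    # 0 = normal, 1 = cancer (placeholder)

embedded = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, cmap="coolwarm", s=5)
plt.show()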

Fig. 11. Visualization of deep features from different models by t-SNE. (a) ViT. (b) MCD. (c) MC-GAT.


Table 5. The data splitting strategy of the experiment in Subsection 3.5

As shown in Fig. 11(a), the deep features from ViT produce a large number of confused points belonging to different classes, since the expressive power of the encoders in ViT becomes limited as the layers deepen. In Fig. 11(b), only a small number of embedded features overlap. The reason is that the multi-layer collaborative transformer encoder in MCD adaptively mines collaborative features from multiple layers, which enhances the compactness of intraclass samples and the separation of interclass data. As for Fig. 11(c), the data points from different classes are almost completely separated, which indicates that, in the adversarial training process, the generator overcomes the insufficiency of training data to a certain extent and enhances the generalization capability of the discriminator.

3.6 Analysis of receiver operating characteristic curve

In this subsection, the receiver operating characteristic (ROC) curves of MC-GAT and several compared algorithms are shown in Fig. 12. The data splitting strategy is depicted in Table 6. After training, ROC curves are constructed over all test samples.
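The ROC construction itself follows the standard scikit-learn recipe sketched below, with placeholder labels and scores.

import numpy as np
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

y_true = np.random.randint(0, 2, size=200)                           # placeholder test labels
y_score = np.clip(y_true * 0.6 + np.random.rand(200) * 0.5, 0, 1)    # placeholder cancer scores

fpr, tpr, _ = roc_curve(y_true, y_score)
plt.plot(fpr, tpr, label=f"classifier (AUC = {auc(fpr, tpr):.2f})")
plt.plot([0, 1], [0, 1], linestyle="--")                              # chance level
plt.xlabel("False positive rate"); plt.ylabel("True positive rate"); plt.legend()
plt.show()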

Fig. 12. Comparisons of ROCs generated by MC-GAT with some state-of-the-art methods on the Multidimensional Choledoch database.


Table 6. The data splitting strategy of the experiment in Subsection 3.6

As shown in Fig. 12, the AUC of MC-GAT is higher than those of the CNN-based methods. This reveals that the multi-layer collaborative transformer encoder not only fully utilizes the information from different layers in the discriminator but also preserves the original spectral sequence relations. Apart from this, the generator learns a probability distribution similar to that of the real data, and it exposes the discriminator to a variety of samples. These superiorities mitigate the overfitting phenomenon and enhance the generalization capability of the discriminator. This indicates that the proposed MC-GAT is capable of assisting pathologists in the early histopathological diagnosis of CCA.

3.7 Comparisons between the state-of-the-art deep learning methods and MC-GAT

In this subsection, the classification performance of MC-GAT is evaluated. The data splitting strategy is depicted in Table 7. Table 8 shows the classification results of different algorithms on the Multidimensional Choledoch database.


Table 7. The data splitting strategy of the experiment in Subsection 3.7


Table 8. Performances of different algorithms on Multidimensional Choledoch database

According to Table 8, it is apparent that all the transformer-based methods attain competitive classification performance in most cases. This is because the transformer retains the original spectral order of CCA hyperspectral pathological images and captures relations between bands over a long distance. MCD achieves better classification results than Dense-D and SpecT, because the multi-layer collaborative transformer encoder adaptively extracts collaborative features from multiple layers, which is beneficial to classification. It should be pointed out that MCD outperforms Dense-D because the simple superposition of outputs from multiple layers results in redundant features, which is detrimental to classification. This also confirms the effectiveness of the proposed transformer encoder. In most cases, MC-GAT achieves better classification results than the other algorithms. This is because, in the adversarial training process, the synthetic fake samples from the CCA pathological HSI increase the number of training samples, which alleviates the insufficiency of training data to a certain extent.

In order to better present the classification results of different methods on the Multidimensional Choledoch database, some classification maps of the different approaches are shown in Fig. 13. Inspecting the corresponding classification maps supports the numerical results. MC-GAT produces more homogeneous areas and smoother classification maps than the other methods for the following two reasons: (1) The generator produces more varied samples for data augmentation, which helps optimize the multi-layer collaborative transformer discriminator. (2) The multi-layer collaborative transformer encoder preserves the original spectral sequence of CCA hyperspectral pathological images and fully utilizes features from different layers by transforming them into multi-layer collaborative features; fusing these features enhances the discriminative ability of the deep features. This indicates that the proposed MC-GAT method holds great potential for the diagnosis of CCA hyperspectral pathological images, especially in the case of insufficient labeled samples.

Fig. 13. The classification maps of the Multidimensional Choledoch database with compared methods: (a) False color image. (b) Ground-truth labels. (c) 2DCNN. (d) 3DCNN. (e) CDCNN. (f) ViT. (g) Dense-D. (h) SpecT. (i) MCD. (j) MC-GAT.

4. Conclusion

In this paper, a multi-layer collaborative generative adversarial transformer called MC-GAT is presented for CCA classification from hyperspectral pathological images. To tackle the issue that traditional deep learning methods require a large number of labeled samples, MC-GAT adopts a generator with an iterative up-scale strategy to enhance the generalization ability of the discriminator. To overcome the spectral sequence distortion caused by CNNs and the decrease in expressive power as transformer encoders deepen, a transformer-based discriminator with a multi-layer collaborative transformer encoder is designed to adaptively capture the original spectral sequence information and acquire relations among layers by enhancing information interaction across different layers of the network. The accuracy, area under the curve, precision, and recall reach 82.64%, 79.59%, 76.74%, and 81.07%, respectively, on the Multidimensional Choledoch database. The experimental results demonstrate that the proposed MC-GAT algorithm performs better than some state-of-the-art methods. It is concluded that the proposed MC-GAT has the capability to aid pathologists in the histopathological analysis of CCA from hyperspectral imagery. Further work will explore using local receptive fields to reduce the computational complexity of MC-GAT as much as possible.

Funding

National Natural Science Foundation of China (42071302); Innovation Program for Chongqing Overseas Returnees (cx2019144); Graduate Research and Innovation Foundation of Chongqing (CYB21060); Higher Education and Research (NVIDIA).

Acknowledgments

The authors would like to thank Prof. Qingli Li of East China Normal University for providing us with the Multidimensional Choledoch database. Thanks to the anonymous reviewers and the associate editor for their insightful comments and suggestions.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are available in Ref. [39].

References

1. J. M. Banales, J. J. Marin, A. Lamarca, P. M. Rodrigues, S. A. Khan, L. R. Roberts, V. Cardinale, G. Carpino, J. B. Andersen, and C. Braconi, “Cholangiocarcinoma 2020: the next horizon in mechanisms and management,” Nat. Rev. Gastroenterol. Hepatol. 17(9), 557–588 (2020). [CrossRef]  

2. X. Zhang, F. Xue, D. Dong, M. Weiss, I. Popescu, H. P. Marques, L. Aldrighetti, S. K. Maithel, C. Pulitano, and T. W. Bauer, “Number and station of lymph node metastasis after curative-intent resection of intrahepatic cholangiocarcinoma impact prognosis,” Ann. Surg. 274(6), e1187–e1195 (2021). [CrossRef]  

3. A. Shoeibi, N. Ghassemi, R. Alizadehsani, M. Rouhani, H. Hosseini-Nejad, A. Khosravi, M. Panahiazar, and S. Nahavandi, “A comprehensive comparison of handcrafted features and convolutional autoencoders for epileptic seizures detection in eeg signals,” Expert Syst. Appl. 163, 113788 (2021). [CrossRef]  

4. F. Shi, L. Xia, F. Shan, B. Song, D. Wu, Y. Wei, H. Yuan, H. Jiang, Y. He, Y. Gao, H. Sui, and D. Shen, “Large-scale screening to distinguish between COVID-19 and community-acquired pneumonia using infection size-aware classification,” Phys. Med. Biol. 66(6), 065031 (2021). [CrossRef]  

5. N. Q. K. Le, T. N. K. Hung, D. T. Do, L. H. T. Lam, L. H. Dang, and T. T. Huynh, “Radiomics-based machine learning model for efficiently classifying transcriptome subtypes in glioblastoma patients from mri,” Comput. Biol. Med. 132, 104320 (2021). [CrossRef]  

6. T. Xia, A. Kumar, M. Fulham, D. Feng, Y. Wang, E. Y. Kim, Y. Jung, and J. Kim, “Fused feature signatures to probe tumour radiogenomics relationships,” Sci. Rep. 12(1), 2173 (2022). [CrossRef]  

7. A. Ibrahim, B. Barufaldi, T. Refaee, T. M. Silva Filho, R. J. Acciavatti, Z. Salahuddin, R. Hustinx, F. M. Mottaghy, A. D. Maidment, and P. Lambin, “Maaspenn radiomics reproducibility score: A novel quantitative measure for evaluating the reproducibility of ct-based handcrafted radiomic features,” Cancers 14(7), 1599 (2022). [CrossRef]  

8. X. Ouyang, J. Huo, L. Xia, F. Shan, J. Liu, Z. Mo, F. Yan, Z. Ding, Q. Yang, and B. Song, “Dual-sampling attention network for diagnosis of covid-19 from community acquired pneumonia,” IEEE Trans. Med. Imaging 39(8), 2595–2605 (2020). [CrossRef]  

9. T. Zhou, K. H. Thung, X. Zhu, and D. Shen, “Effective feature learning and fusion of multimodality data using stage-wise deep neural network for dementia diagnosis,” Hum. Brain Mapp. 40(3), 1001–1016 (2019). [CrossRef]  

10. L. Meng, D. Dong, X. Chen, M. Fang, R. Wang, J. Li, Z. Liu, and J. Tian, “2d and 3d ct radiomic features performance comparison in characterization of gastric cancer: a multi-center study,” IEEE J. Biomed. Health Inform. 25(3), 755–763 (2021). [CrossRef]  

11. S. Wang, Y. Zha, W. Li, Q. Wu, X. Li, M. Niu, M. Wang, X. Qiu, H. Li, and H. Yu, “A fully automatic deep learning system for covid-19 diagnostic and prognostic analysis,” Eur. Clin. Respir. J. 56(2), 2000775 (2020). [CrossRef]  

12. J. P. Villa-Pulgarin, A. A. Ruales-Torres, D. Arias-Garzón, M. A. Bravo-Ortiz, H. B. Arteaga-Arteaga, A. Mora-Rubio, J. A. Alzate-Grisales, E. Mercado-Ruiz, M. Hassaballah, and S. Orozco-Arias, “Optimized convolutional neural network models for skin lesion classification,” CMC-Comput. Mat. Contin. 70(2), 2131–2148 (2022). [CrossRef]  

13. O. Iizuka, F. Kanavati, K. Kato, M. Rambeau, K. Arihiro, and M. Tsuneki, “Deep learning models for histopathological classification of gastric and colonic epithelial tumours,” Sci. Rep. 10(1), 1504 (2020). [CrossRef]  

14. A. Kiani, B. Uyumazturk, P. Rajpurkar, A. Wang, R. Gao, E. Jones, Y. Yu, C. P. Langlotz, R. L. Ball, and T. J. Montine, “Impact of a deep learning assistant on the histopathologic classification of liver cancer,” npj Digit. Med. 3(1), 23–28 (2020). [CrossRef]  

15. G. Lu and B. Fei, “Medical hyperspectral imaging: a review,” J. Biomed. Opt. 19(1), 010901 (2014). [CrossRef]  

16. Q. Wang, J. Wang, M. Zhou, Q. Li, and Y. Wang, “Spectral-spatial feature-based neural network method for acute lymphoblastic leukemia cell identification via microscopic hyperspectral imaging technology,” Biomed. Opt. Express 8(6), 3017–3028 (2017). [CrossRef]  

17. L. A. Courtenay, D. González-Aguilera, S. Lagüela, S. del Pozo, C. Ruiz-Mendez, I. Barbero-García, C. Román-Curto, J. C. nueto, C. Santos-Durán, M. E. C. noso Álvarez, M. Roncero-Riesco, D. Hernandez-Lopez, D. Guerrero-Sevilla, and P. Rodríguez-Gonzalvez, “Hyperspectral imaging and robust statistics in non-melanoma skin cancer analysis,” Biomed. Opt. Express 12(8), 5107–5127 (2021). [CrossRef]  

18. Q. Li, X. He, Y. Wang, H. Liu, D. Xu, and F. Guo, “Review of spectral imaging technology in biomedical engineering: achievements and challenges,” J. Biomed. Opt. 18(10), 100901 (2013). [CrossRef]  

19. L. Sun, M. Zhou, Q. Li, M. Hu, Y. Wen, J. Zhang, Y. Lu, and J. Chu, “Diagnosis of cholangiocarcinoma from microscopic hyperspectral pathological dataset by deep convolution neural networks,” Methods 202, 22–30 (2022). [CrossRef]  

20. B. Hu, J. Du, Z. Zhang, and Q. Wang, “Tumor tissue classification based on micro-hyperspectral technology and deep learning,” Biomed. Opt. Express 10(12), 6370–6389 (2019). [CrossRef]  

21. R. Wang, Y. He, C. Yao, S. Wang, Y. Xue, Z. Zhang, J. Wang, and X. Liu, “Classification and segmentation of hyperspectral data of hepatocellular carcinoma samples using 1-d convolutional neural network,” Cytometry, Part A 97(1), 31–38 (2020). [CrossRef]  

22. D. Hong, Z. Han, J. Yao, L. Gao, B. Zhang, A. Plaza, and J. Chanussot, “Spectralformer: Rethinking hyperspectral image classification with transformers,” IEEE Trans. Geosci. Remote Sens. 60, 1–16 (2022). [CrossRef]  

23. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems (NIPS), vol. 30 (CAI, 2017).

24. Z. Zhou, S. Qiu, Y. Wang, M. Zhou, X. Chen, M. Hu, Q. Li, and Y. Lu, “Swin-spectral transformer for cholangiocarcinoma hyperspectral image segmentation,” in International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), (2021), pp. 1–6.

25. N. Li, J. Xue, and S. Jia, “Spectral context-aware transformer for cholangiocarcinoma hyperspectral image segmentation,” in 2022 the 5th International Conference on Image and Graphics Processing (ICIGP), (ACM, 2022), pp. 209–213.

26. B. Yun, Y. Wang, J. Chen, H. Wang, W. Shen, and Q. Li, “Spectr: Spectral transformer for hyperspectral pathology image segmentation,” arXiv preprint arXiv:2103.03604 (2021).

27. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, and S. Gelly, “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929 (2020).

28. X. Yi, E. Walia, and P. Babyn, “Generative adversarial network in medical imaging: A review,” Med. Image Anal. 58, 101552 (2019). [CrossRef]  

29. B. Hu, Y. Tang, I. Eric, C. Chang, Y. Fan, M. Lai, and Y. Xu, “Unsupervised learning for cell-level visual representation in histopathology images with generative adversarial networks,” IEEE J. Biomed. Health Inform. 23(3), 1316–1328 (2019). [CrossRef]  

30. A. Madani, M. Moradi, A. Karargyris, and T. Syeda-Mahmood, “Semi-supervised learning with generative adversarial networks for chest x-ray classification with ability of data domain adaptation,” in International Symposium on Biomedical Imaging (ISBI), (IEEE, 2018), pp. 1038–1042.

31. A. Lahiri, K. Ayush, P. Kumar Biswas, and P. Mitra, “Generative adversarial learning for reducing manual annotation in semantic segmentation on large scale miscroscopy images: Automated vessel segmentation in retinal fundus image as test case,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (IEEE, 2017), pp. 42–48.

32. M. Frid-Adar, I. Diamant, E. Klang, M. Amitai, J. Goldberger, and H. Greenspan, “Gan-based synthetic medical image augmentation for increased cnn performance in liver lesion classification,” Neurocomputing 321, 321–331 (2018). [CrossRef]  

33. Y. Dong, J. B. Cordonnier, and A. Loukas, “Attention is not all you need: Pure attention loses rank doubly exponentially with depth,” in International Conference on Machine Learning (ICML), (PMLR, 2021), pp. 2793–2803.

34. Y. Tang, K. Han, C. Xu, A. Xiao, Y. Deng, C. Xu, and Y. Wang, “Augmented shortcuts for vision transformers,” in Advances in Neural Information Processing Systems (NIPS), vol. 34 (CAI, 2021).

35. J. D. M. W. C. Kenton and L. K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in Annual Conference of the North American Chapter of the Association for Computational Linguistics : Human Language Technologies (NACCL-HLT), (ACL, 2019), pp. 4171–4186.

36. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (IEEE, 2016), pp. 770–778.

37. J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450 (2016).

38. Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in IEEE International Conference on Computer Vision (ICCV), (IEEE, 2021), pp. 10012–10022.

39. Q. Zhang, Q. Li, G. Yu, L. Sun, M. Zhou, and J. Chu, “A multidimensional choledoch database and benchmarks for cholangiocarcinoma diagnosis,” IEEE Access 7, 149414–149421 (2019). [CrossRef]  

40. B. Liu, X. Yu, P. Zhang, X. Tan, A. Yu, and Z. Xue, “A semi-supervised convolutional neural network for hyperspectral image classification,” Remote Sens. Lett. 8(9), 839–848 (2017). [CrossRef]  

41. Y. Li, H. Zhang, and Q. Shen, “Spectral–spatial classification of hyperspectral imagery with 3d convolutional neural network,” Remote Sens. 9(1), 67 (2017). [CrossRef]  

42. H. Lee and H. Kwon, “Going deeper with contextual cnn for hyperspectral image classification,” IEEE Trans. Image Process. 26(10), 4843–4855 (2017). [CrossRef]  

43. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980 (2014).

44. A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems (NIPS), vol. 32 (CAI, 2019).

45. B. Hanin, “Which neural net architectures give rise to exploding and vanishing gradients?” in Advances in Neural Information Processing Systems (NIPS), vol. 31 (CAI, 2018).

46. H. Li, J. Li, X. Guan, B. Liang, Y. Lai, and X. Luo, “Research on overfitting of deep learning,” in International Conference on Computational Intelligence and Security (ICIS), (IEEE, 2019), pp. 78–81.

47. R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in Proceedings of the IEEE international conference on computer vision, (IEEE, 2017), pp. 618–626.

48. L. Van der Maaten and G. Hinton, “Visualizing data using t-sne,” J Mach. Learn. Res. 9, 11 (2008).




