
Retinal layer segmentation in optical coherence tomography (OCT) using a 3D deep-convolutional regression network for patients with age-related macular degeneration

Open Access

Abstract

Introduction – Retinal layer segmentation in optical coherence tomography (OCT) images is an important approach for detecting and prognosticating disease. Automating segmentation using robust machine learning techniques leads to computationally efficient solutions and significantly reduces the cost of labor-intensive labeling, which is traditionally performed by trained graders at a reading center, sometimes aided by semi-automated algorithms. Although several algorithms have been proposed since the revival of deep learning, eyes with severe pathological conditions continue to challenge fully automated segmentation approaches. There remains an opportunity to leverage the underlying spatial correlations between the retinal surfaces in the segmentation approach. Methods – Some of the traditional methods proposed in the literature can be extended to exploit the three-dimensional spatial context governing retinal image volumes by replacing 2D filters with 3D filters. Toward this purpose, we propose a semantic segmentation algorithm that preserves spatial context, continuity, and anatomical relationships by exploiting the 3D spatial context of the image volumes through 3D filters. We propose a 3D deep neural network capable of learning the surface positions of the layers in retinal volumes. Results – We utilize a dataset of OCT images from patients with age-related macular degeneration (AMD) to assess the performance of our model and provide both qualitative (including segmentation maps and thickness maps) and quantitative (including error metric comparisons and volumetric comparisons) results, which demonstrate that our proposed method performs favorably even for eyes with pathological changes caused by severe retinal diseases. The mean absolute error (MAE) and root mean squared error (RMSE) for patients with a wide range of AMD severity scores (0–11) were within 0.84±0.41 and 1.33±0.73 pixels, respectively, which are significantly better than several other state-of-the-art algorithms. Conclusion – The results demonstrate the utility of extracting features from the entire OCT volume by treating the volume as a correlated entity and show the benefit of utilizing 3D autoencoder based regression networks for smoothing the approximated retinal layers by inducing shape-based regularization constraints.

1. Introduction

Age-related macular degeneration (AMD) is the leading cause of blindness in people over 50 years of age in the Western world [1,2]. Changes to retinal layers assessed using Optical Coherence Tomography (OCT) images provide information related to retinal disease etiology and progression, especially for patients with retinal diseases such as AMD. In this work we focus on Spectral Domain-OCT (SD-OCT) images. Retinal layer segmentation/annotation can reveal disease-relevant structural changes; however, the annotation process is a costly and time-intensive task requiring expert knowledge from trained graders. Automating the process could thus save time and money and would also provide an objective method to assess and quantify disease-induced changes to aid clinical evaluations and research endeavors.

1.1 Related work

Significant advancement in segmentation methods has led to the proposal of several algorithms that annotate retinal layers from SD-OCT images [3–6]. More recently, deep learning based semantic segmentation algorithms have been applied to retinal layer segmentation [1,5–12].

For some of the traditional methods in the literature, the problem of layer segmentation is addressed as a two-step approach: first, the layers are segmented using a state-of-the-art pixel-wise segmentation method; next, post-processing on the pixel-wise surface segmentation results is performed using either level-set [18–21] or graph based methods [4,22,23] to obtain the final smooth and spatially connected surfaces. However, pathologic changes challenge classical methods, which often fail to find an optimal balance between disease-related disruptions and spatial connectivity/smoothness. Among the recently proposed algorithms, [8] shows results only for 2D networks; moreover, its Euclidean squared error based loss function, being the only loss of the network, is susceptible to small errors in the ground truth surface locations. [9] also uses a 2D network and, unlike our proposed network, is not end-to-end trainable but requires a post-processing step after a 2D UNet. [10] uses a 3D method but does not segment the Bruch’s membrane (BM); segmenting the BM is significantly more difficult for eyes with large drusen because of the subtle intensity difference between the BM and the underlying choroidal region in such eyes. [11] combines a CNN with an LSTM and treats the layers as a time series along different columns of B-scans; however, its dataset of normal (10 volumes) and pathological (20 volumes) eyes is not as comprehensive as ours, and the model does not segment the BM. [12] is also a two-step process, with segmentation followed by Gaussian smoothing, and does not report performance on eyes with large drusen or atrophy.

1.1.1 Structured layer surface segmentation from retina OCT using fully convolutional regression networks

Recently, He et al. [13] proposed a network that adds spatial constraints while extracting the deep features from retinal images. The method uses convolutional regression networks to predict retinal surface locations. The method operates independently on 2D B-scans but the authors in [13] do not evaluate the performance in the presence of moderately/severely diseased eyes with AMD pathology. Specifically, they evaluate the performance on eyes with macular edema but not with AMD.

Most of the previously proposed retinal layer segmentation approaches fail in the presence of severe pathological abnormalities, such as large drusen, which manifest as deposits between the retinal pigment epithelium (RPE) layer and the BM layer [14,15]. These changes are the key findings of early and intermediate stage AMD, representative of the disease pathophysiology. Graph search based segmentations often fail in the presence of abruptly rising large drusen (resulting in abrupt ascents and descents of the RPE layer), both because it becomes very difficult to learn gradients between layers that are not smoothly connected and because searching for a neighboring pixel with the lowest connection cost becomes unreliable in such regions [14,15]. These abnormalities alter the normal layer contour seen in the healthy retina and present a challenge to current segmentation methods that we aim to address in this work.

1.2 Brief overview

The objective of our algorithm is to find an optimal balance between maintaining the spatial continuity and smoothness typical of healthy retinal layers while accommodating any changes/disruptions due to disease. To achieve this, we propose and test the following steps. First, we exploit the inherent 3D spatial context of the volumetric OCT data by implementing a 3D UNet [23] via the use of 3D convolutional features [16–18]. This enables the image information from adjacent B-scans to be used in a spatially coherent manner. We compare the performance of 3D versus 2D UNets to identify the advantages in accurately segmenting in the presence of disease. Second, we propose a 3D autoencoder to act as a constraint-inducing regularizer and selectively enforce smoothness on the output generated from the 3D UNet. Due to computational constraints, we train and test our network using blocks of 8 spatially contiguous B-scans. Finally, we use an aggregation network at the end to aggregate all blocks of 8 B-scans to form the entire OCT volume.

Our proposed algorithm learns the locations of layer boundaries directly, producing a continuous, single-pixel-wide layer boundary that enables precise measurements of the distances or volumes between the segmented layers. State-of-the-art segmentation algorithms that generate pixel labels produce labels that are often spatially disjointed and multiple pixels wide, and that require additional processing to precisely segment the retinal layers.

In this paper, we utilize a dataset from a clinical study of AMD that followed patients over 50 years of age with a range of AMD severities with sufficient examples to train a deep learning model and evaluate its performance with test examples. In our proposed work, we show that our algorithm is capable of accurately segmenting the retinal layers in the presence of pathologic abnormalities caused by AMD.

The paper is organized as follows: Section 1 introduces the problem of semantic segmentation specific to the context of retinal image analysis in eyes with AMD and briefly describes the motivation behind our proposed approach and the state-of-the-art methods related to our proposed work. Section 2 presents the method proposed in this paper. Section 3 discusses the Experimental Settings. In Section 4 we discuss the Results and Analysis. Section 5 is the Discussion section where we discuss important aspects of the proposed method and its limitations. Section 6 presents the Conclusion along with our future work goals.

2. Methods

2.1 Proposed work: 3D aggregation regression network (3D AggRegNet) to learn spatial locations of surfaces

In this paper we propose a 3D deep neural network that can directly output surface locations (i.e., boundary between adjacent retinal layers), ensuring the accuracy, continuity, and smoothness of surfaces. Our proposed network consists of two different subnetworks that operate in tandem: (1) 3D UNet, which performs multi-class (retinal layers) voxel labelling of retinal layer surfaces and (2) 3D convolutional-autoencoder, which constrains the output of the 3D UNet and forces it to estimate a smooth contour representing the boundary between retinal layers. In the following subsections, we describe the steps of the proposed network and experiments for assessing performance.

Given an input SD-OCT image, the network estimates the boundary locations for 3 distinct surfaces [i.e., the inner limiting membrane (ILM), the retinal pigment epithelium (RPE), and Bruch’s membrane (BM)]. The overall workflow of the proposed network solution is shown in Fig. 1. Eleven retinal surfaces [i.e., the ILM, retinal nerve fiber layer (RNFL), ganglion cell layer (GCL), inner plexiform layer (IPL), inner nuclear layer (INL), outer plexiform layer (OPL), external limiting membrane (ELM), inner photoreceptor layer (PR1), outer photoreceptor layer (PR2), RPE, and BM] were generated using the Heidelberg proprietary software, with three of these surfaces (ILM, RPE, and BM) manually reviewed and corrected by expert graders at the Wisconsin Reading Center.


Fig. 1. Complete block architectural diagram of the proposed 3D Deep Neural Network with representational training image and label volumes. The column-wise operation produces 11 layers, but only the 3 corrected layers are retained or used.


Fig. 2. Block architectural diagram of the back-end 3D UNet used in the proposed framework [23]. Blue boxes represent the feature maps. The number of channels is denoted above each feature map. The last convolution layer is activated by the softmax function.


Fig. 3. Block architectural diagram of the proposed 3D Auto-encoder module which constrains the spatial locations generated by the 3D UNet. The numbers above the input and output volumes represent the input and output dimensions. The numbers above the feature map volumes in encoder and decoder represents the number of filters / channels in the respective feature map volumes.


The initial part of the network, the 3D UNet, minimizes a categorical cross-entropy loss to label each voxel with the corresponding retinal layer boundary. This step identifies boundary voxels only approximately and assigns a probabilistic likelihood to multiple pixels along an A-scan as belonging to one boundary class. The exact location of the boundary is computed for each class label by performing a column-wise (along A-scans) argmax operation (the argument that gives the maximum value of a target function) on the probability likelihoods. While tuning the 3D UNet to minimize the error of this boundary with respect to the ground truth, we found that the performance of the 3D UNet improved if all 11 retinal surfaces generated from Heidelberg were used for training, instead of only the 3 corrected classes (ILM, RPE, and BM). We hypothesize that the presence of the other layers (even though not manually verified) induces spatial constraints on where the 3 layers (ILM, RPE, and BM) can occur, which in turn improves the prediction accuracy of the 3D UNet. Furthermore, since the 3D UNet has a probabilistic loss, in which it learns the binary probability of whether a layer is present or not, small errors introduced by potentially incorrect positions of the uncorrected layers do not reduce the prediction probability of the corrected layers.
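As a minimal sketch of this column-wise argmax step (assuming the 3D UNet probabilities are held in a NumPy array with the A-scan depth as the first axis and the surface classes as the last axis; the array shapes and names are illustrative, not the actual implementation):

```python
import numpy as np

def surfaces_from_probabilities(probs):
    """Convert per-surface probability maps to surface row positions.

    probs : array of shape (rows, cols, bscans, n_surfaces), where `rows`
            indexes depth along each A-scan and the last axis holds the
            per-surface probability maps.
    Returns an array of shape (cols, bscans, n_surfaces): for each A-scan
    and surface class, the row where that surface is most likely."""
    # argmax along the A-scan (depth) axis picks the single most likely
    # boundary row per column, per B-scan, and per surface class
    return np.argmax(probs, axis=0)

# Example with the block size used later in this work: 496 x 768 x 8 x 3
probs = np.random.rand(496, 768, 8, 3)
surfaces = surfaces_from_probabilities(probs)   # shape (768, 8, 3)
```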

The surfaces obtained from the 3D UNet (working in isolation) can sometimes display a lack of spatial continuity, which can be aggravated further by layer disruptions from drusen-like deposits in eyes with AMD. To address these issues, the surface locations derived from the 3D UNet are then input to the 3D convolutional-autoencoder. The 8 additional uncorrected classes would induce large distance-based mean squared error losses in this second part of the network (the 3D Autoencoder), so we do not use them beyond the 3D UNet. The 3D Autoencoder acts as a constraint-inducing regression network and forces the output of the 3D UNet to be smoothed, acting as a regularizer and minimizing the mean squared error loss between its output and the ground truth surface locations.

2.2 Retinal image dataset

The study participants were patients enrolled in a prospective longitudinal study of dark adaptation in AMD (identifier NCT01352975, www.clinicaltrials.gov) [19] at the National Eye Institute, National Institutes of Health, Bethesda, MD. Patients over 50 years of age with no AMD to intermediate AMD in at least one eye were enrolled, including patients with late AMD in their fellow eye [the presence of choroidal neovascularization and/or central geographic atrophy as defined in the Age-Related Eye Disease Study (AREDS)] [20]. The study was approved by the Institutional Review Board of the National Institutes of Health and followed the tenets of the Declaration of Helsinki. All participants provided written informed consent after the nature and possible consequences of the study were explained.

Multimodal imaging was acquired at each study visit and included color fundus photographs acquired with the TRC-50DX retinal camera (Topcon Medical Systems, Oakland, NJ, USA) and SD-OCT imaging acquired using the Heidelberg Spectralis HRA + OCT system (Heidelberg Engineering Inc, Heidelberg, Germany). Color fundus images were graded by the Wisconsin Reading Center and assigned AMD severity scores of 0 to 11 [21].

SD-OCT imaging consisted of a volumetric macular scan (496 × 768 × 121) spanning 30° horizontally and 25° vertically (corresponding to 1.9 mm × 9 mm × 7 mm, with spacings of 3.87 µm × 11.1 µm × 58.7 µm between the frames of the 3 spatial axes). The data are sparsely sampled along the direction of the B-scans compared to the other 2 directions. Heidelberg SD-OCT imaging uses Automatic Real-time Tracking (ART), based on eye tracking and frame averaging, to reduce noise and improve image quality. Additionally, the randomly distributed noise largely cancels when image averaging is performed over 5 frames depicting the same tracked retinal region. The dataset used in this analysis included SD-OCT imaging consisting of 358 eye volumes derived from 161 unique patients imaged in both the right (OD) and left (OS) eyes, generated from 2 longitudinal study visits. Ideally, we would have 161 × 2 × 2 = 644 volumes if all eyes were imaged at all visits, but several patients did not have both eyes imaged during a particular visit or did not follow up for the subsequent visit, and we eliminated some volumes with poor quality B-scans.

IR&OCT 30° ART scans were exported using Heidelberg Eye Explorer [22] version 1.10.0.0 to create the SD-OCT volumes. Three distinct surfaces segmented in each of the 496 × 768 B-scans (the ILM, RPE, and BM layers) were generated from Heidelberg proprietary software with manual correction by the Wisconsin Reading Center.

From the total dataset of 358 eye volumes, 150 eyes (including the 50 eyes with the largest drusen volumes according to the reading center provided ground-truth data) were chosen for training our proposed model. We tuned the network hyperparameters using 20 eyes that do not overlap with the training or test sets. The training loss did not decrease further when the number of training eyes was increased beyond 150, indicating that the network had seen enough variance during training for its loss to stabilize. We also wanted to evaluate the performance of our algorithm on a large number of unseen test eyes. The remaining 188 eyes (not used during training or hyperparameter tuning and reserved as a hold-out test set; there is no overlap between patients in the train, validation, and test sets) were used for testing our proposed network. The model was trained for 3 epochs with a learning rate of 1e-05.

2.3 3D UNet: segmentation

The initial part of our network is a 3D UNet (Fig. 2) as proposed by Çiçek et al. [23]. The input images and the 3D UNet output have the same volumetric dimensions. The volumetric scans are inherently anisotropic: spatial resolution is finer within the B-scan (x-y) plane and coarser along the z-direction. An isotropic convolutional kernel in voxels (i.e., 3×3×3) therefore extends further in physical dimension along the coarser z-direction than along the finer, in-plane x-y directions. This helps provide more spatial context along the z-direction to obtain spatially consistent, smooth contours across B-scans in a volumetric grid, following the inherent resolution of the data. The kernel is effectively finer within the B-scan (x-y) plane, where data resolution is high, and coarser along the z-direction, where data resolution is low. Interpolating to make the spacings the same across all spatial dimensions could lose data and introduce errors into the dataset, as interpolation has its own disadvantages and does not increase data resolution. A weighted categorical cross entropy based loss function was used to minimize the loss between the network output and the ground truth labels, according to:

$$H(a,b) = -\sum_{x \in X} \sum_{y \in C} w(x,y)\, a(x,y) \log b(x)$$
where w(x, y) denotes the weight corresponding to samples of class y within voxel x, a(x, y) denotes the binary ground truth for class y within voxel x (1 if the sample belongs to class y, 0 otherwise), and b(x) denotes the network-predicted label for voxel x in the 3D training set X, which has C classes. The loss tries to maximize the likelihood of the target class for each training instance. The weights of the 12 classes (11 surfaces and 1 background) were empirically found to be as follows: background: 1.0, ILM: 50.0, RNFL: 25.0, GCL: 25.0, IPL: 25.0, INL: 25.0, OPL: 25.0, ELM: 25.0, PR1: 25.0, PR2: 25.0, RPE: 50.0, BM: 50.0. Since the number of background pixels is much larger than the number of pixels belonging to the retinal layer classes, the background is given a much lower class weight than the other classes. The remaining weights are based on the difficulty of boundary segmentation, with higher weights assigned to the corrected layers.
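A minimal NumPy sketch of this weighted categorical cross-entropy, using the class weights listed above (the function and variable names are illustrative and the array layout is an assumption, not the actual training code):

```python
import numpy as np

# Empirically chosen class weights: background plus the 11 surfaces
CLASS_WEIGHTS = np.array([1.0,                      # background
                          50.0,                     # ILM
                          25.0, 25.0, 25.0, 25.0,   # RNFL, GCL, IPL, INL
                          25.0, 25.0, 25.0, 25.0,   # OPL, ELM, PR1, PR2
                          50.0, 50.0])              # RPE, BM

def weighted_categorical_crossentropy(y_true, y_pred, weights=CLASS_WEIGHTS, eps=1e-7):
    """y_true : one-hot ground truth of shape (..., n_classes);
    y_pred : softmax probabilities of the same shape.
    Implements H(a, b) = -sum_x sum_y w(x, y) a(x, y) log b(x)."""
    y_pred = np.clip(y_pred, eps, 1.0)              # avoid log(0)
    per_voxel = -np.sum(weights * y_true * np.log(y_pred), axis=-1)
    return per_voxel.mean()
```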

Ideally, we would use the entire 496 × 768 × 121 volumetric scan as a single input during training. However, in our initial experiments, we found that the size of the volume exceeded GPU memory capacity (even with large-memory state-of-the-art GPUs such as the NVIDIA TESLA V100 Volta 32 GB), making training on entire volumes infeasible. To address this, we cropped the volume into 496 × 768 × 8 volumetric partitions during both training and testing. Thus, we grouped the B-scans into smaller groups of 8 (a power of 2), so that the corresponding downsampling and upsampling layers (with a sampling factor of 2) in the 3D UNet are able to recover the entire input frames. Since there were 121 B-scans in the OCT volumes, we used the first 15 sets of 8 adjacent B-scans, and for the 16th set we used the last 8 B-scans (from 114 to 121). This approach still utilizes the relations between spatially adjacent B-scans and can capture learnable local feature patterns, in contrast to capturing global features across the entire eye. We aggregate the cropped partitions to generate the output segmentation surface at the end of the pipeline using a 3D Auto Encoder based Aggregation Network, as shown in Fig. 1. This avoids the GPU memory problem because the Aggregation Network no longer carries the row (depth) dimension of the volumes and can therefore assimilate all the sub-volumes. For 3D UNet probability likelihood outputs with dimensions of 496 × 768 × 8 × 3 (with the final 3 denoting the 3 different retinal surfaces), a column-wise argmax operation on the probability likelihoods generates surface locations with dimensions of 768 × 8 × 3, i.e., the rows where the probability of each surface is maximum along the A-scans. After this stage, we neglect the uncorrected outputs for the 8 other layers.
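The blocking scheme described above can be sketched as follows (121 B-scans split into 15 consecutive groups of 8 plus a final group anchored to the end of the volume, i.e., B-scans 114–121; function and variable names are illustrative):

```python
import numpy as np

def split_into_blocks(volume, block=8):
    """Split a (rows, cols, n_bscans) volume into groups of `block` B-scans.

    Returns a list of (start_index, sub_volume) pairs. The last block is
    anchored to the end of the volume, so for 121 B-scans this yields the
    15 consecutive groups plus a final group covering B-scans 114-121."""
    n = volume.shape[2]
    starts = list(range(0, n - block + 1, block))
    if starts[-1] + block < n:
        starts.append(n - block)                 # final, partially overlapping block
    return [(s, volume[:, :, s:s + block]) for s in starts]

vol = np.zeros((496, 768, 121))
blocks = split_into_blocks(vol)                  # 16 blocks of 8 B-scans each
```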

2.4 3D autoencoder: regularization

In our network, the 3D denoising-autoencoder (Fig. 3) acts as a constraint-inducing regularizer that enforces smoothness on the output of the 3D UNet and also corrects the learning of the 3D UNet. The 3D surfaces output by the 3D UNet, representing each retinal layer, are spatially correlated and constrained along all three dimensions (anterior-posterior, temporal-nasal, and superior-inferior). 3D Autoencoders are an ideal solution to simultaneously capture the spatial relationships along all 3 dimensions, mitigate any spatial inconsistencies, and constrain the surface locations output from the 3D UNet. The loss function for the 3D Autoencoder is the mean squared error, as follows (where p is the ground truth surface location, q denotes the predicted locations generated by the 3D Autoencoder, and n denotes the total number of samples or predictions):

$$MSE_{loss} = \frac{1}{n}\sum_{y \in C} \sum_{x = 1}^{n} (p_{xy} - q_{xy})^2$$

2.5 3D RegNet

At this stage of the network (henceforth referred to as 3D RegNet), before the final Aggregation Network, the 3D UNet and the 3D Autoencoder work together to extract features from the input image and ensure an accurate and constrained regularized output from the 3D UNet. The total loss is back-propagated through the entire network during end-to-end training and is defined as follows:

$$Loss_{RegNet} = -\lambda \sum_{x \in X} \sum_{y \in C} w(x,y)\, a(x,y) \log b(x) + (1 - \lambda)\,\frac{1}{n}\sum_{y \in C} \sum_{x = 1}^{n} (p_{xy} - q_{xy})^2$$

The hyperparameter $\lambda$ is a weighting term that balances the contributions of the two losses to the total loss: $\lambda = 0$ implies that only the mse loss is active, and $\lambda = 1$ implies that only the categorical cross-entropy loss is active. For our proposed network and dataset, we tuned $\lambda$ to 0.5, i.e., the segmentation loss for the 3D UNet and the regression loss for the 3D Autoencoder were given equal importance.
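A sketch of the combined RegNet loss with the weighting term λ, reusing the two loss sketches above (λ = 0.5 as tuned for this dataset; names are illustrative):

```python
import numpy as np

def regnet_loss(y_true_labels, y_pred_probs, gt_surfaces, pred_surfaces, lam=0.5):
    """Combined Loss_RegNet: weighted cross-entropy (3D UNet) plus
    mean squared error on surface positions (3D Autoencoder).
    `weighted_categorical_crossentropy` is the sketch from Section 2.3."""
    seg_loss = weighted_categorical_crossentropy(y_true_labels, y_pred_probs)
    mse_loss = np.mean((gt_surfaces - pred_surfaces) ** 2)
    return lam * seg_loss + (1.0 - lam) * mse_loss
```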

2.6 3D AggRegNet

The 3D Autoencoder based 3D Aggregation Network at the end aggregates the groups of 8 spatially adjacent B-scan predictions to form the entire macular volume of 121 B-scan predictions. The inputs to the 3D Aggregation Network are the pixel locations generated by the 3D UNet as constrained by the regularizing 3D Autoencoder (i.e., the output of the 3D RegNet). An mse loss between the predictions of the aggregation network and the ground truth locations provided by the reading center for the 3 corrected layers is minimized at this stage to obtain the final continuous output predictions.
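For completeness, one way the block-wise surface predictions could be reassembled into a full 121-B-scan prediction before the aggregation network refines them is sketched below (the overlapping final block simply overwrites the shared B-scans; this bookkeeping is an assumption for illustration, not the aggregation network itself):

```python
import numpy as np

def assemble_blocks(block_preds, n_bscans=121, block=8):
    """Reassemble per-block surface predictions into a full volume.

    block_preds : list of (start_index, pred) pairs, each pred of shape
                  (cols, block, n_surfaces) as produced per 8-B-scan group.
    Returns an array of shape (cols, n_bscans, n_surfaces)."""
    cols, _, n_surf = block_preds[0][1].shape
    full = np.zeros((cols, n_bscans, n_surf))
    for start, pred in block_preds:
        # later (overlapping) blocks overwrite earlier ones in shared B-scans
        full[:, start:start + block, :] = pred
    return full
```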

3. Experiments

3.1 Algorithm comparison

We compare our final network - the 3D Aggregation-Regression-Network (3D AggRegNet) - to 4 other related state-of-the-art approaches, including 2 preliminary networks that form the basis of the final proposed 3D AggRegNet. We compare our network with recently published retinal image segmentation algorithms that use the OCT Explorer [24], 2D UNets [25–27], and 3D UNets [28]. We compare with:

  • (1) A traditional graph cut based segmentation method, using the publicly available OCT Explorer software: the Iowa Reference Algorithms (Retinal Image Analysis Lab, Iowa Institute for Biomedical Imaging, Iowa City, IA) [29–31]
  • (2) A 2D UNet with 1-pixel-wide surface boundaries obtained by applying a column-wise probability maximum operation to the output of a regular 2D UNet (labeled as ‘2D UNet’ in the tables). For a direct comparison, the 2D UNet has the same type and number of layers as the corresponding 3D UNet, as shown in this paper. The parameters of the layers are also kept exactly the same, and the only difference between the 2D and 3D UNet is that they have 2D and 3D filters, respectively. This assesses the benefit of performing direct 3D convolutions in comparison to aggregating outputs from independently performed 2D convolutions.
  • (3) A 3D UNet with 1-pixel-wide surface boundaries obtained by applying a column-wise probability maximum operation to the output of a regular 3D UNet (labeled as ‘3D UNet’ in the tables). This assesses the performance of optimizing only against a weighted cross-entropy loss function – an output directly attainable with the voxel based classification of the 3D UNet.
  • (4) A constrained 3D UNet having both the categorical cross-entropy (segmentation) loss and the mse regression loss, in order to learn the surface locations, with the back-end autoencoder following the 3D UNet to perform the constrained regularization (labeled as ‘3D RegNet’ in the tables). This network implements a regression solution based entirely on the 3D UNet constrained by the 3D Autoencoder during end-to-end training, but without the final 3D Aggregation Network. The only difference from our final proposed output is that the boundary location is derived immediately after the column-wise maximum operation on the output of the constrained 3D UNet, without invoking the final aggregation network.

3.2 Validation

Quantitative analysis: We compare our proposed approach with other state-of-the-art algorithms using 3 different metrics. (1) RMSE: the square root of the average of the squared differences between the estimated and actual values, $RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (a_i - e_i)^2}$ (where n is the number of samples, $e_i$ is the estimated value, and $a_i$ is the actual value). (2) MAE: the average absolute difference between paired observations expressing the same phenomenon, $MAE = \frac{1}{n}\sum_{i=1}^{n} |a_i - e_i|$. (3) Hausdorff distance: the greatest of all the distances from a point in one set of contours to the closest point in the other set of contours. For cases where the prediction is severely poor in a small number of regions but accurate for most regions, the RMSE would be higher than the MAE and the Hausdorff distance would be high. Each metric thus highlights a different aspect of the results.
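The three metrics can be sketched in Python as follows (the Hausdorff distance here uses SciPy's directed Hausdorff routine applied in both directions on per-B-scan contours; the contour representation is an assumption):

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def rmse(actual, estimated):
    return np.sqrt(np.mean((np.asarray(actual) - np.asarray(estimated)) ** 2))

def mae(actual, estimated):
    return np.mean(np.abs(np.asarray(actual) - np.asarray(estimated)))

def hausdorff(actual_rows, estimated_rows):
    """Symmetric Hausdorff distance between two surface contours of one
    B-scan, each given as a 1D array of row positions per column."""
    cols = np.arange(len(actual_rows))
    a = np.column_stack([cols, actual_rows])
    b = np.column_stack([cols, estimated_rows])
    return max(directed_hausdorff(a, b)[0], directed_hausdorff(b, a)[0])
```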

We also perform a total retinal volume and a retinal pigment epithelium-drusen complex (RPEDC) volume analysis to compare the volumes generated by our segmented contours to the ground truth volumes supplied by the reading center. Ground truth volumes were provided by the reading center rounded to 2 decimal places. The total retinal volume is the volume between the first segmented layer, i.e., the ILM, and the last segmented layer, i.e., the BM. The RPEDC volume is the volume between the penultimate segmented layer, i.e., the RPE, and the final segmented layer, i.e., the BM. We used the Early Treatment Diabetic Retinopathy Study (ETDRS) grid [32] to calculate volumes over specific retinal areas.
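As an illustrative sketch, a volume between two segmented surfaces can be computed from the boundary row positions and the voxel spacings quoted in Section 2.2 (3.87 µm axial, 11.1 µm within a B-scan, 58.7 µm between B-scans); restriction to an ETDRS subfield would additionally mask the A-scans by region, which is omitted here:

```python
import numpy as np

# Voxel spacings from Section 2.2, converted to millimetres
DZ, DX, DY = 3.87e-3, 11.1e-3, 58.7e-3   # axial, within-B-scan, between-B-scan

def volume_between(top, bottom, dz=DZ, dx=DX, dy=DY):
    """Volume enclosed between two surfaces, in mm^3.

    top, bottom : surface row positions of shape (cols, n_bscans),
                  e.g. ILM and BM for total retinal volume, or RPE and BM
                  for the RPEDC volume."""
    thickness_mm = (bottom - top) * dz            # per-A-scan layer thickness
    return float(np.sum(thickness_mm) * dx * dy)  # thickness x en-face pixel area
```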

Qualitative analysis: We present the predictions as 2D B-scans extracted from the predicted volumes and show that the network-predicted layers are very similar to the ground truth. We also perform an en-face thickness map comparison between the reading center RPE-BM contours and the RPE-BM contours generated by our proposed network, showing the high visual similarity between the thickness maps created from the reading center ground truths and those created from our network-predicted layers. We show robust qualitative results for a range of AMD eyes, including eyes affected by large drusen.

4. Results and analysis

The images in the test set exhibit the entire range of AMD severity grades, from 0 to 11, challenging the learning algorithm with variable degrees of pathology. The pathologic severities were binned into 5 different groups, as shown in Table 1. In Table 1 we also show the total retinal volumes (TR-V, in mm3) between the ILM and BM layers and the RPEDC volumes (RPEDC-V, in mm3) between the RPE and BM layers, measured using ground truth data. As we advance through the severity groups, RPEDC-V increases with increasing levels of pathologic change, thus increasingly challenging accurate segmentation.


Table 1. AMD severity grade based grouping

Interrogation of the proposed network is demonstrated in Fig. 4, which compares the performance of the different stages of the algorithm, as well as other state-of-the-art segmentation algorithms, for two representative cases (AMD severity grades of 6 and 7 in the left and right columns, respectively). For ease of visualization of the predicted surfaces, we show 2D B-scan slices from the predicted 3D volumetric labels. The top 2 rows show the OCT B-scan image and the corresponding ground truth segmentations for reference. The 3rd row shows the ILM and RPE contours output by the publicly accessible OCT Explorer software. The 4th row shows the column-wise maximum, single-pixel-wide labels from the output of the 2D UNet, and the 5th row shows the locations where the probabilities are maximum along the columns of the B-scans in the predictions made by the 3D UNet, with obvious failures of both observed in the BM contour. The penultimate 6th row shows the single-pixel-wide output of the 3D RegNet (the 3D UNet with both losses and the back-end autoencoder, trained end-to-end but without the aggregation network), with significant improvement in the BM contour but with some small discontinuities. The last row shows the final predictions made by our proposed 3D AggRegNet, which minimizes the discontinuities and closely represents the ground truth.


Fig. 4. Prediction results compared to state-of-the-art algorithms. Predictions are overlapped onto corresponding SD-OCT B-scans.


Table 2 shows the class-wise and total RMSE, MAE, and Hausdorff distance comparisons between our proposed network and other state-of-the-art retinal layer segmentation approaches.


Table 2. Layerwise RMSE for the Retinal Image Dataset: Grouped by AMD Severity. The lowest error in each column within each group is indicated in bold fonts.

The final proposed 3D AggRegNet in Table 2 maintains the surface smoothness and spatial correlations between B-scans and is thus most representative of actual structural changes occurring in the retina. In Table 2, we can observe the superior quantitative performance of the final proposed 3D AggRegNet compared to most other networks. There are cases where we observe minimally better error values (at sub-pixel levels) for the preliminary 3D RegNet network that forms the basis of the final proposed 3D AggRegNet; however, that preliminary network does not maintain the surface smoothness and spatial correlations. Retinal cellular layers are biologically continuous; while the error is smallest for the RegNet in some cases, the discontinuities within a layer are not biologically plausible. Another goal of segmenting the retinal layers is to extract the volumes between them, and the lack of smoothness and absent regions in the predictions of the preliminary 3D RegNet lead to large, arbitrary volumetric errors. The superiority of the 3D AggRegNet is evident in its ability to minimize error while maintaining the continuity of the retinal layers, so that the resulting segmentations are biologically consistent and the extracted volumes are close to the ground-truth volumes.

1) Thickness maps: qualitative analysis

We assessed the performance of the final algorithm against the ground truth by qualitative inspection of the thickness maps generated using the RPE and BM contours from entire macular volumes. In Fig. 5 below, we show that the en-face thickness maps generated using the ground truth RPE-BM contours have a very similar appearance to the corresponding en-face thickness maps generated using the proposed network-predicted RPE-BM contours. The predicted maps maintain the large areas of elevation while accommodating the smaller elevations that are part of the disease spectrum. This result confirms that the proposed network is able to robustly reproduce the properties of the ground truth contours, both for eyes with low AMD severity scores and for eyes with high AMD severity scores and large drusen. The areas of low and high thickness between the RPE and BM layers are very similar between the reading center-created ground truth contours and the proposed network-predicted contours.
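As a sketch, an en-face RPE-BM (RPEDC) thickness map such as those in Fig. 5 can be derived directly from the two surface position arrays (axial spacing as in Section 2.2; names are illustrative):

```python
import numpy as np

def rpedc_thickness_map(rpe_rows, bm_rows, axial_spacing_um=3.87):
    """En-face RPE-to-BM thickness map in micrometres.

    rpe_rows, bm_rows : surface row positions of shape (cols, n_bscans).
    The returned 2D map can be rendered directly as in Fig. 5."""
    return (bm_rows - rpe_rows) * axial_spacing_um
```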

2) Volumetric errors on AMD eyes: per AMD category:

To validate our model, we also calculated the total retinal volume and the RPEDC volume, using the ETDRS grid, for all the eyes in the test dataset. We evaluated the volumetric differences (in mm3) between the predictions from our model and those from the ground truth contours generated by the Wisconsin Reading Center (as described in the Methods section). Figure 6(a) below shows the violin plots of the volumetric errors of the volumes generated by our segmentation predictions, with respect to the ground truth. In Fig. 6(b) (total retinal volumes (TRV)) and 6(c) (RPEDC volumes), we show Bland-Altman plots between our predicted volumes and the ground truth volumes for severity groups 1–4. For total retinal volume, the limits of agreement were between −0.17 mm3 and 0.16 mm3, and the mean was −0.003 mm3. For RPEDC volumes, the limits of agreement were between −0.11 mm3 and 0.13 mm3, and the mean was 0.011 mm3. The dashed lines in the Bland-Altman plots show the mean values (cyan) and the limits of agreement (yellow) of the volumetric errors per AMD severity group. Figures 6(b) and 6(c) show that the mean volumetric errors are excellent for both the TRVs and the RPEDC volumes. From Fig. 6(c), we observe that, although the limits of agreement for RPEDC volumes are slightly wide relative to the volumes themselves, the majority of samples, which do not have extremely large drusen volumes, have a low error; the cluster near the origin in Fig. 6(c) reflects this. A few outliers with very large drusen volumes produce large errors and skew the RPEDC limits of agreement (i.e., the mean error is low, but a few outliers inflate the standard deviation). The training set provided good generalizability, and the algorithm works well for the vast majority of the test cases. It does fail for some extreme outliers, demonstrated by large errors, such as eyes having exceedingly large RPEDC volumes, owing to the absence of such samples in the training set. The low volumetric measurement errors in the violin plots and the fair limits of agreement in the Bland-Altman analysis indicate that the algorithm can successfully reproduce the measurements obtained via slice-by-slice manual annotation by experts. Thus, the algorithm may provide an accurate, robust, time-efficient alternative to human experts in a clinical setting.


Fig. 5. En-face RPEDC thickness map comparison between 4 different cases. (a), (c), (e), and (g) represent 4 thickness maps created from the ground-truth RPE and BM layers. (b), (d), (f), and (h) represent the 4 corresponding thickness maps created from the proposed network-predicted RPE and BM layers. The ticks on the x and y axes are in mm. The legend in the middle represents absolute thicknesses (widths) between the RPE and BM layers in µm, ranging from a minimum of 0 µm to a maximum of 58 µm.



Fig. 6. (a). Violin plots of Volumetric Errors, (b) and (c) Group-wise Bland-Altman plots of Volumetric Errors: For (1) Total retinal volumes and (2) RPEDC volumes. Group 1: AMDSC-0 to 1 (no AMD), Group 2: AMDSC-2 to 5 (mild AMD), Group 3: AMDSC-6 to 8 (severe AMD), Group 4: AMDSC-9 to 10 (Atrophy), Group 5: AMDSC-11 (Choroidal Neovascularization).


5. Discussion

The technical approach of using 3D filters to capture 3D features from retinal volumes has been exploited only to a limited extent in the domain of retinal image analysis, and this feature makes our proposed algorithm novel and more robust compared to the other state-of-the-art algorithms. Our results show that all the variants of 3D UNets described in this paper perform better than the corresponding 2D UNet and graph based algorithms.

The innovation of adding aggregation networks can be appreciated especially in the resolution of discontinuities in the segmentation boundaries. Analyzing the results of the other referenced networks in Fig. 4, we see that the output surfaces of the 2D UNet, and even the 3D UNet, are 1 pixel wide but are severely discontinuous and do not in any manner preserve or respect the spatial constraints and continuity governing the original retinal layer surfaces. Even with the 3D UNet, abrupt changes to retinal layers caused by disease-induced structural alterations weaken the probabilistic assignments of pixels to the different classes, which further leads to discontinuities primarily in the areas most impacted by pathologic changes, such as areas with large drusen accumulations. Discontinuities along the columns lead to discontinuous retinal layer surfaces, implying that they do not respect the spatial contiguity and constraints of the ground truth surfaces. The use of a 3D Denoising Convolutional Autoencoder leads to results that are superior to state-of-the-art pixel-wise semantic segmentation networks. However, the output of the 3D UNet with the 3D Autoencoder still has very minor spatial discontinuities, owing to the absence of information beyond the 8 neighboring scans during learning of the regression problem, especially in cases where the volumes are impacted by severe pathologies such as large drusen. This points to the need for the specialized 3D Aggregation Network at the end (which accumulates the sets of 8 B-scans into the entire macula of 121 B-scans) to learn surface locations with minimal discontinuities. Especially in the presence of large drusen, i.e., when the RPE layer shows sudden drastic changes or disruptions, the network using both autoencoders has significant advantages over the other networks.

There are limitations, however, that pertain to the accuracy of segmenting the RPE layer and the resulting volume from the RPE to the BM (i.e., the RPEDC). As can be seen with these data, which contain the full spectrum of drusen volumes, the RPEDC volumes themselves are very small, so the predicted RPEDC volume errors are sensitive to even minute pixel-level errors in the segmentations. Because the RPEDC volumes are much smaller than the TRVs, a small segmentation error results in a much larger percentage error in the RPEDC volume approximation than in the TRV approximation. Moreover, as observed from the Bland-Altman analysis (Fig. 6(b) and 6(c)), the algorithm still fails to correctly segment maculae that have extremely large drusen, due to the absence of such eyes in the training set; it generalizes to reduce the error in the majority of eyes and neglects these extreme cases. Having more of such eyes in the dataset and adding them to the training set could help, but eyes with such extreme conditions are generally rare. Ground truth considerations also pose limitations, as inter-grader variability cannot be eliminated, especially during manual segmentation of steep retinal layer contours at the single-pixel level. We do not correct for optical aberrations during acquisition, which could limit accuracy in both the manual ground truth and the algorithm. Since deep learning algorithms are device specific, we do not expect our model to work well on out-of-domain test data, such as Zeiss Cirrus SD-OCT volumes or Swept Source-OCT volumes. We intend to explore domain adaptation algorithms as part of our future work to expand the application of our model.

6. Conclusion

The goal of our work is to approach the retinal OCT segmentation problem with particular attention to the challenges presented by the presence of AMD pathology. For this aim to be accomplished, it is critical that the dataset utilized is representative of the clinical disease and that the segmentation network is targeted to solve the segmentation problem with accurate ground truth reference information. The availability of a large number of training and testing samples, along with abundant numbers of eyes with a range of pathological features (from aged eyes without AMD through the stages of early and intermediate AMD to advanced atrophic AMD), makes our dataset particularly suited to tackling this problem.

The traditional problems encountered when using state-of-the-art pixel based semantic segmentation deep neural networks include loss of spatial continuity within, and loss of spatial correlations between, the predicted surfaces; both are addressed by our proposed segmentation regression network. Our network simultaneously captures and exploits the spatial correlation along all three dimensions (anterior-posterior, temporal-nasal, and superior-inferior). From the qualitative visualizations and the quantitative metrics, we observe that our proposed solution yields results that respect the spatial continuity and correlations of the layers in the original domain.

Given that our work focuses on performance in the presence of pathology of the outer retina, specifically the RPE layer, it is likely to also be suited to other retinal diseases. Future work will explore the ability of this algorithm to generalize to accurately segment retinal layers in the setting of other diseases, such as retinal degenerations. Given the large number of people worldwide affected by AMD and the clinical reliance on OCT for evaluating disease severity and risk of progression, there is an increasing need for tools, such as the network presented herein, that can accurately provide retinal layer metrics.

Funding

National Eye Institute (EY000509).

Acknowledgments

This research was supported by the Intramural Research Program of the NIH, National Eye Institute (EY000509).

Disclosures

The authors declare no conflicts of interest.

Data Availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. D. S. Friedman, B. J. O’Colmain, B. Munoz, et al., “Prevalence of age-related macular degeneration in the United States,” Arch. Ophthalmol. 122(4), 564 (2004). [CrossRef]  

2. A. V. Chappelow and P. K. Kaiser, “Neovascular age-related macular degeneration,” Drugs 68, 1029–1036 (2008). [CrossRef]  .

3. L. Fang, D. Cunefare, C. Wang, R. H. Guymer, S. Li, and S. Farsiu, “Automatic segmentation of nine retinal layer boundaries in OCT images of non-exudative AMD patients using deep learning and graph search,” Biomed. Opt. Express 8(5), 2732–2744 (2017). [CrossRef]  

4. A. Lang, A. Carass, M. Hauser, E. S. Sotirchos, P. A. Calabresi, H. S. Ying, and J. L. Prince, “Retinal layer segmentation of macular OCT images using boundary classification,” Biomed. Opt. Express 4(7), 1133–1152 (2013). [CrossRef]  

5. A. Montuoro, S. M. Waldstein, B. S. Gerendas, U. Schmidt-Erfurth, and H. Bogunović, “Joint retinal layer and fluid segmentation in OCT scans of eyes with severe macular edema using unsupervised representation and auto-context,” Biomed. Opt. Express 8(3), 1874–1888 (2017). [CrossRef]  

6. S. Lou, X. Chen, X. Han, J. Liu, Y. Wang, and H. Cai, “Fast retinal segmentation based on the wave algorithm,” IEEE Access 8, 53678–53686 (2020). [CrossRef]  

7. A. G. Roy, S. Conjeti, S. P. K. Karri, D. Sheet, A. Katouzian, C. Wachinger, and N. Navab, “ReLayNet: retinal layer and fluid segmentation of macular optical coherence tomography using fully convolutional networks,” Biomed. Opt. Express 8(8), 3627–3642 (2017). [CrossRef]  

8. A. Shah, L. Zhou, M. D. Abrámoff, and X. Wu, “Multiple surface segmentation using convolution neural nets: application to retinal layer segmentation in OCT images,” Biomed. Opt. Express 9(9), 4509 (2018). [CrossRef]  

9. Z. Mishra, A. Ganegoda, J. Selicha, Z. Wang, S. R. Sadda, and Z. Hu, “Automated retinal layer segmentation using graph-based algorithm incorporating deep-learning-derived information,” Sci. Rep. 10(1), 9541 (2020). [CrossRef]  

10. A. Shah, M. Abramoff, and X. Wu, “Simultaneous multiple surface segmentation using deep learning,” ArXiv170507142 Cs, vol. 10553, pp. 3–11, 2017.

11. K. Gopinath, S. B. Rangrej, and J. Sivaswamy, “A deep learning framework for segmentation of retinal layers from OCT images,” ArXiv180608859 2022.

12. M. Pekala, N. Joshi, T. Y. A. Liu, N. M. Bressler, D. C. DeBuc, and P. Burlina, “Deep learning based retinal OCT segmentation,” Comput. Biol. Med. 114, 103445 (2019). [CrossRef]  

13. Y. He, A. Carass, Y. Liu, B. M. Jedynak, S. D. Solomon, S. Saidha, P. A. Calabresi, and J. L. Prince, “Structured layer surface segmentation for retina OCT using fully convolutional regression networks,” Med. Image Anal. 68, 101856 (2021). [CrossRef]  

14. S. J. Chiu, J. A. Izatt, R. V. O’Connell, K. P. Winter, C. A. Toth, and S. Farsiu, “Validated Automatic Segmentation of AMD Pathology Including Drusen and Geographic Atrophy in SD-OCT Images,” Investig. Opthalmology Vis. Sci. 53(1), 53 (2012). [CrossRef]  

15. A. Stankiewicz, T. Marciniak, A. Dąbrowski, M. Stopa, P. Rakowicz, and E. Marciniak, “Improving Segmentation of 3D Retina Layers Based on Graph Theory Approach for Low Quality OCT Images,” Metrol. Meas. Syst. 23(2), 269–280 (2016). [CrossRef]  

16. B. Hasani and M. H. Mahoor, “Facial expression recognition using enhanced deep 3D convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 30–40.

17. C.-H. Pham, A. Ducournau, R. Fablet, and F. Rousseau, “Brain MRI super-resolution using deep 3D convolutional networks,” in 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017), 2017, pp. 197–200.

18. G. Riegler, A. Osman Ulusoy, and A. Geiger, “Octnet: Learning deep 3d representations at high resolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3577–3586.

19. J. Flamendorf, E. Agrón, W.T. Wong, D. Thompson, H.E. Wiley, E.L. Doss, S. Al-Holou, F.L. Ferris III, E.Y. Chew, and C Cukras, “Impairments in dark adaptation are associated with age-related macular degeneration severity and reticular pseudodrusen,” Ophthalmology 122(10), 2053–2062 (2015). [CrossRef]  

20. F. L. Ferris, M. D. Davis, T. E. Clemons, et al.,“A simplified severity scale for age-related macular degeneration: AREDS Report No. 18,” Arch. Ophthalmol. 123(11), 1570 (2005). [CrossRef]  

21. T. C. Clemons, R. C. Milton, R. Klein, et al., “Risk factors for the incidence of advanced age-related macular degeneration in the age-related eye disease study (AREDS), AREDS report no. 19,” Ophthalmology 112(4), 533–539.e1 (2005). [CrossRef]  

22. “Heidelberg Engineering GmbH, Spectralis HRA + OCT User Manual Software Version 6.0, 2014.”

23. Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, “3D U-Net: learning dense volumetric segmentation from sparse annotation,” in International conference on medical image computing and computer-assisted intervention, 2016, pp. 424–432.

24. M. Sonka and M. D. Abràmoff, “Quantitative analysis of retinal OCT,” Med. Image Anal. 33, 165–169 (2016). [CrossRef]  

25. I. Z. Matovinovic, S. Loncaric, J. Lo, M. Heisler, and M. Sarunic, “Transfer learning with U-Net type model for automatic segmentation of three retinal layers in optical coherence tomography images,” in 2019 11th International Symposium on Image and Signal Processing and Analysis (ISPA), Dubrovnik, Croatia, Sep. 2019, pp. 49–53. doi: 10.1109/ISPA.2019.8868639.

26. S. Liu, D. P. Shamonin, G. Zahnd, A. F. W. van der Steen, T. van Walsum, and G. van Soest, “Healthy Vessel Wall Detection Using U-Net in Optical Coherence Tomography,” in Machine Learning and Medical Engineering for Cardiovascular Health and Intravascular Imaging and Computer Assisted Stenting, vol. 11794, H. Liao, S. Balocco, G. Wang, F. Zhang, Y. Liu, Z. Ding, L. Duong, R. Phellan, G. Zahnd, K. Breininger, S. Albarqouni, S. Moriconi, S.-L. Lee, and S. Demirci, Eds. (Springer International Publishing, 2019), pp. 184–192.

27. R. Asgari, S. Waldstein, F. Schlanitz, M. Baratsits, U. Schmidt-Erfurth, and H. Bogunović, “U-Net with spatial pyramid pooling for drusen segmentation in optical coherence tomography,” ArXiv191205404 Cs Eess, Dec. 2019.

28. M. X. Li, S. Q Yu, W. Zhang, X. Xu, T. W. Qian, and Y. J. Wan, “Segmentation of retinal fluid based on deep learning: application of three-dimensional fully convolutional neural networks in optical coherence tomography images,” Int. J. Ophthalmol. 12(6), 22 (2019). [CrossRef]  

29. M. D. Abramoff, M. K. Garvin, and M. Sonka, “Retinal Imaging and Image Analysis,” IEEE Rev. Biomed. Eng. 3, 169–208 (2010). [CrossRef]  

30. K. Li, X. Wu, D. Z. Chen, and M. Sonka, “Optimal surface segmentation in volumetric images-a graph-theoretic approach,” IEEE Trans. Pattern Anal. Mach. Intell. 28(1), 119–134 (2006). [CrossRef]  

31. M. K. Garvin, M. D. Abramoff, X. Wu, S. R. Russell, T. L. Burns, and M. Sonka, “Automated 3-D intraretinal layer segmentation of macular spectral-domain optical coherence tomography images,” IEEE Trans. Med. Imaging 28(9), 1436–1447 (2009). [CrossRef]  

32. “Association of elevated serum lipid levels with retinal hard exudate in diabetic retinopathy: early treatment diabetic retinopathy study (ETDRS) Report 22,” p. 6.

