
Sample-efficient deep learning for accelerating photonic inverse design

Open Access

Abstract

Data-driven techniques like deep learning (DL) are currently being explored for inverse design problems in photonics (especially nanophotonics) to deal with the vast search space of materials and nanostructures. Many challenges need to be overcome to fully realize the potential of this approach; current workflows are specific to predefined shapes and require large upfront investments in dataset creation and model hyperparameter search. We report an improved workflow for DL based acceleration of evolutionary optimizations for scenarios where past simulation data is nonexistent or highly inadequate and demonstrate its utility considering the example problem of multilayered thin-film optics design. For obtaining sample-efficiency in surrogate training, novel training loss functions that emphasize a model’s ability to predict a structurally similar spectral response rather than minimizing local approximation error are proposed. The workflow is of interest to extend the ambit of DL based optics design to complicated structures whose spectra are computationally expensive to calculate.

© 2021 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

The continuing progress in nanophotonics [1] in terms of unprecedented fabrication capabilities and an increasing array of novel materials has vastly expanded the design space and has necessitated research into novel computational techniques [2] for inverse design. A plethora of inverse design techniques have been reported [3] and can be grouped into local and global optimizers. Topology optimization [4–6] is a popular local optimization technique; however, local optimizers may get stuck at suboptimal designs, which is especially problematic in non-convex optimization scenarios like nanophotonics [7]. On the other hand, evolutionary techniques [8] are global optimizers but require a large number of fitness evaluations before an acceptable solution can be found. The difficulties in optimization are exacerbated as the dimensionality of the problem increases (the so called “curse of dimensionality” [9]).

To accelerate optimization, a common workflow involves the training of surrogate models [10–12] to approximate the underlying, computationally expensive, physical relationship. The surrogate models are built upon solutions at selected collocation points in the design parameter space. Deep Learning [13] and other data-driven techniques are being actively explored across various science and engineering domains [14,15]. Being universal function approximators [16], Deep Neural Networks (DNNs) could potentially overcome the curse of dimensionality, which makes them attractive for constructing surrogate models in problems exhibiting strong nonlinearities and high dimensionality [17]. Recently, several workers have reported that photonics inverse design optimization problems can be accelerated by DL based surrogate models. Beginning with the seminal work by Peurifoy and coworkers [18], DL has been explored for a broad class of problems [19,20] in integrated photonics, nanoresonators and metasurfaces [3,21–28].

Despite the early progress, several challenges remain before DL can become a versatile and effective tool. Specifically, the accuracy of the DL surrogate is a major determinant of the success of any DL based workflow. However, high model accuracy increases the model development burden [29]. Thus, the cost of generating a training dataset and tuning model hyperparameters must be weighed against the computational savings achieved by surrogate-assisted optimization. Since the set of possible geometries is very large and it is not known a priori whether an optimal design can be found within a particular shape class, the entire surrogate development workflow needs to be repeated for each new shape class. Although generative networks can potentially address the problem of geometry definition [21,25,26], the output of a physical system can be so sensitive to system parameters that it becomes increasingly difficult to obtain a good functional approximation over a large parameter space. There is thus a need for workflows suited to scenarios where training data is inadequate and/or model accuracy is compromised.

In this article, we propose a workflow for surrogate-assisted optimization in the scenario of inadequate training data and low surrogate model accuracy. The proposed approach differs from conventional surrogate-assisted evolutionary optimizations [10] in that the surrogate is used only for reducing the number of exact fitness evaluations at each iteration. We extend our previous work [30,31] on the surrogate-assisted differential evolution (SADE) [32] algorithm with two new contributions: (1) novel training loss functions based on the structural similarity of two spectra; and (2) surrogate models trained on-the-fly with data accumulated during the optimization process. The proposed workflow allows a smooth trade-off between inversion speed, solution quality and model development cost, permitting iterative development. Following this introduction, section 2 describes the inverse design problem, the surrogate-assisted algorithm, the network architecture and training procedure, and the proposed training losses. Section 3 presents experiments comparing the different loss functions and the optimization performance of different configurations, along with a discussion of the results. The paper concludes in section 4.

2. Problem description and optimization methodology

The canonical problem of multilayered thin-film coating design is chosen to demonstrate the efficacy of the proposed workflow. The problem of broadband antireflection coating (ARC) design, in particular, has received extensive attention and a broad range of theoretical and computational techniques [33–40] have been investigated. This is a challenging multi-modal optimization problem with regions of flat fitness [33,37]. Strong mathematical and computational evidence regarding the existence of global optima [34–36] makes this a good benchmark problem and can provide intuition for the application of DL to problems where clear performance bounds are unknown. For normal incidence, two-material systems are known to provide the best ARC designs, and the optimal reflectance achievable depends on the highest and the lowest refractive indices, the total optical thickness and the number of layers [33–36].

The multilayered thin-film coating design problem is a constrained optimization problem where the individual layer thicknesses are to be determined (within specified thickness limits) in order to achieve a reflectance (or transmittance) spectrum that best matches a targeted spectrum [Fig. 1(A)]. The merit of a given $N$-layered design $\mathbf {x} = [t_0, t_1 \cdots t_{N-1}]$ is quantitatively evaluated by sampling the spectral response to create the response vector $\mathbf {y} = [R_0, R_1 \cdots R_{S-1}]$ and is given by: MSE($\mathbf {y}$, $\mathbf {y^{T}}$), where MSE is the mean-squared error function and $\mathbf {y}^{T}$ is the targeted response (sampled exactly at the same wavelengths as $\mathbf {y}$). The optimization problem can be formally stated as:

$$\min_{\mathbf{x} = [t_0, t_1 \cdots t_{N-1}]} f(\mathbf{y}, \mathbf{y^{T}}), \quad \textrm{s.t.} \quad t_{min} \le t_j \le t_{max}, j = 0 \ldots N -1,$$
where $f$ is the MSE function and $t_{min}, t_{max}$ are the constraints on the thickness of each layer. Specifically, we consider a 16-layered thin-film using two materials (silica and titania), restrict the thickness of each layer to lie between 10 nm ($t_{min}$) and 100 nm ($t_{max}$), and fix the wavelength range to 400 nm to 800 nm sampled in 64 steps. For the silica/titania system, the best achievable mean reflectance $R_m$ may be expressed as $\displaystyle R_m=0.1029 d^{-0.975}_{opt}$, where $R_m$ is expressed in percent, $d_{opt}$ is the optical thickness in microns, and no constraints are imposed on the layer thicknesses. For the chosen parameter set, the optimal mean reflectance lies in the 0.1% to 0.2% range [30,31].
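As a concrete illustration, the sketch below (not the authors' code) evaluates this merit function for a candidate 16-layer silica/titania stack using the open-source tmm package cited later in this paper; the dispersionless refractive indices, the low/high layer ordering and the glass substrate are assumptions made for the example.

```python
# Illustrative sketch of the ARC merit function of Eq. (1) using the 'tmm' package.
import numpy as np
import tmm

N_LAYERS = 16
WAVELENGTHS_NM = np.linspace(400, 800, 64)   # 64-point sampling grid
N_LO, N_HI = 1.45, 2.35                      # assumed silica / titania indices (no dispersion)
N_AIR, N_SUB = 1.0, 1.52                     # ambient and substrate (assumed glass)

def reflectance_spectrum(thicknesses_nm):
    """Normal-incidence reflectance sampled at the 64 design wavelengths."""
    # Alternate low/high-index layers between semi-infinite ambient and substrate.
    n_list = [N_AIR] + [N_LO if i % 2 == 0 else N_HI for i in range(N_LAYERS)] + [N_SUB]
    d_list = [np.inf] + list(thicknesses_nm) + [np.inf]
    return np.array([tmm.coh_tmm('s', n_list, d_list, 0, lam)['R']
                     for lam in WAVELENGTHS_NM])

def fitness(thicknesses_nm, target=np.zeros(64)):
    """MSE between the design's reflectance and the target (zero for an ideal ARC)."""
    return np.mean((reflectance_spectrum(thicknesses_nm) - target) ** 2)

x = np.random.uniform(10, 100, N_LAYERS)     # a random feasible design
print(fitness(x))
```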


Fig. 1. Overview of DNN surrogate-assisted Differential Evolution (SADE). A: The goal is to optimize the individual layer thicknesses ($t_0, t_1 \ldots$) of a multilayered thin-film so that its reflectance spectrum closely matches a target. B: The optimization process is a sequence of point estimates; points are widely spread out in the early phase (blue) and converge to subregions in the late phase (purple). Simplified representations of regular differential evolution (DE) (C), SADE with off-line training (D) and SADE with on-line training (F). E: A detailed comparison of a single iteration for DE and SADE. The old individuals (denoted by black circles) undergo a variation step resulting in mutant individuals (shown as blue circles). Fitness evaluations on the mutant individuals then lead to the repopulation phase. Exact fitness evaluations are shown by black vertical lines and approximate fitness evaluations in blue. Color codes are employed as follows: red – wasteful function call, green – saved function call, yellow – false negative error (see text for description).


A pictorial representation of the various aspects of the proposed algorithm is seen in Fig. 1(C)–1(F). The optimization algorithm is performed on a set (population) of candidate solutions (individuals) (16-dimensional vectors in our case). The optimization can be considered as a search in the design parameter space. Figure 1(B) shows a simplified representation of the optimization process which begins with a set of designs distributed throughout the allowed design space (blue circles) and, ideally, ends with a set of designs located in a desirable sub-region (purple circles) by completing a number of point evaluations along the way.

Similar to other evolutionary algorithms, differential evolution (DE) has three key steps: (1) variation; (2) evaluation; and (3) repopulation, which are repeated every iteration until a termination criterion is reached. The most obvious way to use the surrogate is in the evaluation step. However, this could mean that the evolutionary algorithm converges to the global optimum of the approximate fitness function (which may not be the global optimum, or even a near-optimum, of the original fitness function). Therefore, the surrogate model should be used together with the original fitness function [41] to ensure solution quality. There are thus three ways to perform the evaluation phase: (1) exactly [as seen in Fig. 1(C)], (2) approximately and (3) exactly over a reduced set [as seen in Fig. 1(D)]. Figure 1(D) uses a surrogate trained only on historical data. The exact function calls made during the optimization are themselves a valuable data source: they can be used to train a surrogate model that is locally accurate and to use it for preselection, as seen in Fig. 1(F), which is split into two phases. The first phase is a bootstrap phase which may optionally rely on a surrogate if historical data are available. The second phase involves multiple rounds of surrogate retraining. Figure 2 shows the detailed flow chart of SADE using on-line training.


Fig. 2. Flow chart of the surrogate-assisted DE with on-line training [see Fig. 1(F)].


Figure 1(E) shows the detailed pictorial representation of a typical iteration of SADE [30,31] in comparison with the plain DE. If we denote the $j$-th individual at the beginning of the $i$-th iteration as $p_j^{i}$, then the set of designs $\displaystyle p_j^{0}, \quad 0 \le j \le P - 1$ is the initial population. Before beginning the optimizations, a set $\displaystyle F_j^{0}, \quad 0 \le j \le P - 1$ of real numbers called the fitness vector is generated by calling the exact fitness function on this initial population. The fitness vector is updated at the end of every iteration and at the end of the $i$-th iteration is denoted as $\displaystyle F_j^{i}$. Each iteration begins with the mutation phase followed by a crossover phase which results in another set of $P$ individuals called mutants ($j$-th mutant during the $i$-th iteration is denoted by $m_j^{i}$). Another set of real numbers called the mutant fitness vector (denoted as $\displaystyle f_j^{i}, \quad 0 \le j \le P - 1$) is calculated at every iteration by calling the exact fitness function on the set of mutants.

The process of generating the $j$-th mutant involves an intermediate individual generated by the mutation phase as described by: $o_j^{i} = a_j + r_{mut}*(b_j - c_j) \quad \textrm {and}\quad a_j, b_j, c_j \in \{ p_0^{i -1}, p_1^{i -1}, \ldots , p_{P-1}^{i - 1}\} - \{p_j^{i -1 }\}$, where the $a, b, c$ individuals are randomly drawn for each individual and $r_{mut}$ is a hyperparameter of the optimization called the mutation radius. Finally, the $j$-th mutant is obtained using $p_j^{i -1}$ and $o_j^{i}$ in the crossover phase as:

$$m_j^{i}[d] = \begin{cases} o_j^{i}[d], &\textrm{if} \;p >= p_{cross} \\ p_j^{i-1}[d] &\textrm{otherwise} \end{cases}, \quad 0 \le d \le 15.$$

Crossover is an element-wise operation on the individual (which is a 16-dimensional vector) and involves drawing a random number $p$ for each of the elements. This randomly drawn number is compared with $p_{cross}$ (another hyperparameter of the optimization called the crossover probability) to generate the mutant. The details for the selection of appropriate values for the optimization hyperparameters are discussed in a previous paper [31]. The iteration ends with a repopulation phase that results in the next generation of individuals (and an updated fitness vector) and can be summarized as:

$$\begin{cases} p_j^{i} = m_j^{i}, F_j^{i} = f_j^{i} &\textrm{if}\; f_j^{i} > F_j^{i-1} \\ p_j^{i} = p_{j}^{i-1}, F_j^{i} = F_j^{i-1} &\textrm{otherwise} \end{cases}, \quad 0 \le j \le P -1.$$

Note that it is not necessary to store the fitness vectors at each iteration. For $N$ iterations, this results in $(N + 1)*P$ function calls. Note that some of these fitness computations merely confirm that a parent individual stays on in the population. If a surrogate allows us to eliminate such calls, the evaluation count can be reduced without affecting the quality of the obtained optimum.
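For reference, a minimal sketch of one plain-DE generation (the variation and repopulation steps of Eqs. (2) and (3)) is given below. It assumes the fitness is the MSE merit of Eq. (1), so a mutant replaces its parent when its fitness is lower, uses the standard DE convention of taking a mutated component with probability $p_{cross}$, and handles the thickness constraint by simple clipping; all of these are choices made for the sketch.

```python
import numpy as np

def de_generation(pop, fit, fitness_fn, r_mut=0.8, p_cross=0.7,
                  t_min=10.0, t_max=100.0, rng=None):
    """One generation of plain DE for the layer-thickness problem (minimization)."""
    rng = np.random.default_rng() if rng is None else rng
    P, D = pop.shape
    new_pop, new_fit = pop.copy(), fit.copy()
    for j in range(P):
        # Mutation: three distinct individuals, none of them j
        a, b, c = pop[rng.choice([k for k in range(P) if k != j], 3, replace=False)]
        o = np.clip(a + r_mut * (b - c), t_min, t_max)      # keep within thickness bounds
        # Crossover: take the mutated component with probability p_cross
        mutant = np.where(rng.random(D) < p_cross, o, pop[j])
        # Repopulation: exact (expensive) evaluation; keep whichever is fitter
        f_mut = fitness_fn(mutant)
        if f_mut < fit[j]:
            new_pop[j], new_fit[j] = mutant, f_mut
    return new_pop, new_fit
```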

In the evaluation phase of the proposed algorithm, the exact computation of the mutant fitness vector $f_j^{i}$ for all the $P$ members is replaced by exact computation only on a subset of the mutants. To determine the subset for which exact computation is necessary, two auxiliary fitness vectors $g$ (analogous to $f$) and $G$ (analogous to $F$) are calculated using the DNN surrogate. The modified repopulation phase that results in the next generation of individuals (and an updated fitness vector) can be summarized as:

$$ \begin{cases} \begin{cases} p_j^i = m_j^i, F_j^i = f_j^i &\text{if $f_j^i > F_j^{i-1}$} \\ p_j^i = p_{j}^{i-1}, F_j^i = F_j^{i-1} &\text{otherwise} \end{cases} &\text{if $g_j^i > G_j^{i-1}$} \\ p_j^i = p_{j}^{i-1}, F_j^i = F_j^{i-1} &\text{otherwise} \end{cases},$$
where $0 \le j \le P -1$. The final mutant population in a particular iteration can then be selected by comparing the various intermediate fitness vectors.

There are several possible outcomes of this approximation phase, denoted in Fig. 1(E) using color shading: (1) $g_j^{i} < G_j^{i-1}, f_j^{i} < F_j^{i-1}$: green shade, one function call is correctly saved; (2) $g_j^{i} > G_j^{i-1}, f_j^{i} > F_j^{i-1}$: the optimization has advanced correctly, not shaded; (3) $g_j^{i} < G_j^{i-1}, f_j^{i} > F_j^{i-1}$: false negative error, the optimization has been impeded but no unnecessary function calls are made, yellow shade; and (4) $g_j^{i} > G_j^{i-1}, f_j^{i} < F_j^{i-1}$: false positive error, the optimization is not impeded but the function call is not saved, red shade. During a SADE run, it is possible to count the number of false positive errors (marked red) and use it as an indicator of the local approximation error of the surrogate model. The ratio of the number of false positive errors to the total number of fitness evaluation calls is calculated at every step. If this ratio exceeds a certain threshold, fitness evaluations are performed on all the individuals.
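A possible implementation of this surrogate-gated evaluation and repopulation step (a sketch under the same minimization convention as the plain-DE sketch above; the surrogate is assumed to return approximate fitness values for a batch of designs, and the fallback threshold value is illustrative) is:

```python
import numpy as np

def sade_generation(pop, fit, mutants, exact_fn, surrogate_fn, fp_threshold=0.3):
    """Exact calls only for mutants that the surrogate predicts will improve on their parents."""
    P = len(pop)
    G = surrogate_fn(pop)        # approximate fitness of parents
    g = surrogate_fn(mutants)    # approximate fitness of mutants
    n_calls, n_false_pos = 0, 0
    for j in range(P):
        if g[j] < G[j]:                       # surrogate predicts an improvement
            f_mut = exact_fn(mutants[j])      # exact call only for this subset
            n_calls += 1
            if f_mut < fit[j]:
                pop[j], fit[j] = mutants[j], f_mut
            else:
                n_false_pos += 1              # wasteful call (red in Fig. 1(E))
        # else: parent is retained without any exact call (green or yellow case)
    if n_calls and n_false_pos / n_calls > fp_threshold:
        # Surrogate is locally unreliable: fall back to exact evaluation for all mutants
        for j in range(P):
            f_mut = exact_fn(mutants[j])
            if f_mut < fit[j]:
                pop[j], fit[j] = mutants[j], f_mut
    return pop, fit
```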

2.1 DNN architecture

The DNN architecture [30,31] was a feedforward neural network with fully connected layers and one or more hidden layers, each with 128 neurons. Swish activation [42] was used on all but the last layer, where a sigmoid activation was used to bound the output between 0 and 1. Batch normalization and dropout were not found to be beneficial. ADAM and N-ADAM training procedures were found to significantly outperform Stochastic Gradient Descent (SGD). The input is a 16-dimensional vector specifying the thickness of each layer of the multilayer. The output is the reflectance spectrum at normal incidence sampled at 64 points equally spaced in frequency between the frequencies corresponding to minimum and maximum wavelengths of 400 nm and 800 nm, respectively. For generating the training datasets for off-line training, Latin hypercube sampling is performed on a 16-dimensional hypercube such that each layer thickness lies in the (10 nm, 100 nm) range. The influence of the training dataset size, model parameter count and training epochs on the training accuracy of the DNN was reported in our earlier paper [31]. It was found that the dataset size had the largest influence on the training accuracy and that increasing the network complexity (by varying the number of hidden layers) improved training accuracy without significantly improving the generalization ability.
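The sketch below builds a model of the kind described above and generates an off-line training set by Latin hypercube sampling. The exact layer count, regularization constant and dataset size are illustrative, and the code assumes a recent TensorFlow/Keras (where 'swish' is a built-in activation string) and SciPy ≥ 1.7 for scipy.stats.qmc; the original work used TensorFlow 1.13.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers
from scipy.stats import qmc

def build_surrogate(n_hidden=3, width=128, l2=5e-4):
    """Fully connected surrogate: 16 thicknesses in, 64-point reflectance out."""
    inp = keras.Input(shape=(16,))
    x = inp
    for _ in range(n_hidden):
        x = layers.Dense(width, activation='swish',
                         kernel_regularizer=regularizers.l2(l2))(x)
    out = layers.Dense(64, activation='sigmoid')(x)   # reflectance bounded in [0, 1]
    return keras.Model(inp, out)

# Latin hypercube sampling of layer thicknesses in the (10 nm, 100 nm) range
sampler = qmc.LatinHypercube(d=16)
X_train = qmc.scale(sampler.random(n=32768), [10.0] * 16, [100.0] * 16)  # 32k designs
```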

2.2 Structural similarity based loss functions

The loss function is used to guide the training of the DNN during supervised learning. An optimal training metric ought to be sensitive to errors indicating poor model agreement, while remaining relatively invariant to random noise. The simplest and most widely used metric is the mean squared error (MSE), computed by averaging the squared differences of two spectra. MSE is highly discriminatory between different signals and is appealing because of its simplicity and mathematical convenience during optimization. The motivation to look for other metrics arises from the following observation: given two spectra A and B and a target spectrum T, a human observer is able to easily identify whether A or B is closer to the target T. In our method, the surrogate model is only used to select between two possible designs and would therefore benefit from being able to predict the overall shape of the spectrum in addition to minimizing local error.

The structural similarity index metric (SSIM) [43] was developed for comparing the quality of images and measures how similar two images are with respect to their structure within a particular convolution window. This calculation procedure can be adapted for comparing 1D spectra [44] ($y_t$, the true spectrum, and $y_p$, the predicted spectrum) by considering them to be images with only one row. At every spectral location $i$, subarrays spanning a kernel of size $k$ are taken and the means $\mu _t, \mu _p$ and standard deviations $\sigma _t, \sigma _p$ are calculated in addition to the covariance $\sigma _{tp}$. The SSIM is a local measure of similarity and returns an array of the same size as the arrays being compared (after adequate padding at the ends of the array). The mean of the SSIM, MSSIM, is calculated as follows:

$$MSSIM_{1D}\left( y^{t},y^{p}\right) = \displaystyle \dfrac{1}{N_\lambda} \sum_{i=0}^{N_\lambda - 1} SSIM_{1D}\left( y^{t},y^{p}\right)[i],$$
$$\begin{aligned}SSIM_{1D}\left( y^{t},y^{p}\right)[i] & = I[i] c[i] s[i],\\ I\left( y^{t},y^{p}\right) [i] & = \dfrac{2 \mu_t \mu_p + C_1}{\mu_t^{2} + \mu_p^{2} + C_1},\\ c\left( y^{t},y^{p}\right) [i] & = \dfrac{2 \sigma_t \sigma_p + C_2}{\sigma_t^{2} + \sigma_p^{2} + C_2},\\ s\left( y^{t},y^{p}\right) [i] & = \dfrac{\sigma_{tp} + C_3}{\sigma_t \sigma_p + C_3}, \end{aligned}$$
where $N_{\lambda}$ is the number of samples in the spectrum. The constants $C_1, C_2, C_3$ are needed to avoid numerical problems arising from very small denominators and are defined as $C_1 = (K_1 L)^{2}$, $C_2 = (K_2 L)^{2}$, $C_3 = C_2/2$. The constants $K_1 = 0.01$ and $K_2 = 0.02$ are small, essentially arbitrary values following [43], with $L$ being 1.0 (the dynamic range of spectral variations). Note that the value of MSSIM is bounded between −1.0 and 1.0, with a perfect match giving a score of 1.0.
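A plain NumPy sketch of the 1D MSSIM described by Eqs. (4) and (5) is given below; the uniform window of size $k$, the edge-replication padding and the loop structure are choices made here for clarity rather than speed.

```python
import numpy as np

def mssim_1d(y_true, y_pred, k=9, K1=0.01, K2=0.02, L=1.0):
    """Mean 1D structural similarity between two spectra of equal length."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2
    C3 = C2 / 2.0
    half = k // 2
    yt = np.pad(y_true, half, mode='edge')   # pad ends so every point gets a full window
    yp = np.pad(y_pred, half, mode='edge')
    ssim = np.empty(len(y_true))
    for i in range(len(y_true)):
        wt, wp = yt[i:i + k], yp[i:i + k]
        mu_t, mu_p = wt.mean(), wp.mean()
        var_t, var_p = wt.var(), wp.var()
        cov_tp = ((wt - mu_t) * (wp - mu_p)).mean()
        lum = (2 * mu_t * mu_p + C1) / (mu_t ** 2 + mu_p ** 2 + C1)
        con = (2 * np.sqrt(var_t * var_p) + C2) / (var_t + var_p + C2)
        stru = (cov_tp + C3) / (np.sqrt(var_t * var_p) + C3)
        ssim[i] = lum * con * stru
    return ssim.mean()   # in [-1, 1]; 1.0 for a perfect match
```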

Two loss functions, $L(y^{t}, y^{p})$, for training the DNN are then defined as:

$$\begin{aligned} L_{SSIM-MSE}\left( y^{t}, y^{p} \right) = \alpha & \left(1.0 - MSSIM_{1D}(y^{t}, y^{p}) \right) +\\ (1.0 - \alpha) & \left( MSE(y^{t}, y^{p}) \right) + \lambda \sum_{i=1}^{W} \beta_i^{2} \end{aligned}$$
$$\begin{aligned} L_{SSIM-MAE}\left( y^{t}, y^{p} \right) = \displaystyle \alpha & \left(1.0 - MSSIM_{1D}(y^{t}, y^{p}) \right) +\\ (1.0 - \alpha) & \left( MAE(y^{t}, y^{p}) \right) + \lambda \sum_{i=1}^{W} \beta_i^{2}, \end{aligned}$$
where MSE and MAE are the mean squared error and mean absolute error losses and $\alpha$ is a number between 0 and 1. An L2 weight regularization term with scaling constant $\lambda$, acting on the $W$ weights $\beta_i$ of the DNN being trained, is added to the losses to reduce overtraining.
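For training, the combined losses of Eqs. (6) and (7) must be differentiable; one possible TensorFlow formulation is sketched below. It uses a uniform moving-average window for the local statistics (so edge handling differs slightly from the NumPy sketch above) and leaves the L2 term to the layers' kernel_regularizer, so only the SSIM and error terms appear explicitly; the function name and defaults are illustrative.

```python
import tensorflow as tf

def make_ssim_loss(alpha=0.5, k=9, K1=0.01, K2=0.02, L=1.0, base='mae'):
    """Returns a Keras loss: alpha*(1 - MSSIM_1D) + (1 - alpha)*(MAE or MSE)."""
    C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2
    C3 = C2 / 2.0

    def local_mean(x):
        # x: [batch, 64, 1]; uniform moving average over a window of size k
        return tf.nn.avg_pool1d(x, ksize=k, strides=1, padding='SAME')

    def loss(y_true, y_pred):
        yt, yp = tf.expand_dims(y_true, -1), tf.expand_dims(y_pred, -1)
        mu_t, mu_p = local_mean(yt), local_mean(yp)
        var_t = local_mean(yt * yt) - mu_t ** 2
        var_p = local_mean(yp * yp) - mu_p ** 2
        cov = local_mean(yt * yp) - mu_t * mu_p
        sd_t = tf.sqrt(tf.maximum(var_t, 1e-12))   # epsilon avoids sqrt(0) gradients
        sd_p = tf.sqrt(tf.maximum(var_p, 1e-12))
        lum = (2 * mu_t * mu_p + C1) / (mu_t ** 2 + mu_p ** 2 + C1)
        con = (2 * sd_t * sd_p + C2) / (var_t + var_p + C2)
        stru = (cov + C3) / (sd_t * sd_p + C3)
        mssim = tf.reduce_mean(lum * con * stru, axis=[1, 2])
        err = tf.reduce_mean(tf.abs(y_true - y_pred) if base == 'mae'
                             else tf.square(y_true - y_pred), axis=-1)
        return alpha * (1.0 - mssim) + (1.0 - alpha) * err
    return loss
```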

3. Results and discussion

The open-source python program TMM [45] is used for spectrum calculation. DNNs are implemented with the Keras library [46] running on a TensorFlow 1.13 backend. The DE component was written in python from scratch with separate CPU and GPU versions, and we ensured that unnecessary data transfers between the CPU and the GPU are avoided. The source code for the implementation, datasets and saved models will be made available upon publication. The following results are obtained on a workstation with an Intel i9-7920X CPU and an NVIDIA GeForce GTX 1080 Ti GPU with 11 GB of memory. The training of a DNN (with 3 hidden layers of 128 neurons each) takes less than 3 minutes on a single GPU for a 32k-sized dataset and 200 epochs. For the evaluation of the spectra of 100,000 geometries, the trained DNN took 479 ms, while the TMM python package took 6 minutes and 19 s using 16 cores in parallel and about 1 hour 5 minutes for a single-threaded version.
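The timing comparison can be reproduced approximately with a snippet of the following kind; the numbers will depend on hardware, and build_surrogate() and reflectance_spectrum() refer to the sketches given earlier in this paper rather than to the authors' released code.

```python
import time
import numpy as np

model = build_surrogate(n_hidden=3, width=128)           # untrained is fine for timing
X = np.random.uniform(10, 100, size=(100_000, 16)).astype('float32')

t0 = time.time()
_ = model.predict(X, batch_size=8192, verbose=0)          # one batched GPU pass
print(f"DNN batch prediction: {time.time() - t0:.2f} s")

t0 = time.time()
_ = [reflectance_spectrum(x) for x in X[:1000]]           # time 1k designs, then extrapolate
print(f"TMM (single thread, extrapolated to 100k): {(time.time() - t0) * 100:.0f} s")
```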

3.1 Training progress with different loss functions

The first experiment examines the role of various loss functions and training dataset sizes in the prediction ability of DNNs after training for 200 epochs. There are four separate sub-experiments whose results are summarized in Fig. 3. With a dataset size of 64k, a trained DNN has a mean spectral error of 0.1 on the training set, which increases to 0.15 for a training set of 32k and 0.3 for a training set of 8k. Dataset sizes of less than 10k can thus be considered quite small. A 400,000-sample dataset and a separate 100,000-sample testing dataset are first generated; smaller training datasets are split off from the 400k dataset. L2 regularization [see Eqs. (6) and (7)] was used for training with a constant $\lambda$ of 5e-4.
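A minimal training setup consistent with this description, re-using the build_surrogate(), reflectance_spectrum() and make_ssim_loss() sketches given above, would look as follows; the optimizer choice, batch size and validation split are assumptions for the sketch.

```python
import numpy as np

# Labels: exact spectra for the LHS designs (expensive, computed once off-line)
Y_train = np.array([reflectance_spectrum(x) for x in X_train])

model = build_surrogate(n_hidden=2, width=128, l2=5e-4)
model.compile(optimizer='nadam',
              loss=make_ssim_loss(alpha=0.5, k=9, base='mae'))   # L_SSIM-MAE of Eq. (7)
model.fit(X_train, Y_train, epochs=200, batch_size=256, validation_split=0.1)
```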


Fig. 3. Prediction capability of DNNs trained under various loss functions and dataset sizes. (a), (b), (c) and (d) compare the predicted spectrum (colored solid lines) against the ground truth spectrum (black dotted line) for a random unseen test input. (e), (f), (g) and (h) show the distribution of mean absolute prediction error (blue) and peak absolute prediction error (red) on a test set of size 100k. The text in blue shows the overall mean of the mean absolute error distribution. See the text for description of other parameters for the four sub-experiments (a, e), (b, f), (c, g) and (d, h). All the DNNs have 2 hidden layers with 128 neurons each.


The first sub-experiment [Figs. 3(a) and 3(e)] considers five separate DNNs with 2 hidden layers containing 128 neurons each, trained on a dataset of size 64k with the $L_{SSIM-MAE}$ [Eq. (7)] loss and varying $\alpha$. Five values of $\alpha$ are considered: 0.2, 0.4, 0.6, 0.8 and 1.0 [Fig. 3(a), top to bottom]. It is seen that a loss composed of a linear combination of MSSIM and MAE works better than using MSSIM or MAE alone. With high values of $\alpha$, the predicted spectrum tends to lean towards larger values. Values of $\alpha$ in the range of 0.4 to 0.6 are seen to work best. The second sub-experiment [Figs. 3(b) and 3(f)] considers four separate DNNs (with 2 hidden layers containing 128 neurons each) trained with the $L_{SSIM-MSE}$ [Eq. (6)] loss on a dataset of size 64k with $\alpha$ of 0.0, 0.25, 0.50 and 0.75. The DNN trained with the $L_{SSIM-MAE}$ [Eq. (7)] loss with $\alpha$ of 0.5 is used for comparison. It is seen that a loss composed of a linear combination of MSSIM and MSE works better than MSSIM or MSE acting alone. Values of $\alpha$ in the range of 0.4 to 0.6 are again seen to work best, and $L_{SSIM-MAE}$ with $\alpha$ of 0.5 performs better than $L_{SSIM-MSE}$ with the same $\alpha$. The third sub-experiment [Figs. 3(c) and 3(g)] considers four separate DNNs (with 2 hidden layers containing 128 neurons each) trained with the $L_{SSIM-MAE}$ [Eq. (7)] loss on a dataset of size 64k with $\alpha$ of 0.5 and kernel sizes of 3, 5, 9 and 15. Although increasing the kernel size leads to improved prediction, the training time also increases. Thus, unless otherwise mentioned, a kernel size of 9 is used for computing MSSIM in this paper. The fourth sub-experiment [Figs. 3(d) and 3(h)] considers six separate DNNs with 2 hidden layers containing 128 neurons each. The first three are trained with the MSE loss on training dataset sizes of 32k, 64k and 128k, respectively; the remaining three are trained with the $L_{SSIM-MAE}$ [Eq. (7)] loss with $\alpha$ of 0.5 on the same dataset sizes. It can be seen that the newly proposed loss functions lead to improved prediction ability regardless of training dataset size.

3.2 Surrogate-assisted DE

DE is a stochastic algorithm and the result of an optimization run is a random variable. The random nature of the algorithm results from the random initialization in addition to the randomness of the variation step (mutation and crossover) performed at every iteration. The typical course of a plain DE run for 9 separate instances is seen in Fig. 4(a). In Fig. 4(b), the typical course of SADE with on-line trained surrogates [see Fig. 1(F)] is shown. It is seen that the progression of the SADE case closely resembles that of the DE case and that most of the computational savings occur in the middle stage. Furthermore, it is also seen that the fitness plateaus early. The performance of an optimization framework can be characterized by: (1) the number of function evaluations (NFE) used during the optimization averaged over several runs, (2) the median fitness over several runs and (3) the variance in the fitness achieved at the end of multiple runs. The use of DNN surrogates should ideally result in a reduction of the NFE and the variance without an undue compromise in the quality of the optimum. For the remaining set of experiments, the results are shown in the form of boxplots of the kind seen in Fig. 4(a) (with the median marked).


Fig. 4. A Typical course of plain DE optimization for designing a broadband antireflection coating (9 separate runs). The parameters of the optimization: population - 40, mutation radius - 0.8, crossover probability - 0.7, number of iterations - 256. The result of a set of runs can be summarized by a boxplot as seen on the right. B Typical course of a single SADE optimization showing the fitness evolution and the function evaluation count at each iteration. The training and retraining operation of surrogates is performed at the marked points.


In the first experiment, DE is run using only approximate fitness evaluations. The DNN is run on the GPU, where it is possible to evaluate multiple input designs in parallel. One hundred separate islands with a population of 128 each are run in parallel for 64 iterations using a mutation radius of 0.4 and a crossover probability of 0.4. The optima determined in the previous step are then evaluated with an exact function call as a verification step. The results of this experiment are summarized in Fig. 5. A main observation is that optimization relying only on trained DNNs gives poorer solution quality, with a mean reflectance of only about 4%. However, the optimization results are seen to improve with increasing dataset size. We note that a plain DE consuming about 10k function evaluations can give better solution quality. The predictive capability of a DNN for a particular design depends to a large extent on the distribution and size of the training data and to a lesser extent on the model architecture and hyperparameters. The degree to which the approximate fitness landscape matches the actual fitness landscape decides the performance of the DE. It is seen that the models trained on larger datasets generally fare better, with the exception of the models trained on the 64k-sized dataset. For smaller datasets, the distribution of the data points is seen to matter more than the dataset size. The proposed SSIM based loss functions are seen to perform better than the MSE loss for a given dataset. In subsequent experiments, the DNN surrogate is optionally used to initialize the population instead of random initialization (henceforth called surrogate initialization).
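When the DNN alone supplies the fitness, all candidate designs from every island can be scored in a single batched GPU call; a sketch of such an approximate fitness function (the names and batch size are illustrative) is:

```python
import numpy as np

def surrogate_fitness(model, designs, target=np.zeros(64)):
    """Approximate MSE fitness for a [n_designs, 16] batch of thickness vectors."""
    spectra = model.predict(designs, batch_size=8192, verbose=0)   # one batched GPU call
    return np.mean((spectra - target) ** 2, axis=1)
```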


Fig. 5. Results of DE optimization using only surrogate fitness evaluations. Each DNN is identified by the training loss used and the dataset size. The number of hidden layers for a dataset size of 32k is 1, 2 for 64k and 3 for 128k and 320k respectively. The training loss function is denoted by color codes: Gray color - MSE, Green - SSIM-MAE, Blue - SSIM-MSE.


The next experiment considers SADE for antireflection coating design and the results are summarized in Fig. 6. For comparison, we use the results of a plain DE with optimization parameters: population = 40, mutation radius = 0.8, crossover probability = 0.7 and number of iterations = 256. The first sub-experiment (Fig. 6, top subfigure) considers nine cases combining three different losses (SSIM-MSE, SSIM-MAE and MSE) and three different dataset sizes (32k, 128k and 320k). It is observed that the NFE is reduced significantly (up to an 80% reduction) for all cases. SADE is seen to perform much better than surrogate-only DE. Larger datasets work better and can outperform plain DE in solution quality with a large reduction in NFE. Smaller datasets are seen to exhibit larger variance. SSIM based losses are seen to perform better than MSE for all dataset sizes, albeit with more variance. The variance problem is addressed later in the paper.


Fig. 6. Results of SADE optimizations for designing an antireflection coating. Results are grouped into top and bottom sets and the parameters of the DNN surrogate are tabulated alongside the corresponding boxplot (Sz - size of training dataset, Init - indicates whether surrogate initialization is used or not, H - number of hidden layers in the network containing 128 neurons each, NFE - average number of exact function calls). The bottom set used the SSIM-MSE loss for training. For comparison, results of a plain DE run are added to both sets.


The second sub-experiment (Fig. 6, bottom subfigure) was to assess whether initializing the assisted DE with a surrogate-initialization step is helpful. Eight cases were considered using four dataset sizes (32k, 64k, 128k, 320k) with and without surrogate initialization. It is noted that surrogate initialization leads to a small decrease in the NFE for most dataset sizes. The solution quality is poorer for smaller datasets, while for larger datasets the median fitness is seen to improve somewhat. The variance is seen to increase with initialization for all dataset sizes.

In the next experiment, we investigate a technique to reduce the variance seen in surrogate-assisted DE. Four separate islands with a population of 20 each are considered. The DE algorithm is run as usual in parallel for all four islands for 64 iterations. The best individual in each of the islands is identified first. The best individual in island 1 is then migrated to island 2, that of island 2 to island 3, and so on in a round-robin fashion (i.e. $1 \longrightarrow 2$, $2 \longrightarrow 3$, $3 \longrightarrow 4$ and $4 \longrightarrow 1$). The DE algorithm is again run for a further 64 iterations. The best individual out of all four islands is considered the final optimum. The entire sequence described above is repeated 10 times and the fitness values and NFE counts are shown in Fig. 7. The plain DE has an NFE of 10,320 and the surrogate-assisted DE leads to about a 50% reduction in computation. In comparison to the surrogate-assisted DE described earlier, the computational load is increased nearly two-fold; however, the optimization outcome is dramatically improved. Even for small datasets, the surrogate-assisted DE performs better, leading to an improvement in median fitness as well as a reduction in variance. The proposed loss functions can thus lead to a smooth trade-off between solution quality and computation.
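The island scheme can be sketched as follows. Here run_de_fn stands for a hypothetical wrapper that runs either the plain or the surrogate-assisted generation loop sketched earlier for a given number of iterations on one island, and replacing the destination island's worst member with the migrant is an assumption, since only the migrating individual is specified in the text.

```python
import numpy as np

def run_islands(islands, fits, run_de_fn, n_pre=64, n_post=64):
    """4-island DE with one round-robin migration of the best individual."""
    n = len(islands)
    for i in range(n):                       # first 64 iterations on each island
        islands[i], fits[i] = run_de_fn(islands[i], fits[i], n_pre)
    # Round-robin migration: best of island i overwrites the worst of island i+1
    bests = [islands[i][np.argmin(fits[i])].copy() for i in range(n)]
    best_f = [fits[i].min() for i in range(n)]
    for i in range(n):
        dst = (i + 1) % n
        worst = int(np.argmax(fits[dst]))
        islands[dst][worst], fits[dst][worst] = bests[i], best_f[i]
    for i in range(n):                       # further 64 iterations after migration
        islands[i], fits[i] = run_de_fn(islands[i], fits[i], n_post)
    best_island = int(np.argmin([f.min() for f in fits]))
    return islands[best_island][np.argmin(fits[best_island])]   # overall optimum
```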


Fig. 7. Performance of a 4-island DE with and without surrogate assistance. A total of 128 iterations is used with a round-robin best-individual migration step after the 64th iteration. The algorithm is run 10 times and the boxplots of best fitness and NFE are shown along with the parameters of the DNN surrogate. Population $P =20$ for all cases. $r_{mut}=0.8$, $p_{cross} = 0.7$ for plain DE and $r_{mut}=0.4$, $p_{cross} = 0.4$ for assisted DE.


In the final experiment we consider SADE using surrogates trained on-the-fly. Note that a DE run with a population of 80 for 256 iterations needs about 20k function calls; the amount of data available for training the surrogates is therefore quite limited. However, the surrogates are locally accurate and can still be useful. Figure 8 shows the results of this experiment comparing nine different cases (each case in turn is an ensemble of 10 separate runs). Three values of the population, 160, 80 and 40, are considered with the plain DE as the reference. Higher population counts lead to improved results but with diminishing returns. Surrogates trained with the SSIM-MSE training loss are seen to outperform plain DE in terms of the median fitness. A closer look at the fitness evolution as a function of the number of DE iterations (Fig. 9) reveals that SADE saturates faster, indicating that early stopping can be beneficial.


Fig. 8. Performance of SADE with on-the-fly trained surrogate models without initialization or use of historical data. Color codes: tan - plain DE, gray - MSE, blue - SSIM-MSE. The population $P$ is varied for each run and the average NFE is listed. All the runs were for 256 iterations.



Fig. 9. A closer look at the results seen in Fig. 8 showing the evolution of the fitness for different number of iterations.


4. Conclusions

In summary, we report and demonstrate the utility of a workflow that combines conventional evolutionary optimization with efficient usage of all available data, including historical simulation data and data generated during the course of the optimization. The problem of targeting a particular spectral response is a common feature in various branches of optics and for devices operating in other regions of the electromagnetic spectrum, for which the similarity based training losses are particularly relevant. Although DL is being increasingly applied for inverse design and to accelerate optimization in photonics and nanophotonics, its ambit of applicability is restricted to cases where reasonably large datasets can be acquired. In scenarios where only moderately large datasets can be generated, conventional optimizations can already be competitive while bypassing the methodological complexities of dataset generation and of model training and validation. The results of this study are thus particularly relevant to the inverse design of three-dimensional structures, which are computationally expensive to simulate and therefore difficult to optimize. The approach can be easily generalized to other kinds of evolutionary optimization such as genetic algorithms and particle-swarm optimization.

Additionally, the proposed variety of loss functions enables the possibility of using multiple surrogates [12] simultaneously to assist DE. Future work will investigate the possibilities offered by ensembles of DNNs trained on the various loss functions introduced here. The neural networks used here employ fully connected layers with a feed-forward connection pattern. In the situation of insufficient training data, we find that increasing the complexity of the network may lead to overtraining. Recently proposed architectures [29,47] like DenseNets, in which a layer receives input from all previous layers rather than just the one preceding it, may prove useful for problems of higher dimensionality.

Funding

Department of Science and Technology, Ministry of Science and Technology, India (SR/NM/NS-65/2016).

Disclosures

The author declares no conflicts of interest.

References

1. A. F. Koenderink, A. Alu, and A. Polman, “Nanophotonics: Shrinking light-based technology,” Science 348(6234), 516–521 (2015). [CrossRef]  

2. S. Molesky, Z. Lin, A. Y. Piggott, W. Jin, J. Vucković, and A. W. Rodriguez, “Inverse design in nanophotonics,” Nat. Photonics 12(11), 659–670 (2018). [CrossRef]  

3. S. D. Campbell, D. Sell, R. P. Jenkins, E. B. Whiting, J. A. Fan, and D. H. Werner, “Review of numerical optimization techniques for meta-device design [Invited],” Opt. Mater. Express 9(4), 1842–1863 (2019). [CrossRef]  

4. N. Lebbe, C. Dapogny, E. Oudet, K. Hassan, and A. Gliere, “Robust shape and topology optimization of nanophotonic devices using the level set method,” J. Comput. Phys. 395, 710–746 (2019). [CrossRef]  

5. A. Michaels and E. Yablonovitch, “Leveraging continuous material averaging for inverse electromagnetic design,” Opt. Express 26(24), 31717–31737 (2018). [CrossRef]  

6. A. Y. Piggott, E. Y. Ma, L. Su, G. H. Ahn, N. V. Sapra, D. Vercruysse, A. M. Netherton, A. S. Khope, J. E. Bowers, and J. Vuckovic, “Inverse-designed photonics for semiconductor foundries,” ACS Photonics 7(3), 569–575 (2020). [CrossRef]  

7. E. B. Whiting, S. D. Campbell, L. Kang, and D. H. Werner, “Meta-atom library generation via an efficient multi-objective shape optimization method,” Opt. Express 28(16), 24229–24242 (2020). [CrossRef]  

8. P.-I. Schneider, X. Garcia Santiago, V. Soltwisch, M. Hammerschmidt, S. Burger, and C. Rockstuhl, “Benchmarking five global optimization approaches for nano-optical shape optimization and parameter reconstruction,” ACS Photonics 6(11), 2726–2733 (2019). [CrossRef]  

9. R. Bellman, Dynamic Programming (Dover Publications Inc., Mineola, N.Y, 2003), reprint edition ed.

10. Y. Jin, “A comprehensive survey of fitness approximation in evolutionary computation,” Soft Comput. 9(1), 3–12 (2005). [CrossRef]  

11. Y. Jin, H. Wang, T. Chugh, D. Guo, and K. Miettinen, “Data-Driven Evolutionary Optimization: An Overview and Case Studies,” IEEE Trans. Evol. Computat. 23(3), 442–458 (2019). [CrossRef]  

12. Y. Jin, “Surrogate-assisted evolutionary computation: Recent advances and future challenges,” Swarm Evol. Comput. 1(2), 61–70 (2011). [CrossRef]  

13. Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature 521(7553), 436–444 (2015). [CrossRef]  

14. N. Baker, F. Alexander, T. Bremer, A. Hagberg, Y. Kevrekidis, H. Najm, M. Parashar, A. Patra, J. Sethian, S. Wild, and K. Willcox, “Workshop Report on Basic Research Needs for Scientific Machine Learning: Core Technologies for Artificial Intelligence,” Tech. rep., USDOE Office of Science (SC), Washington, D.C. (United States) (2019).

15. K. A. Brown, S. Brittman, N. Maccaferri, D. Jariwala, and U. Celano, “Machine Learning in Nanoscience: Big Data at Small Scales,” Nano Lett. 20(1), 2–10 (2020). [CrossRef]  

16. F. Scarselli and A. Chung Tsoi, “Universal Approximation Using Feedforward Neural Networks: A Survey of Some Existing Methods, and Some New Results,” Neural Networks 11(1), 15–37 (1998). [CrossRef]  

17. R. K. Tripathy and I. Bilionis, “Deep UQ: Learning deep neural network surrogate models for high dimensional uncertainty quantification,” J. Comput. Phys. 375, 565–588 (2018). [CrossRef]  

18. J. Peurifoy, Y. Shen, L. Jing, Y. Yang, F. Cano-Renteria, B. G. DeLacy, J. D. Joannopoulos, M. Tegmark, and M. Soljačić, “Nanophotonic particle simulation and inverse design using artificial neural networks,” Sci. Adv. 4(6), eaar4206 (2018). [CrossRef]  

19. R. S. Hegde, “Deep learning: A new tool for photonic nanostructure design,” Nanoscale Adv. 2(3), 1007–1023 (2020). [CrossRef]  

20. J. Jiang, M. Chen, and J. A. Fan, “Deep neural networks for the evaluation and design of photonic devices,” Nature Reviews Materials pp. 1–22 (2020).

21. Z. Liu, D. Zhu, S. P. Rodrigues, K.-T. Lee, and W. Cai, “Generative Model for the Inverse Design of Metasurfaces,” Nano Lett. 18(10), 6570–6576 (2018). [CrossRef]  

22. Z. Liu, D. Zhu, K.-T. Lee, A. S. Kim, L. Raju, and W. Cai, “Compounding Meta-Atoms into Metamolecules with Hybrid Artificial Intelligence Techniques,” Adv. Mater. 32(6), 1904790 (2020). [CrossRef]  

23. W. Ma, F. Cheng, and Y. Liu, “Deep-Learning-Enabled On-Demand Design of Chiral Metamaterials,” ACS Nano 12(6), 6326–6334 (2018). [CrossRef]  

24. A. M. Hammond, E. Potokar, and R. M. Camacho, “Accelerating silicon photonic parameter extraction using artificial neural networks,” OSA Continuum 2(6), 1964–1973 (2019). [CrossRef]  

25. J. Jiang and J. A. Fan, “Dataless training of generative models for the inverse design of metasurfaces,” arXiv:1906.07843 [physics] (2019).

26. J. Jiang and J. A. Fan, “Global optimization of dielectric metasurfaces using a physics-driven neural network,” Nano Letters p. acs.nanolett.9b01857 (2019).

27. O. Hemmatyar, S. Abdollahramezani, Y. Kiarashinejad, M. Zandehshahvar, and A. Adibi, “Full color generation with Fano-type resonant HfO2 nanopillars designed by a deep-learning approach,” Nanoscale 11(44), 21266–21274 (2019). [CrossRef]  

28. C. C. Nadell, B. Huang, J. M. Malof, and W. J. Padilla, “Deep learning for accelerated all-dielectric metasurface design,” Opt. Express 27(20), 27523–27535 (2019). [CrossRef]  

29. T. Elsken, J. H. Metzen, and F. Hutter, “Neural Architecture Search: A Survey,” arXiv:1808.05377 [cs, stat] (2019).

30. R. S. Hegde, “Accelerating Optics Design Optimizations with Deep Learning,” Opt. Eng. 58(06), 1 (2019). [CrossRef]  

31. R. S. Hegde, “Photonics Inverse Design: Pairing Deep Neural Networks with Evolutionary Algorithms,” IEEE J. Sel. Top. Quantum Electron. 26(1), 1–8 (2020). [CrossRef]  

32. R. Storn and K. Price, “Differential Evolution – A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces,” J. Glob. Optim. 11(4), 341–359 (1997). [CrossRef]  

33. U. B. Schallenberg, “Antireflection design concepts with equivalent layers,” Appl. Opt. 45(7), 1507–1514 (2006). [CrossRef]  

34. A. V. Tikhonravov and J. A. Dobrowolski, “Quasi-optimal synthesis for antireflection coatings: a new method,” Appl. Opt. 32(22), 4265–4275 (1993). [CrossRef]  

35. J. A. Dobrowolski, A. V. Tikhonravov, M. K. Trubetskov, B. T. Sullivan, and P. G. Verly, “Optimal single-band normal-incidence antireflection coatings,” Appl. Opt. 35(4), 644–658 (1996). [CrossRef]  

36. A. V. Tikhonravov, “Some theoretical aspects of thin-film optics and their applications,” Appl. Opt. 32(28), 5417–5426 (1993). [CrossRef]  

37. S. W. Anzengruber, E. Klann, R. Ramlau, and D. Tonova, “Numerical methods for the design of gradient-index optical coatings,” Appl. Opt. 51(34), 8277–8295 (2012). [CrossRef]  

38. Y. Zhao, F. Chen, Q. Shen, and L. Zhang, “Optimal Design of Graded Refractive Index Profile for Broadband Omnidirectional Antireflection Coatings Using Genetic Programming,” Prog. Electromagn. Res. 145, 39–48 (2014). [CrossRef]  

39. J.-M. Yang and C.-Y. Kao, “Efficient evolutionary algorithm for the thin-film synthesis of inhomogeneous optical coatings,” Appl. Opt. 40(19), 3256–3267 (2001). [CrossRef]  

40. M. Ebrahimi and M. Ghasemi, “Design and optimization of thin film polarizer at the wavelength of 1540 nm using differential evolution algorithm,” Opt. Quantum Electron. 50(4), 192 (2018). [CrossRef]  

41. Y. Jin, M. Olhofer, and B. Sendhoff, “A framework for evolutionary optimization with approximate fitness functions,” IEEE Trans. Evol. Computat. 6(5), 481–494 (2002). [CrossRef]  

42. P. Ramachandran, B. Zoph, and Q. V. Le, “Searching for Activation Functions,” arXiv:1710.05941, 1–13 (2017).

43. Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, “Image Quality Assessment: From Error Visibility to Structural Similarity,” IEEE Trans. Image Process. 13(4), 600–612 (2004). [CrossRef]  

44. D. Breakey and C. Meskell, “Comparison of metrics for the evaluation of similarity in acoustic pressure signals,” J. Sound Vib. 332(15), 3605–3609 (2013). [CrossRef]  

45. S. J. Byrnes, “Multilayer optical calculations,” arXiv 1603.02720, 1–20 (2018).

46. F. Chollet, “Keras: the python deep learning library,” Github Repository (2015).

47. G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely Connected Convolutional Networks,” arXiv:1608.06993 [cs] (2018).
