
Inverse design of the MMI power splitter by asynchronous double deep Q-learning

Open Access

Abstract

The asynchronous double deep Q-learning (A-DDQN) method is proposed to design multimode interference (MMI) power splitters with low insertion loss and wide bandwidth over the 1200 to 1650 nm wavelength range. By using A-DDQN to guide hole etchings in the interference region of the MMI, the target splitting ratio (SR) can be obtained with much less CPU time (about 10 hours for one design) and more effective utilization of the computational resources in an asynchronous/parallel manner. This method also simplifies the design by using relatively few holes to obtain the same SR with small return loss.

© 2021 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. Introduction

The power splitter, which divides the energy of the input source among two or more output ports at a desired ratio, has wide applications in optoelectronic and photonic integrated circuit designs. Over the past few years, researchers have proposed various methods and structures, such as the directional coupler (DC) [1,2], Y-branch [3], mode converter [4], etc., to achieve arbitrary splitting ratios (SRs). However, due to the long beat length between different modes in the DC device, its coupling region can reach a hundred micrometers, with a flat bandwidth of less than 100 nm. The multimode interference (MMI) based power splitter can be fabricated on the silicon-on-insulator (SOI) platform, where the beat length is greatly reduced to less than 5 µm with a broad and flat splitting wavelength band [5].

There are generally two types of design methods for the MMI power splitter. One is to combine electromagnetic simulations with numerical/analytical optimization algorithms, such as the adjoint method [6,7], objective-first algorithm [8,9], direct binary search (DBS) [10], and particle swarm optimization (PSO) [11]. These methods are flexible and straightforward during the design/optimization process. However, as more and more parameters are included in the simulation to improve device performance, the computational time may increase exponentially [12]. The other approach is to use artificial neural networks (NNs) to inversely design the power splitters. The conditional variational autoencoder model [13] and composite encoder-decoder model [14] have been proposed, which after training can generate designs instantaneously according to the specified requirements. However, preparation of the training dataset can be highly time-consuming and tedious, as it involves a large number of numerical simulations, as well as database labeling and filtering processes.

In the field of artificial intelligence, reinforcement learning (RL) [15] is an attractive approach that does not require a huge pre-prepared training dataset. The basic concept of RL is to provide an environment in which the neural network (agent) can explore the subject according to pre-set rules and the associated rewards, such that the agent learns to make decisions that maximize the reward. RL is also more consistent with the human cognition process, and it was later combined with deep neural networks to give the deep reinforcement learning (DRL) method [16].

In recent years, many groups have also used DRL to explore nanophotonic devices, such as multilayer thin-film optimizations [17], moth-eye structure designs for ultra-broadband absorption in the visible and near-infrared waveband [18], and color purification/generation by dielectric nanostructures [19]. All of the above use similar DRL methods to parameterize the structure, such that the actions of the agent are linked directly to the values of those structural parameters. If a vast number of parameters have to be considered, at least twice as many actions have to be defined [17–19], which makes the DRL method hard to converge for the MMI splitter design, since hundreds of etching positions are possible in the interference region.

Here, we propose a novel A-DDQN method based on DRL to inversely design the MMI power splitter within a small footprint (2×2 µm²), obtaining a specified SR over a broad wavelength range (1200∼1650 nm). The agent moves only in a limited set of directions, so the number of action types needed to decide where the holes are etched is greatly reduced. Also, structures designed by our method tend to have comparatively fewer holes, which shortens the lithography time and offers more room for detailed tuning if a more stable ratio over a wider bandwidth is desired. The network trained for a previous ratio can be re-used to design devices with different SRs, which saves a significant amount of re-training time. The exploration histories of the agents can also be shared with each other to improve the learning efficiency significantly. More importantly, human intervention for dataset pre-processing is relieved; only the target SR needs to be set at the beginning. With the path-related data collected along the target-finding process, the A-DDQN method enables efficient inverse design of the MMI splitter, as well as of more complex photonic devices.

2. Structure and method

2.1 MMI Splitter

In our calculation, we use the three-port MMI structure in Fig. 1(a) as the basic model, where the left port is the input, and the right two ports (port 1 and port 2) are the outputs. The material used for the MMI (blue region) is Si, which is surrounded by SiO2 as the protective oxide (gray region). The refractive indices of these two materials are around 3.5 and 1.45, respectively; their exact values and dispersion characteristics at different wavelengths are given by the default material library of the Lumerical FDTD package. The taper length is 3 µm, and the widths at the two ends of the tapers are 0.5 µm (at the input/output ports) and 1 µm (at the connections to the central region, whose length is 2 µm), respectively. The distance between two nearest circular holes is 100 nm, and the radius of each hole is 40 nm, similar to Refs. [13,20,21]. The maximum number of holes per row or column is 20. The holes are not allowed to overlap with each other, to prevent fracture of the interference region during etching, and the circular holes can be patterned using electron-beam lithography [22]. The input light used in our simulation is TE polarized, and the MMI power splitting process is calculated by the FDTD solver [23]. The function of the A-DDQN agent is to decide where to etch holes in the interference region, so that the optical power is split according to the desired ratio.
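As a concrete illustration (a minimal sketch, not taken from the authors' code), the 20×20 etching grid described above can be indexed in Python, mapping grid cells to hole-center coordinates using the 100 nm pitch and 40 nm radius quoted here; centering the grid within the 2×2 µm² interference region is our assumption.

```python
# Hypothetical helper: map grid indices (i, j) to hole-centre coordinates in micrometres,
# assuming the 20x20 grid (100 nm pitch, 40 nm hole radius) is centred in the 2x2 um^2 region.
PITCH_UM = 0.1     # 100 nm spacing between neighbouring hole centres
RADIUS_UM = 0.04   # 40 nm hole radius
GRID_SIZE = 20     # up to 20 holes per row or column

def hole_center(i, j):
    """Return the (x, y) centre of grid cell (i, j), with the region centred at (0, 0)."""
    offset = (GRID_SIZE - 1) * PITCH_UM / 2.0   # 0.95 um, keeps all holes inside the region
    return (j * PITCH_UM - offset, i * PITCH_UM - offset)

# Example: the four corner holes of the interference region
corners = [hole_center(i, j) for i in (0, GRID_SIZE - 1) for j in (0, GRID_SIZE - 1)]
print(corners)  # approx. [(-0.95, -0.95), (0.95, -0.95), (-0.95, 0.95), (0.95, 0.95)] (um)
```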

Fig. 1. (a) The schematic structure of MMI power splitter. (b) Power distribution in the interference region of MMI at 1650 nm wavelength. (c) Transmission spectra of the two output ports.

For the structure shown in Fig. 1(a), a 1:1 splitting ratio is achieved with no etchings in the interference region, where the transmissions of the two ports (T1, T2) are almost identical, as shown in Fig. 1(c). The return loss (R) is also plotted in the figure, and its value is monitored as part of the criteria to guarantee an optimized design. The field distribution at λ = 1650 nm is shown in Fig. 1(b) to demonstrate the power splitting process.

2.2 Asynchronous DDQN

In order to optimize the MMI splitter for arbitrary ratios, asynchronous DDQN is proposed here, as shown in Fig. 2, to improve the computational efficiency compared to the traditional DQN and DDQN [16,24] (whose working principles are briefly described in the Appendix). Multiple DDQNs are initialized with shared weights and S-A-R-S-A (state-action-reward-state-action) sequences [25], and the DDQN agents are trained asynchronously in their own environments. This scheme is similar to the particle swarm optimization (PSO) method, where agents (like particles in PSO) explore the parameter domain interactively [11]. Here, a shared buffer is used to store the sequences for the agents to increase their learning efficiency, and the trained weights θ are saved in the global NN model and shared among the other agents to improve their optimization ability. The computational resources can also be utilized more adequately to speed up the whole inverse design process.

Fig. 2. The schematic diagram of A-DDQN.

The A-DDQN algorithm can be given as follows:

[Algorithm: the A-DDQN pseudocode is provided as an image in the original article.]
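To make the asynchronous scheme concrete, the following is a hedged Python sketch (not the authors' implementation) of how several DDQN workers could share one replay buffer and one global model; `make_env`, `make_net` and `ddqn_worker` are hypothetical names, with `ddqn_worker` standing for the per-agent training loop sketched after the step list below.

```python
import threading
from collections import deque

def run_async_ddqn(make_env, make_net, n_workers=4, buffer_size=10000):
    """Launch several DDQN workers sharing one replay buffer and one global model."""
    global_net = make_net()                     # global NN model G holding the shared weights
    shared_buffer = deque(maxlen=buffer_size)   # shared S-A-R-S-A buffer B
    lock = threading.Lock()                     # guards concurrent buffer/weight access

    def worker():
        env = make_env()                        # each agent explores its own FDTD environment
        local_net = make_net()
        local_net.load_state_dict(global_net.state_dict())   # start from the shared weights
        ddqn_worker(env, local_net, global_net, shared_buffer, lock)  # per-agent loop (below)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```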

Each DDQN agent follows the temporal-difference (TD) algorithm [26], and its training process is as follows:

  • (1) Initialize two identical NNs, i.e., the evaluation-NN and target-NN, with random weights θ and θ’ (or loaded from the global NN model G), θ=θ’. Set the threshold step TS at which agents start learning, the maximum number of exploration steps M for the agents in each episode, and the target-NN update frequency C;
  • (2) Initialize the simulation environment to get the current state st (t=1,2,…,M);
  • (3) The agent uses the ε-greedy algorithm [27] to determine whether the action is taken from the NN or chosen randomly, to form the new state st+1. If the action is to be generated by the NN, then st is fed to the evaluation-NN and the action with the maximum Q value is selected (as described in the Appendix);
  • (4) Simulate st+1 and obtain the reward rt. Save the sequence st, at, rt, st+1 to the buffer for the agent to be trained on. If the performance of the device meets our requirements, the training is stopped and the weights are saved/uploaded to the global NN model G. Otherwise, continue;
  • (5) If the number of sequences stored in the buffer B is greater than the learning-threshold step TS, a mini-batch of sequences is chosen from the buffer to train the evaluation-NN. Every C training steps, the weights of the target-NN are updated from the evaluation-NN, whose weights are also uploaded to the global network and saved for the other agents to use.
  • (6) Repeat Steps (3) to (5). The agent explores structures until the performance of the MMI meets our requirements. If the number of repetitions exceeds the maximum number of explorations M in one episode, the environment is reinitialized. (A minimal code sketch of this per-agent loop is given below.)
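The following is a minimal per-agent sketch of Steps (1)–(6), assuming a hypothetical Gym-style environment with `reset()`/`step()` wrapping the FDTD simulation and a PyTorch Q-network mapping the 21×21 state to 16 Q values; it is not the authors' code, although the hyperparameters quoted in this paper (Adam with a learning rate of 0.0005, SmoothL1 loss, γ = 0.9) are used where they are specified.

```python
import random

import torch
import torch.nn.functional as F

def ddqn_worker(env, eval_net, global_net, buffer, lock,
                ts=200, max_steps=500, sync_every=50,
                gamma=0.9, batch_size=32, eps=0.5):
    # Step (1): an identical target-NN initialized from the evaluation-NN
    # (assumes the network class has a no-argument constructor).
    target_net = type(eval_net)()
    target_net.load_state_dict(eval_net.state_dict())
    opt = torch.optim.Adam(eval_net.parameters(), lr=5e-4)

    while True:
        state = env.reset()                                   # Step (2): (1, 21, 21) tensor
        for t in range(max_steps):
            if random.random() > eps:                         # Step (3): epsilon-greedy choice
                action = random.randrange(16)
            else:
                with torch.no_grad():
                    action = eval_net(state.unsqueeze(0)).argmax(1).item()

            next_state, reward, done = env.step(action)       # Step (4): FDTD evaluation
            with lock:
                buffer.append((state, action, reward, next_state, done))
            state = next_state
            if done:                                          # target met: upload and stop
                with lock:
                    global_net.load_state_dict(eval_net.state_dict())
                return

            if len(buffer) > ts:                              # Step (5): mini-batch training
                with lock:
                    batch = random.sample(list(buffer), batch_size)
                s = torch.stack([b[0] for b in batch])
                a = torch.tensor([b[1] for b in batch]).unsqueeze(1)
                r = torch.tensor([b[2] for b in batch], dtype=torch.float32)
                s2 = torch.stack([b[3] for b in batch])
                d = torch.tensor([float(b[4]) for b in batch])
                q = eval_net(s).gather(1, a).squeeze(1)
                with torch.no_grad():                         # DDQN target of Eq. (7)
                    a2 = eval_net(s2).argmax(1, keepdim=True)
                    q2 = target_net(s2).gather(1, a2).squeeze(1)
                loss = F.smooth_l1_loss(q, r + gamma * (1 - d) * q2)   # Eq. (1)
                opt.zero_grad()
                loss.backward()
                opt.step()
                if t % sync_every == 0:                       # update target-NN and global G
                    target_net.load_state_dict(eval_net.state_dict())
                    with lock:
                        global_net.load_state_dict(eval_net.state_dict())
        # Step (6): if M steps pass without success, the environment is reinitialized;
        # in practice eps grows during training (see the schedule sketched below), up to 0.95.
```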

Note that in the early stage of exploration, the network has no preference for any specific action due to lack of training. Therefore, in order to increase the exploration capability, we set the greedy value ε to be less than 1, such that during Step (3) the agent takes a random action if the random number (from 0 to 1) generated prior to the decision is greater than ε; otherwise, it makes the decision according to the NN. During the training process, ε grows (i.e., it increases by 0.05 for every 200 explorations), such that the agents gradually come to depend on the NN to make decisions. The learning threshold and the maximum number of explorations are pre-set depending on the complexity of the MMI design.
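As a small illustration, the ε schedule described above can be written as follows (the 0.5 start value and 0.95 cap are given later in this subsection; the exact functional form is our assumption):

```python
def epsilon(step, eps0=0.5, inc=0.05, every=200, eps_max=0.95):
    """Greedy value after `step` exploration steps: +0.05 every 200 steps, capped at 0.95."""
    return min(eps_max, eps0 + inc * (step // every))

# epsilon(0) -> 0.5, epsilon(400) -> 0.6, epsilon(2000) and beyond -> 0.95
```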

The interference area of the MMI is evenly divided into a 20×20 grid matrix. In the initial state of the environment, the coordinates are randomly assigned within the central region marked by the red box in Fig. 3(c), and the initial state s1 consists of the hole distribution and the coordinates of the agent's present position. In order to assign actions to the different hole-etching cases, we set 16 types of actions for the agent, i.e., moving in 8 directions with or without etching, respectively. If the agent chooses an action from 0 to 7 as in Fig. 3(a), then only the position is updated without etching; if from 8 to 15, then the position and etching are both updated as in Fig. 3(b). For example, in Fig. 3(c), when the action in the present state is 12, the agent's position (black dot) is moved one space to the right, and then the etching is done (yellow dot). If the coordinates fall outside the interference region after one move, a new random position is set within that region. After an action is chosen by the agent to form a new state, the FDTD algorithm is used to carry out the MMI simulation for design evaluation and provide a reward value, with the help of A-DDQN to guide the next action. Compared with the previous methods [17–19], the action does not represent the change in values of a vast number of parameters (i.e., 400 holes requiring at least 800 actions), but the change in the hole-etching location (i.e., 8 directions requiring only 16 actions) as in Fig. 3, such that the total number of actions is reduced dramatically. Also, if more degrees of freedom are desired for a more versatile design, properties such as the hole size and material type can be added on top of the 16 actions, and three-dimensional (3D) FDTD simulations can be used to prepare for real device fabrication.
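A sketch of how the 16 actions could be decoded is given below (this is not the authors' environment; the ordering of the 8 directions and the handling of out-of-region moves are assumptions based on the description above):

```python
import random

# Eight unit moves on the 20x20 grid; the actual direction ordering in Fig. 3 may differ.
DIRS = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]
GRID = 20

def apply_action(grid, pos, action):
    """grid: 20x20 list of 0/1 etch flags, pos: (row, col), action: 0..15."""
    dr, dc = DIRS[action % 8]                       # actions 0-7 and 8-15 share directions
    r, c = pos[0] + dr, pos[1] + dc
    if not (0 <= r < GRID and 0 <= c < GRID):       # moved out of the interference region:
        return grid, (random.randrange(GRID), random.randrange(GRID))   # re-seed randomly
    if action >= 8:                                 # actions 8-15 also etch at the new cell
        grid[r][c] = 1
    return grid, (r, c)
```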

Fig. 3. The sketch of 8 actions for the agent to (a) move but without etching, and (b) both move and etch. (c) The sketch of a random state, with the agent taking Action 12 as an example. The black point is the present position, and the yellow point is a hole etched by Action 12.

The detailed network configuration, with the structure and hyperparameters obtained from optimization, is shown in Fig. 4. The size of the convolution kernel is 3, the stride is 2, and the padding is 1. The numbers of channels in the three convolution layers are 32, 64 and 64, respectively. A 21×21 matrix is constructed to represent the etching distribution and the agent's present coordinates, where 0 in the first 20 rows/columns means no etching at this point, and 1 indicates that a circle with 40 nm radius is to be etched. The two numbers in Row 21 are the agent's present coordinates. In order to process the 2D matrix as a picture, convolution layers are used in the NN, as commonly applied in image processing [28]. The state s is fed to the NN through the convolution, activation and batch normalization processes sequentially, and then passes through a linear layer to generate the Q values (as described in the Appendix) for the 16 actions. Finally, the action with the highest Q value is selected for the next move.
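A PyTorch sketch consistent with this description is given below; the conv→activation→batch-norm ordering follows the text, while the choice of ReLU activation is our assumption. The 21×21 state enters as a single channel, and the final linear layer outputs the 16 Q values.

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Three conv blocks (32/64/64 channels, kernel 3, stride 2, padding 1) + linear head."""
    def __init__(self, n_actions=16):
        super().__init__()
        def block(c_in, c_out):
            # Text order: convolution -> activation -> batch normalization.
            return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                                 nn.ReLU(), nn.BatchNorm2d(c_out))
        self.features = nn.Sequential(block(1, 32), block(32, 64), block(64, 64))
        self.head = nn.Linear(64 * 3 * 3, n_actions)   # spatial size shrinks 21 -> 11 -> 6 -> 3

    def forward(self, x):                              # x: (batch, 1, 21, 21) state matrix
        return self.head(self.features(x).flatten(1))  # (batch, 16) Q values

q = QNet()(torch.zeros(1, 1, 21, 21))
print(q.shape, q.argmax(dim=1))   # torch.Size([1, 16]) and the greedy action index
```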

Fig. 4. The schematic diagram of the neural network for the A-DDQN.

The network uses the Adam optimizer [29] with a learning rate of 0.0005 and the SmoothL1 loss function [30] shown in Eq. (1), where x and y represent the left- and right-hand sides of Eq. (7) in the Appendix, respectively. The PyTorch framework in Python is used to build the network.

$$\mathrm{Smooth}_{L1}(x,y)=\begin{cases} 0.5\,(x-y)^2 & \text{if } |x-y|<1 \\ |x-y|-0.5 & \text{otherwise} \end{cases}$$
$$\mathrm{Reward}=\begin{cases} -100000 & \text{if termination} \\ \begin{cases} \mathrm{mean}\left(\frac{T_1}{T_2}\right) & \text{if } \left|\frac{T_1}{T_2}-\mathrm{ratio}\right|<\delta \\ \mathrm{mean}\left(\frac{T_2}{T_1}\right) & \text{else if } \left|\frac{T_2}{T_1}-\mathrm{ratio}\right|<\delta \\ \max\left(\min\left(\frac{T_2}{T_1}\right),\min\left(\frac{T_1}{T_2}\right)\right) & \text{else} \end{cases} & \text{else if } \max(\min(T_1),\min(T_2))\ge 0.5 \\ -10 & \text{else} \end{cases}$$

In addition to the state s and action a, another core part of RL is the reward r, which is pre-set manually. A suitable reward scheme, as given here by Eq. (2), is crucial to guide the agent and improve the learning efficiency. In the exploration process, the agent is not allowed to take an action that would place the etching outside the interference region. If this happens, the exploration is terminated, the agent receives a large negative reward (e.g., −100,000), and a new position is then randomly chosen in the central region. If the transmittance of the higher-transmission port is larger than 0.5 over the whole band (1200∼1650 nm), the reward is calculated according to the ratio formula, where δ represents the maximum allowed error between the target ratio and the actual one, and is set to 10∼13% of the target ratio for the structures designed here. If the power ratio of the device is within the allowed error range, the reward is calculated by averaging T1/T2 or T2/T1. In order to help the agent reach the target structure and prevent it from receiving negative rewards for a long time (which would lead to an ineffective learning process), we assign a sparse reward [31] according to the ratio of the two MMI ports. Otherwise, if the transmittances of both output ports T1 and T2 are smaller than 0.5, the design is not adopted due to its high insertion loss, and the agent receives a negative reward of −10. Since the structure of our device is vertically symmetric, we do not need to specify which of the two output ports has the larger transmission in Eq. (2). In practice, the etching area can be mirrored as needed to exchange the power ratio of the two output ports.
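A Python sketch of the reward of Eq. (2) is shown below, assuming T1 and T2 are transmission spectra sampled over the 1200∼1650 nm band and that the |T1/T2 − ratio| < δ test is applied across the whole band; these conventions are our assumptions rather than the authors' exact implementation.

```python
import numpy as np

def reward(t1, t2, ratio, delta, terminated=False):
    """t1, t2: transmission spectra over wavelength; ratio, delta: target SR and tolerance."""
    t1, t2 = np.asarray(t1, dtype=float), np.asarray(t2, dtype=float)
    if terminated:                                   # the etching left the interference region
        return -100000.0
    if max(t1.min(), t2.min()) < 0.5:                # high insertion loss: reject this design
        return -10.0
    if np.all(np.abs(t1 / t2 - ratio) < delta):      # within the allowed error band
        return float(np.mean(t1 / t2))
    if np.all(np.abs(t2 / t1 - ratio) < delta):
        return float(np.mean(t2 / t1))
    return float(max((t2 / t1).min(), (t1 / t2).min()))   # sparse intermediate reward
```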

Here, four different ratios, 3:2 (or 1.5:1), 2:1, 7:3 (or 2.33:1) and 3:1, are designed sequentially to facilitate the NN training, i.e., as the ratio increases, the agents continue to learn based on the previously trained smaller-ratio NN. Compared with directly designing a large splitting-ratio MMI, this continued-learning approach usually converges to the target more easily. The greedy value ε in Step (3) gradually increases from 0.5 to 0.95; that is, at the beginning of training, only 50% of the actions are decided by the NN and the remaining 50% are selected randomly. Later in the training, the NN gradually takes over the decisions, but up to a 5% chance of random actions is retained to avoid falling into a local minimum.

3. Results

By using A-DDQN, we design the MMI for four different ratios, requiring low insertion loss, wide bandwidth and a small footprint. The hole distributions of these samples are shown in Fig. 5(a)(d)(g)(j). It can be seen that the number of etched holes is proportional to the power ratio between the two ports, and the MMI designed by our method has relatively few holes, whose distribution is also concentrated in a certain area. In the design process, the agent usually etches holes near the low-transmission port to block the light, which is consistent with human intuition. The total number of optimization steps depends highly on the path/history of the etching process, which contains a high proportion of random actions [27].

Fig. 5. Four samples with ratio 1.5:1, 2:1, 2.33:1 and 3:1 designed by A-DDQN, respectively. (a)(d)(g)(j) Distributions of the etched holes; (b)(e)(h)(k) their transmission spectra from the two output ports; and (c)(f)(i)(l) power distributions in the interference region of MMI at 1650 nm wavelength, for the four samples.

Figures 5(b)(e)(h)(k) show the transmittance spectra of the four splitters, whose wavelength ranges are all from 1200 to 1650 nm with 451 sampling points. The red and blue dashed lines are the transmissions of the two ports, respectively. The magenta and orange dotted lines are the lowest and highest SRs, compared with the target indicated in the upper right corner of each figure. Figures 5(c)(f)(i)(l) show the power distributions in the interference region and the surrounding area of the MMI at a wavelength of 1650 nm.

Spectra of the total transmission of the two ports and their SRs are shown in Figs. 6(a) and 6(b), respectively, where it can be seen that the total transmissions are all above 80%, and the ratios across a broad range of 450 nm are within 10∼13% of the desired design target.

Fig. 6. Spectra for (a) the total transmission and (b) the power splitting ratio, for each of the above four cases.

We repeat the DRL design for 10 trials in each group, as summarized in Fig. 7. The average transmission over the whole wavelength range (y-axis) and the average ratio (x-axis) are marked by dots, and the dashed lines indicate the target power ratios. We can see that as the SR increases, the total power of the output ports decreases, which may be caused by intensified scattering in the interference region when more holes are etched.

Fig. 7. The averages of total transmission and power ratio spectra for the four groups, where the dash lines indicate the targets.

To further analyze the wavelength dependence of the four ratio groups, we plot the scattered points at 1200 nm, 1310 nm, 1550 nm and 1650 nm, respectively, in Fig. 8. The horizontal axis represents the ratio of the two ports at the selected wavelength, and the vertical axis represents the total transmission of the two ports. By comparing the figures, it can be seen that the scattering loss is low and the splitting power is relatively concentrated at the four selected wavelengths. The increase in fluctuations at higher SRs occurs because the allowed error is set to ±10∼13% of the target, so the points spread relatively wider along the x-axis at larger SR values.

Fig. 8. The total transmission and ratio of the four ratio groups at (a) 1200 nm, (b) 1310 nm, (c) 1550 nm and (d) 1650 nm.

We also count the number of etched holes in the 40 trials, as shown in Fig. 9; the average values for the four groups are 7.2, 16.3, 17.2 and 25.7, respectively. Fewer etchings mean a simpler device production process, compared to structures with 100∼200 holes as in Refs. [13,21,32]. The return loss is also small, with a maximum value of 0.0850 and an average of 0.0103. Since there is no need to prepare a dataset, and the previously trained model can be used again for continued learning, the design time can be greatly reduced from about 100 hours [13,32] to approximately 10 hours. Generally, it takes less than 48 hours here to design the four MMIs, and in some cases it even takes less than an hour to design a power splitter with a lower SR.

Fig. 9. The histogram of etched holes for the 40 trials.

To verify our design for real device fabrication, varFDTD (or 2.5D FDTD) is used, which simplifies the 3D geometry into a 2D set of effective indices for the 2D FDTD solver. The result is compared with the design obtained by 3D FDTD in Fig. 10, where only slight differences can be found, and the 2.5D simulation has a much higher computational speed than 3D, balancing accuracy and speed for a preliminary design [33].

Fig. 10. The MMI with 2:1 power ratio optimized by A-DDQN in 2.5D FDTD simulation. (a) Distributions of the etched holes; (b) the transmission spectra from the two output ports in 2.5D and 3D FDTD simulation.

In Table 1, we compare several main optimization methods with our proposal for broadband MMI splitter designs. It can be seen that A-DDQN has clear advantages in the overall computational time, as well as a smaller footprint and a smaller number of holes, and it can be used without preparing a dataset as in the other NN methods.

Table 1. Comparison of MMI Power Splitters Optimized by Different Methods

From the above discussion, we can see that the A-DDQN method has more dynamic characteristics in the exploration for a target structure, as it is more path-related during the optimization/design process. By removing the requirement for a prepared database, the method is able to explore more freely with its self-learning ability, which is suitable not only for the design of the MMI power splitter, but also for problems that have a time sequence or evolution path, such as optical communication systems with various time-dependent and nonlinear optical components.

4. Conclusions

The A-DDQN algorithm is proposed to design MMI power splitters with low insertion loss and a stable power ratio over the 1200 to 1650 nm wavelength range, where the agent is trained for hole etchings step by step. By sharing the weights and S-A-R-S-A sequences, the agents in A-DDQN can effectively reduce the training time. Also, this method can simplify the structure for fabrication by etching relatively few holes to obtain a certain power splitting ratio. If a new splitter is to be designed, the previously trained NN can be used for continued training to speed up the exploration.

The number of steps required to design an MMI ranges from hundreds to thousands, and the whole training process for the four MMIs usually takes less than 48 hours, carried out on an AMD EPYC 7742 processor with no GPU. In addition, if a GPU is used in the A-DDQN training, the speed could be further accelerated.

Appendix: DQN and DDQN

Compared to the training process of a traditional neural network, the training of DQN is based on the temporal-difference (TD) method [26], and the training samples are not the conventional combination of features and labels, but S (state)-A (action)-R (reward)-S-A sequences [25], as in Eq. (3).

$$\cdots \to {s_t} \to {a_t} \to {r_t} \to {s_{t + 1}} \to {a_{t + 1}} \to \cdots $$

The current state of DQN is st. The agent estimates the Q value [16] of each action according to the current state st and selects the corresponding action at with the highest value, as shown in Fig. 11. The decision-making process, denoted by π, is given in Eq. (4).

$$\pi (s )= \arg \mathop {\max }\limits_a {Q^\pi }({s,a} )$$

Fig. 11. The schematic diagram of the decision-making process for the DQN.

The reward rt comes from the environment and is obtained by analyzing the new state st+1 after the decision is made. The agent then repeats the above-mentioned steps for the new state st+1, until a satisfactory reward value is obtained or the process is terminated for some other reason, e.g., the number of exploration steps exceeds the maximum. The total value of the whole process, Rt, is calculated as in Eq. (5).

$${R_t} = {r_t} + \gamma {r_{t + 1}} + {\gamma ^2}{r_{t + 2}} + \cdots + {\gamma ^T}{r_T} = {r_t} + \gamma {R_{t + 1}}$$
$${Q^\pi }({{s_t},{a_t}} )= {r_t} + \gamma {Q^\pi }({{s_{t + 1}},{a_{t + 1}}} )$$

Here, γ is a discount factor representing the impact of the values of subsequent actions on Rt. We assume that subsequent action values have a gradually smaller influence, i.e., γ is between 0 and 1, and it is usually set to 0.9. After a few steps, the training of the agent starts with the historical S-A-R-S-A sequences, and the network is updated by Eq. (6) [24]. Ultimately, the agent can generate the highest reward through its decisions in any state.
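As a small numerical illustration of Eq. (5) with γ = 0.9 (the example rewards below are arbitrary), the return can be accumulated backwards through an episode:

```python
def discounted_returns(rewards, gamma=0.9):
    """Return [R_1, R_2, ...] computed by R_t = r_t + gamma * R_{t+1}."""
    returns, acc = [], 0.0
    for r in reversed(rewards):
        acc = r + gamma * acc
        returns.append(acc)
    return list(reversed(returns))

print(discounted_returns([1.0, 0.0, 2.0]))   # approximately [2.62, 1.8, 2.0]
```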

In this scheme, the agent always chooses the action with the maximum Q value, which usually results in overestimation [24] when making decisions. In order to solve the serious overestimation problem in DQN, double DQN (DDQN) was first proposed by DeepMind [24], evolving from the double Q-learning method [35]. There are two neural networks in DDQN, i.e., the target-NN and the evaluation-NN, as shown in Fig. 12, and the networks are updated by Eq. (7).

$${Q^{{\pi _{\textrm{eval}}}}}({{s_t},{a_t}} )= {r_t} + \gamma {Q^{{\pi _{target}}}}({{s_{t + 1}},{\pi_{eval}}({{s_{t + 1}}} )} )$$

The target-NN is used for value estimation, and the evaluation-NN is used for decision-making in the training process. The weights of the evaluation-NN are updated in real time during training by Eq. (7), while the weights of the target-NN are updated every few steps. The target-NN therefore has the same network structure as the evaluation-NN, so that it can receive the weights. This method relieves the overestimation by decoupling DQN into two networks, and DDQN has been validated to have superior performance over DQN in Ref. [24].
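In PyTorch-style pseudocode (a hedged sketch, with `eval_net` and `target_net` standing for the two networks in Fig. 12), the right-hand side of Eq. (7) can be computed as:

```python
import torch

@torch.no_grad()
def ddqn_target(eval_net, target_net, next_states, rewards, gamma=0.9):
    next_actions = eval_net(next_states).argmax(dim=1, keepdim=True)     # pi_eval(s_{t+1})
    next_q = target_net(next_states).gather(1, next_actions).squeeze(1)  # Q^{pi_target}(s_{t+1}, .)
    return rewards + gamma * next_q                                      # r_t + gamma * Q(...)
```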

Fig. 12. The schematic diagram for DDQN.

Funding

National Key Research and Development Program of China (2018YFA0209000).

Acknowledgements

Thanks to Junlei Han for his help in building the neural network model.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. H. C. Chung, T. C. Wang, Y. J. Hung, and S. Y. Tseng, “Robust silicon arbitrary ratio power splitters using shortcuts to adiabaticity,” Opt. Express 28(7), 10350–10362 (2020). [CrossRef]  

2. S. Zhao, W. Liu, J. Chen, Z. Ding, and Y. Shi, “Broadband arbitrary ratio power splitters based on directional couplers with subwavelength structure,” IEEE Photonics Technol. Lett. 33(10), 479–482 (2021). [CrossRef]  

3. Z. Lin and W. Shi, “Broadband, low-loss silicon photonic Y-junction with an arbitrary power splitting ratio,” Opt. Express 27(10), 14338–14343 (2019). [CrossRef]  

4. B. Wu, C. Sun, Y. Yu, and X. Zhang, “Integrated optical coupler with an arbitrary splitting ratio based on a mode converter,” IEEE Photonics Technol. Lett. 32(1), 15–18 (2020). [CrossRef]  

5. K. Kojima, M. H. Tahersima, T. Koike-Akino, D. K. Jha, Y. Tang, Y. Wang, and K. Parsons, “Deep neural networks for inverse design of nanophotonic devices,” J. Lightwave Technol. 39(4), 1010–1019 (2021). [CrossRef]  

6. A. Y. Piggott, J. Lu, K. G. Lagoudakis, J. Petykiewicz, T. M. Babinec, and J. Vučković, “Inverse design and demonstration of a compact and broadband on-chip wavelength demultiplexer,” Nat. Photonics 9(6), 374–377 (2015). [CrossRef]  

7. S. Molesky, Z. Lin, A. Y. Piggott, W. Jin, J. Vucković, and A. W. Rodriguez, “Inverse design in nanophotonics,” Nat. Photonics 12(11), 659–670 (2018). [CrossRef]  

8. J. Han, J. Huang, J. Wu, and J. Yang, “Inverse designed tunable four-channel wavelength demultiplexer,” Opt. Commun. 465, 125606 (2020). [CrossRef]  

9. J. Lu and J. Vučković, “Nanophotonic computational design,” Opt. Express 21(11), 13351–13367 (2013). [CrossRef]  

10. T. Fujisawa and K. Saitoh, “Bayesian direct-binary-search algorithm for the efficient design of mosaic-based power splitters,” OSA Continuum 4(4), 1258–1270 (2021). [CrossRef]  

11. J. C. C. Mak, C. Sideris, J. Jeong, A. Hajimiri, and J. K. Poon, “Binary particle swarm optimized 2 × 2 power splitters in a standard foundry silicon photonic platform,” Opt. Lett. 41(16), 3868–3871 (2016). [CrossRef]

12. D. Melati, Y. Grinberg, M. K. Dezfouli, S. Janz, P. Cheben, J. H. Schmid, A. Sanchez-Postigo, and D.-X. Xu, “Mapping the global design space of nanophotonic components using machine learning pattern recognition,” Nat. Commun. 10(1), 4775 (2019). [CrossRef]  

13. Y. Tang, K. Kojima, T. Koike-Akino, Y. Wang, P. Wu, Y. Xie, M. H. Tahersima, D. K. Jha, K. Parsons, and M. Qi, “Generative deep learning model for inverse design of integrated nanophotonic devices,” Laser Photonics Rev. 14(12), 2000287 (2020). [CrossRef]  

14. N. J. Dinsdale, P. R. Wiecha, M. Delaney, J. Reynolds, M. Ebert, I. Zeimpekis, D. J. Thomson, G. T. Reed, P. Lalanne, K. Vynck, and O. L. Muskens, “Deep learning enabled design of complex transmission matrices for universal optical components,” ACS Photonics 8(1), 283–295 (2021). [CrossRef]  

15. L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: a survey,” J. Artificial Intell. Res. 4, 237–285 (1996). [CrossRef]  

16. V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature 518(7540), 529–533 (2015). [CrossRef]  

17. A. Jiang, Y. Osamu, and L. Chen, “Multilayer optical thin film design with deep Q learning,” Sci. Rep. 10(1), 12780 (2020). [CrossRef]  

18. T. Badloe, I. Kim, and J. Rho, “Biomimetic ultra-broadband perfect absorbers optimised with reinforcement learning,” Phys. Chem. Chem. Phys. 22(4), 2337–2342 (2020). [CrossRef]  

19. I. Sajedian, T. Badloe, and J. Rho, “Optimisation of colour generation from dielectric nanostructures using reinforcement learning,” Opt. Express 27(4), 5874–5883 (2019). [CrossRef]  

20. L. Su, A. Y. Piggott, N. V. Sapra, J. Petykiewicz, and J. Vučković, “Inverse design and demonstration of a compact on-chip narrowband three-channel wavelength demultiplexer,” ACS Photonics 5(2), 301–305 (2018). [CrossRef]  

21. K. Wang, X. Ren, W. Chang, L. Lu, D. Liu, and M. Zhang, “Inverse design of digital nanophotonic devices using the adjoint method,” Photonics Res. 8(4), 528–533 (2020). [CrossRef]  

22. C. Vieu, F. Carcenac, A. Pépin, Y. Chen, M. Mejias, A. Lebib, L. Manin-Ferlazzo, L. Couraud, and H. Launois, “Electron beam lithography: resolution limits and applications,” Appl. Surf. Sci. 164(1-4), 111–117 (2000). [CrossRef]  

23. “FDTD Solutions,” Lumerical Solutions, Inc., Vancouver, BC, Canada, 2020. [Online]. Available: http://www.lumerical.com/tcad-products/fdtd/.

24. H. van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” arXiv:1509.06461 (2015).

25. A. Asghari, M. K. Sohrabi, and F. Yaghmaee, “Task scheduling, resource provisioning, and load balancing on scientific workflows using parallel SARSA reinforcement learning agents and genetic algorithm,” J. Supercomput 77(3), 2800–2828 (2021). [CrossRef]  

26. R. S. Sutton, “Learning to predict by the methods of temporal differences,” Mach. Learn. 3(1), 9–44 (1988). [CrossRef]  

27. M. Tokic, “Adaptive ε-greedy exploration in reinforcement learning based on value differences,” KI 2010: Advances in Artificial Intelligence, 203–210 (2010).

28. I. Sajedian, J. Kim, and J. Rho, “Finding the optical properties of plasmonic structures by image processing using a combination of convolutional neural networks and recurrent neural networks,” Microsyst. Nanoeng. 5(1), 27 (2019). [CrossRef]  

29. N. S. Keskar and R. Socher, “Improving generalization performance by switching from Adam to SGD,” arXiv:1712.07628 (2017).

30. P. J. Huber, “Robust Estimation of a Location Parameter,” in Breakthroughs in Statistics: Methodology and Distribution (Springer New York, 1992), pp. 492–518.

31. Y. Wu and Y. Tian, “Training agent for first-person shooter game with actor-critic curriculum learning,” The 5th International Conference on Learning Representations (2017).

32. K. Xu, L. Liu, X. Wen, W. Sun, N. Zhang, N. Yi, S. Sun, S. Xiao, and Q. Song, “Integrated photonic power divider with arbitrary power ratios,” Opt. Lett. 42(4), 855–858 (2017). [CrossRef]  

33. “2.5D varFDTD solver introduction,” https://support.lumerical.com/hc/en-us/articles/360034917213-varFDTD.

34. M. H. Tahersima, K. Kojima, T. Koike-Akino, D. Jha, B. Wang, C. Lin, and K. Parsons, “Deep neural network inverse design of integrated photonic power splitters,” Sci. Rep. 9(1), 1368 (2019). [CrossRef]  

35. H. van Hasselt, “Double Q-learning,” the 24th Annual Conference on Neural Information Processing Systems (2010).
