Abstract
Understanding process variations and their impact in silicon photonics remains challenging. To achieve high-yield manufacturing, a key step is to extract the magnitude and spatial distribution of process variations in the actual fabrication, which is usually based on measurements of replicated test structures. In this paper, we develop a Bayesian-based method to infer the distribution of systematic geometric variations in silicon photonics, without requiring replication of identical test structures. We apply this method to characterization data from multiple silicon nitride ring resonators with different design parameters. We extract distributions with standard deviations of 28 nm, 0.8 nm, and 3.8 nm for the width, thickness, and partial etch depth, respectively, as well as the spatial maps of these variations. Our results show that this characterization and extraction approach can serve as an efficient method to study process variation in silicon photonics, facilitating the design of high-yield silicon photonic circuits in the future.
© 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement
1. Introduction
Integrated silicon photonics is a rising technology platform that adopts existing CMOS fabrication infrastructure to meet the future demands of high-rate data communication and to enable other novel applications in sensing, LiDAR, quantum computation, and more [1–3]. To increase yield and performance of photonic components and circuits, the effects of manufacturing variations in silicon photonics are starting to be studied and reported [4–7]. One of the key steps in process variation analysis is to find the magnitude and spatial distributions of process variations in the actual fabrication. These variations are usually inferred from measurements of spatially replicated test structures [5–7].
However, the accuracy of such inference is highly dependent on the design of the test structures. For example, [5] uses ring resonators (RRs) as the test structure, and extracts the variations of total thickness of silicon-on-insulator layer $T$ and waveguide width $w$ from the resonance wavelength and group index. While the mapping is unique and sufficient for the extraction, the group index is not very sensitive to the variation of $T$ and $w$, and can be easily affected by measurement noise. Thus the extraction results in [5] show a strong correlation between these two variations; this is not aligned with the physics of the manufacturing process, because the variations come from two different and independent processes, i.e., deposition and etching.
In order to achieve high accuracy in such inference, the test structure needs to be carefully designed so that its performance is sensitive to the targeted variations, and such that one can decouple the effects of different variation components. As an example, [6] uses a combination of two different Mach-Zehnder interferometers (MZIs) to infer the same variations as [5], and is able to decouple the effects, producing variation estimates with a more realistic, weak correlation. However, finding test structures that achieve this separation can take many cycles of design, fabrication, and measurement, with substantial cost in time and resources.
In this work, we develop a Bayesian-based method to extract the systematic geometric process variation information from design characterization measurements for different devices that are not spatially replicated. The method is applied to characterization data from multiple silicon nitride RRs (Fig. 1), and the inference of the width $w$, thickness $T$, and partial etch depth $h$ variations on the silicon nitride ridge waveguide (Fig. 2(a)) gives results with good confidence. Since such characterization experiments are widely used for photonic component design and process design kit (PDK) development, our method provides an alternative, efficient, and accessible approach to study process variations in silicon photonics, and thus facilitates the design of high-yield silicon photonic circuit applications in the future.
The original data and source code for implementation are available in Ref. [8].
2. Data extraction and challenges
The characterization data is from 48 silicon nitride all-pass RRs with varying radius and gap width, as shown in Fig. 1(a). Fabrication was performed in the MIT Microsystems Technology Laboratories (MTL) cleanroom facilities on a 6-inch wafer (WaferPro, LLC) with single-sided 200 nm stoichiometric Si$_{3}$N$_{4}$ on 3 $\mathrm {\mu }$m thermal SiO$_{2}$. A single electron-beam lithography step and a reactive ion etch step were used to pattern the ridge waveguides and grating couplers into the silicon nitride layer, leaving a 35 nm slab. After resist removal and cleaning, a 2 $\mathrm {\mu }$m thick film of SiO$_{2}$ was deposited and subsequently annealed in nitrogen at 950$^{\circ }$C to remove hydrogen and densify the film [9]. The characterization measurements were performed using a home-built wafer-scale automated system, a 250-$\mathrm {\mu }$m-pitch fiber array (PLC Connections), and an optical vector analyzer in the C and L bands (LUNA OVA 5000). A typical measurement frequency response is shown in Fig. 1(b).
In our measurements, all of the RR devices share almost exactly the same baseline insertion loss resulting from the grating couplers, which thus can be evaluated and removed by averaging and smoothing the responses of RRs with no observable resonance. We denote the response of the $i$-th device after this detrending as $I_i(\lambda )$, where $\lambda$ is wavelength.
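This baseline removal can be sketched in a few lines; the following is an illustrative numpy version (not the released implementation [8]), assuming the responses are sampled on a common wavelength grid and expressed in dB:

```python
import numpy as np

def detrend(response, baseline_responses, win=51):
    """Estimate the shared grating-coupler envelope by averaging the
    responses of resonance-free devices, smoothing with a moving mean,
    and subtracting it (responses in dB on a common wavelength grid)."""
    base = np.mean(baseline_responses, axis=0)       # average resonance-free devices
    kernel = np.ones(win) / win
    smooth = np.convolve(base, kernel, mode="same")  # moving-average smoothing
    return response - smooth
```

Near the grid edges the moving average is biased by zero padding, so the first and last `win // 2` samples should be discarded before peak extraction.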
For each device with observable resonances, we use the workflow shown in Fig. 2(b) to extract its variation values. In particular, for each resonant peak of each device, the resonance wavelength, peak depth, and full width at half maximum (FWHM) are extracted by nonlinear curve fitting based on a Lorentzian model, which is an approximate form of the theoretical RR model response [10] when the peak is narrow and sharp enough:
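As a concrete sketch of this peak-extraction step, the snippet below fits one dip to a Lorentzian; the parameterization (resonance wavelength, FWHM, depth) follows the text, while the function names and the initial-guess heuristic are our own illustration, not the paper's code:

```python
import numpy as np
from scipy.optimize import curve_fit

def lorentzian_dip(lam, lam0, fwhm, depth):
    """Lorentzian approximation of an all-pass RR transmission dip."""
    return 1.0 - depth / (1.0 + ((lam - lam0) / (fwhm / 2.0)) ** 2)

def fit_peak(lam, trans):
    """Extract (resonance wavelength, FWHM, peak depth) from one dip
    by nonlinear least squares, seeded from the raw minimum."""
    i0 = np.argmin(trans)
    p0 = [lam[i0], (lam[-1] - lam[0]) / 10.0, 1.0 - trans[i0]]
    popt, pcov = curve_fit(lorentzian_dip, lam, trans, p0=p0)
    return popt, pcov
```

The returned covariance `pcov` is the raw per-peak error estimate reused in Section 3.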
A first-order sensitivity matrix $\mathbf {S}_i$ is built for each device $i$ based on simulation that connects geometric variations $\boldsymbol {p}_i = [\Delta w_i, \Delta T_i, \Delta h_i]^\mathrm {T}$ to the change of the physical properties at all the peaks:
where the nominal values of the physical properties of the peaks $\boldsymbol {\bar {y}}_i$ are also obtained from simulation. It should be noted that we ignore the material refractive index variation in our model, which can be significant and potentially affect the extraction results for silicon nitride devices. However, if one is interested in such variation, one can simply add the index variation $\Delta n_i$ into the model $\boldsymbol {p}_i$, and use the same method presented in this paper to extract this or other additional variation sources.

2.1 Auxiliary gradient parameters
It is worth noting that the physical properties in $\boldsymbol {y}_i$ are not occurring at the same location: $n_\text {eff}$ and $n_\text {g}$ are properties of the ring, and $t$ (or $L$) is a property of the ring-waveguide coupler. Considering the size of the ring, it is typical to assume that these properties are experiencing different geometric variations. Therefore, a more rigorous formulation for the variations is $\boldsymbol {p}_i = [\Delta w_i, \Delta T_i, \Delta h_i, \Delta w'_i, \Delta T'_i, \Delta h'_i]^\mathrm {T}$, where the first three indices correspond to the variations at the ring, and the latter three correspond to the variations at the coupler.
In this paper, we are using a different but equivalent formulation:
In the main sections of this paper, we focus on the variation distribution of $w$, $T$, and $h$, and only use the gradient parameters for auxiliary purposes.
2.2 Nominal fitting design
Of the various sources of potential error in the extraction, one that can be easily fixed is the nominal value errors from simulation. This is not a problem for repetitive test structures, since a deviation in nominal value can only result in a systematic mean shift in the results. But since we have different devices, such deviation is different for each device, and therefore introduces extra variance to the results. This issue can be ameliorated by adjusting the nominal values according to the extracted data, i.e., shift and scale the values in each group so that they fit the physical properties from data.
We denote the radius and gap width of the $i$-th RR device as $R_i$ and $\chi _i$, respectively. Effective index and group index are only dependent on the ring and thus are only a function of the radius; the coupling coefficient is a function of both.
In order to simplify the modeling, we choose the problem formulation (2) so that we only need to model the nominal values of effective index at one certain wavelength $\lambda _c$. We denote the simulated nominal effective index for radius $R$ at wavelength $\lambda _c$ as $\hat {n}_\text {eff}(R)$. We now consider the possible error in simulation. We assume the actual nominal value is a “stretch and shift” version of the simulation:
This gives a linear nominal model
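The "stretch and shift" correction is an ordinary least-squares fit of two scalars across all devices. A minimal numpy sketch (names are illustrative), fitting $\bar {n}_\text {eff} = a\,\hat {n}_\text {eff} + b$:

```python
import numpy as np

def stretch_shift_fit(n_sim, n_meas):
    """Fit the 'stretch and shift' nominal correction n_bar = a * n_hat + b
    by least squares (n_sim: simulated nominal values across devices,
    n_meas: corresponding values extracted from measurement)."""
    X = np.column_stack([n_sim, np.ones_like(n_sim)])
    (a, b), *_ = np.linalg.lstsq(X, n_meas, rcond=None)
    return a, b
```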
We can apply a similar approach to model the nominal group index, but we also need to consider the wavelength dependency. A simple adaptation is to first fit the simulated nominal values to a linear wavelength model:
where $\hat {\eta }_0(R)$ and $\hat {\eta }_1(R)$ are the fitting coefficients. Then we perform the same “stretch and shift” approach, which then gives $\bar {n}_\text {g}(\lambda, R) = \bar {\eta }_0(R) + \lambda \bar {\eta }_1(R)$.

For the nominal coupling coefficient model, we choose a different approach. This is because the simulation for the coupling coefficient is more complicated and less accurate, and not suitable for the “stretch and shift” scheme. On the other hand, by using the nonlinear transformation $L=\ln \arccos {t}$, we are able to use a simple polynomial model for the nominal value of $L$:
Combining (7), (10), and (12), we obtain
2.3 Naive approach
Combining the nominal fitting in Section 2.2, the problem formulation (3) becomes
For a naive approach, we first find the coefficient $\boldsymbol {w}$ by minimizing the difference between the measurement data $\boldsymbol {y}_i$ and $\mathbf {A}_i \boldsymbol {w}$ across all the devices, which is a least-squares regression problem. Then we solve each $\boldsymbol {p}_i$ using least-squares regression:
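The two-stage procedure can be sketched as follows (a simplified numpy illustration of the naive approach, not the released code [8]; `A_i` maps the shared nominal coefficients $\boldsymbol {w}$ and `S_i` maps the per-device variations $\boldsymbol {p}_i$ to the extracted properties):

```python
import numpy as np

def fit_nominal_coeffs(A_list, y_list):
    """Stage 1: least-squares fit of the shared nominal coefficients w
    across all devices, ignoring the per-device variation terms."""
    A = np.vstack(A_list)
    y = np.concatenate(y_list)
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def fit_device_variations(S_list, A_list, y_list, w):
    """Stage 2: per-device least-squares solve for p_i from the residual
    y_i - A_i w using the sensitivity matrix S_i."""
    return [np.linalg.lstsq(S, y - A @ w, rcond=None)[0]
            for S, A, y in zip(S_list, A_list, y_list)]
```

Because stage 1 ignores the $\mathbf {S}_i \boldsymbol {p}_i$ terms, its estimate of $\boldsymbol {w}$ is biased whenever the variations do not average out, which is one motivation for the Bayesian treatment in Section 5.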
Moreover, there are multiple solutions for the effective index in the extraction based on different mode number $M$:
For rings with small radius, such solutions are widely separated, and it is easy to select correctly based on their distance to the nominal value. But when the radius of the ring is larger, these solutions become close to each other and are hard to separate and select, which introduces additional error if we pick the wrong mode number.
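Selecting the mode number amounts to enumerating the candidate effective indices $M \lambda _\text {res} / (2\pi R)$ for nearby integers $M$ and keeping the one closest to the simulated nominal value. A small illustrative helper (names and the search window are our own):

```python
import numpy as np

def neff_candidates(lam_res, radius, n_nominal, n_side=2):
    """List effective-index solutions for nearby integer mode numbers M
    and pick the candidate closest to the simulated nominal value.
    Candidate spacing lam_res / (2 pi R) shrinks as the radius grows,
    which is what makes large rings ambiguous."""
    circumference = 2.0 * np.pi * radius
    M0 = round(n_nominal * circumference / lam_res)
    Ms = np.arange(M0 - n_side, M0 + n_side + 1)
    cands = Ms * lam_res / circumference
    best = cands[np.argmin(np.abs(cands - n_nominal))]
    return best, cands
```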
3. Error estimation
As noted in Section 2, one of the challenges of the problem comes from the error in the measurement and extraction process. Therefore, if we can manage to obtain information about the extraction error, we may be able to achieve more accurate inference by including those error estimates.
The most direct approach for error estimation is to perform multiple measurements on the same devices, which will not only provide a good estimation of extraction error, but also reduce the noise in the results. However, in some cases, we may need to perform analysis on historical measurement data where the devices have only been measured once, as is the case in this paper. In this section, we present how we can still perform estimation of extraction error even from single measurements on each device across a set of characterization devices.
A rather simple approach for error estimation is to estimate the covariance of the fitting coefficients in (1) from the fitting residual and the linearized version of the model around the fitting results [11]. Then we can estimate the covariance matrix $\mathbf {\Sigma }_i$ of $\boldsymbol {y}_i$ from the linearized relationship between $\boldsymbol {y}_i$ and the coefficients $\boldsymbol {\theta }$. For simplicity, here we assume that the errors of the extracted properties at different peaks are all uncorrelated, i.e., $\mathbf {\Sigma }_i$ is a diagonal matrix.
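In practice this raw estimate follows the standard nonlinear-regression recipe [11]: at the fitted solution, the coefficient covariance is approximated as $s^2(\mathbf {J}^\mathrm {T}\mathbf {J})^{-1}$, with $\mathbf {J}$ the model Jacobian and $s^2$ the residual variance. A minimal sketch (the function name is our own):

```python
import numpy as np

def fit_covariance(jac, resid):
    """Approximate covariance of fitted coefficients from the residual
    vector and the model Jacobian at the solution: s^2 (J^T J)^{-1}."""
    n, p = jac.shape
    s2 = resid @ resid / (n - p)          # residual variance estimate
    return s2 * np.linalg.inv(jac.T @ jac)
```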
However, this approach is not very accurate, since it assumes that the measurement noise is independent and identically distributed (i.i.d.) at different wavelength points, and the entire estimation process uses linear approximations at multiple steps which accumulate the error. One way to visualize this issue is to look at the wavelength dependency of these extracted properties, which is supposed to behave as some smooth and low-order polynomial response based on simulation. Since the effect of process variations is highly correlated across the wavelength, the uncorrelated “rough” perturbations seen in the extracted response (e.g., at 1582 and 1592 nm) in Fig. 3(a) must come from the extraction error. While the raw estimation $\hat {\mathbf {\Sigma }}_i$ from the nonlinear fitting residual here indeed shows a correct relative trend among different peaks, it is often too small overall to explain all the uncorrelated perturbations.
Fortunately, this visualization also motivates an approach to improve the estimation. We can fit the extracted wavelength responses to a low-order polynomial function (e.g., a linear or quadratic function) using the raw estimated $\hat {\mathbf {\Sigma }}_i$ as weights, then choose a proper scaling factor $\gamma _i$ such that $\mathbf {\Sigma }_i=\gamma _i\hat {\mathbf {\Sigma }}_i$ matches the residual from the weighted fitting.
Letting $z_k$ denote the extracted value at wavelength $\lambda _k$, we can build the probabilistic model as
We note that finding $\boldsymbol {w}_z$ is just a weighted linear regression where the weight is inversely proportional to the variance of the extracted error, and the scaling factor is decided by the residual of the weighted fitting. We perform this scaling adjustment to $n_\text {eff}$, $n_\text {g}$, and $L$ of each device separately. But for the effective index $n_\text {eff}$, since the problem formulation (2) only uses a single wavelength point $\lambda _c$, we just need to find the corresponding $\boldsymbol {w}_z$ and calculate the fitting value at $\lambda _c$:
As seen in Fig. 3(a), this improved approach provides a fairly accurate estimation of the extraction error level.
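A numpy sketch of this rescaling (illustrative names; the polynomial degree and the inverse-variance weights follow the description above):

```python
import numpy as np

def rescale_covariance(lam, z, var_raw, deg=2):
    """Fit the extracted wavelength response z(lam) to a low-order
    polynomial weighted by the raw variance estimates, then choose the
    scaling factor gamma so the scaled variances match the weighted
    fitting residual (weighted chi^2 per degree of freedom = 1)."""
    sigma = np.sqrt(var_raw)
    coef = np.polyfit(lam, z, deg, w=1.0 / sigma)  # polyfit weights multiply residuals
    r = z - np.polyval(coef, lam)
    dof = len(z) - (deg + 1)
    gamma = np.sum((r / sigma) ** 2) / dof
    return gamma * var_raw, gamma
```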
4. Bayesian approach: a toy example
Once we have error information, we can introduce a Bayesian model to incorporate it into our analysis. In order to illustrate this idea, we start from a simple 1D toy example:
where the observed data $\{y_i\}$, $y_i \sim \mathsf {y}_i$, consist of normally distributed clean data $\hat {\mathsf {y}}_i\sim \mathcal {N}(\mu, \sigma ^2)$ and noise with different variances: $\mathsf {e}_i\sim \mathcal {N}(0, \sigma _i^2)$. Figure 3(b) shows one instantiation of this example, with number of data points $N=40$, $\mu =0$, $\sigma =1$, and $\sigma _i$ varying from 0.01 to 10.

Suppose that all the $\sigma _i$ are known, and we want to estimate $\mu$ and $\sigma$ from the observed data $\{{y}_i\}$. The maximum likelihood (ML) estimator of the problem is the $(\mu, \sigma )$ that maximizes the probability of the observed data: $p_\mathsf {y}(\{{y}_i\}) = \prod _{i=1}^N \mathcal {N}({y}_i; \mu, \sigma ^2 + \sigma _i^2)$. One approach to find the ML estimates is the expectation-maximization (EM) algorithm [12], which is an iterative method commonly used to find the ML estimates of parameters in a probabilistic model. For the toy problem, the algorithm starts from a guess of the parameters $(\mu, \sigma )$, and updates the guess in each iteration using the following steps:
We run the algorithm on the instantiation shown in Fig. 3(b), using the statistics of the noisy data $\{\mathsf {y}_i\}$ as the starting point, and compare the result with the ones directly calculated from the clean data. As shown in Fig. 4(a), the algorithm converges to error $<5{\%}$ within 20 iterations. Intuitively, the algorithm starts from the samples with smaller error variance and uses the estimation to adjust the large error samples. The E-step (21) also provides an estimation for the clean data with confidence interval (shown in Fig. 4(b)), which will be used to estimate the process variation for each device in the next section.
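The toy EM loop above fits in a few lines of numpy (an illustrative reimplementation, with the posterior mean and variance of each clean sample computed in closed form):

```python
import numpy as np

def em_toy(y, sigma_i, n_iter=100):
    """EM for y_i = yhat_i + e_i with yhat_i ~ N(mu, var) and
    e_i ~ N(0, sigma_i^2), sigma_i known. Returns (mu, sigma)."""
    mu, var = y.mean(), y.var()                      # start from noisy-data statistics
    for _ in range(n_iter):
        # E-step: Gaussian posterior of each clean sample given y_i
        v_post = var * sigma_i**2 / (var + sigma_i**2)
        m_post = (sigma_i**2 * mu + var * y) / (var + sigma_i**2)
        # M-step: update (mu, var) from the posterior moments
        mu = m_post.mean()
        var = np.mean(v_post + (m_post - mu) ** 2)
    return mu, np.sqrt(var)
```

With more samples than the $N=40$ of Fig. 3(b), the recovered $(\mu, \sigma)$ closes in on the clean-data statistics even when many of the $\sigma_i$ are large.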
5. Implementation
We assume that the process variations $\boldsymbol {p}_i = \boldsymbol {\mathsf {p}}_i$ and the errors from the properties extraction $\boldsymbol {\mathsf {e}}_i$ are independent random variables following normal distributions with zero mean, and the mode numbers $\mathsf {n}_i$ are independent random uniformly distributed integer variables:
Then we can build a probabilistic model for the extracted property values:
Let $\hat {\boldsymbol {\mathsf {y}}}_i=\mathbf {A}_i \boldsymbol {w} + \mathbf {S}_i \boldsymbol {\mathsf {p}}_i$ denote the clean data. Then this is the same type of problem as in the toy example in Section 4, but with multiple dimensions, mapping between $\boldsymbol {\mathsf {p}}$ and $\hat {\boldsymbol {\mathsf {y}}}$, and an extra discrete random variable $\mathsf {n}_i$. By using the EM algorithm, we can specify the iterative steps as:
- • E-step: from current guess $(\boldsymbol {w}, \mathbf {\Lambda })$, find the posterior $p_{\hat {\boldsymbol {\mathsf {y}}}_i|\boldsymbol {\mathsf {y}}_i}$:$$p_{\hat{\boldsymbol{\mathsf{y}}}_i|\boldsymbol{\mathsf{y}}_i}(\hat{\boldsymbol{y}}_i|\boldsymbol{y}_i) = \sum_{k\in\mathbb{Z}}\mathcal{N}_{z}(k; \nu_i, \rho_i^{{-}1})\mathcal{N}(\hat{\boldsymbol{y}}_i; \boldsymbol{\mu}_i - k\boldsymbol{\delta}_i, \mathbf{R}_i^{{-}1}),$$where$$\begin{array}{c} \mathbf{R}_i = \mathbf{S}_i^{+,\mathrm{T}}\mathbf{\Lambda} \mathbf{S}_i^+{+} \mathbf{\Lambda}_i, \quad \boldsymbol{\mu}_i = \mathbf{R}_i^{{-}1}\left(\mathbf{S}_i^{+,\mathrm{T}}\mathbf{\Lambda} \mathbf{S}_i^+ \mathbf{A}_i\boldsymbol{w} + \mathbf{\Lambda}_i \boldsymbol{y}_i\right), \quad \boldsymbol{\delta}_i = \mathbf{R}_i^{{-}1}\mathbf{\Lambda}_i\boldsymbol{d}_i, \\ \rho_i = \boldsymbol{d}_i^\mathrm{T}\mathbf{\Lambda}_i\boldsymbol{d}_i - \boldsymbol{\delta}_i^\mathrm{T}\mathbf{R}_i\boldsymbol{\delta}_i, \quad \nu_i = \rho_i^{{-}1}\left(\boldsymbol{d}_i^\mathrm{T}\mathbf{\Lambda}_i \boldsymbol{y}_i - \boldsymbol{\delta}_i^\mathrm{T}\mathbf{R}_i\boldsymbol{\mu}_i\right). \end{array}$$
- • M-step: update the guess to $(\tilde {\boldsymbol {w}}, \tilde {\mathbf {\Lambda }})$ based on the posterior (25) (here we actually perform the M-step using the expectation conditional maximization (ECM) algorithm [13], which does not require the exact optimal solution in the M-step):$$\tilde{\mathbf{\Lambda}} = \left[\frac1N \sum_{i=1}^N \mathbf{S}_i^+(\mathbf{V}_i + \boldsymbol{r}_i\boldsymbol{r}_i^\mathrm{T})\mathbf{S}_i^{+,\mathrm{T}}\right]^{{-}1}, \quad \tilde{\boldsymbol{w}} = \left(\sum_{i'=1}^N \mathbf{A}_{i'}^\mathrm{T}\mathbf{S}_{i'}^{+,\mathrm{T}}\mathbf{\Lambda} \mathbf{S}_{i'}^+ \mathbf{A}_{i'}\right)^{{-}1}\sum_{i=1}^N \mathbf{A}_i^\mathrm{T}\mathbf{S}_i^{+,\mathrm{T}}\mathbf{\Lambda} \mathbf{S}_i^+ \boldsymbol{m}_i,$$where$$\boldsymbol{m}_i = \boldsymbol{\mu}_i - \mathbb{E}_{z}[\nu_i,\rho_i^{{-}1}]\boldsymbol{\delta}_i, \quad \mathbf{V}_i = \mathbf{R}_i^{{-}1} + \mathrm{var}_{z}[\nu_i, \rho_i^{{-}1}]\boldsymbol{\delta}_i\boldsymbol{\delta}_i^\mathrm{T}, \quad \boldsymbol{r}_i = \boldsymbol{m}_i - \mathbf{A}_i\boldsymbol{w}.$$
5.1 Discrete Gaussian distribution
For the discrete normal distribution as shown in (27), the expectation and variance are
Therefore, we only need to focus on the range of $\nu \in [0, 0.5]$. On the other hand, we have
In this way, we can approximate the values using (30) when $\rho$ is large (i.e., $\rho > 80$) or small (i.e., $\rho < 1$).
For the values in the middle range, i.e., $\rho \in [1, 80],\,\nu \in [0, 0.5]$, we can use the finite series approximation of (28) to pre-calculate the values for some parameters of $\rho, \nu$ in a dense grid, then use the linear interpolation for any arbitrary parameters to speed up the calculation time in the EM algorithm.
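A sketch of this tabulation scheme, using truncated-series reference values, a dense $(\rho, \nu)$ grid, and bilinear interpolation (the grid sizes and truncation span here are illustrative, not the values used in the released code [8]); the variance can be tabulated the same way:

```python
import numpy as np

def dgauss_mean(nu, rho, k_span=30):
    """Mean of the discrete Gaussian p(k) proportional to
    exp(-rho (k - nu)^2 / 2), k integer, by truncated series summation."""
    k = np.arange(round(nu) - k_span, round(nu) + k_span + 1)
    p = np.exp(-0.5 * rho * (k - nu) ** 2)
    p /= p.sum()
    return np.sum(k * p)

# Pre-tabulate on a dense grid over the middle range rho in [1, 80], nu in [0, 0.5]
rhos = np.linspace(1.0, 80.0, 200)
nus = np.linspace(0.0, 0.5, 100)
table = np.array([[dgauss_mean(nu, rho) for nu in nus] for rho in rhos])

def table_mean(rho, nu):
    """Bilinear interpolation in the pre-computed table."""
    i = int(np.clip(np.searchsorted(rhos, rho) - 1, 0, len(rhos) - 2))
    j = int(np.clip(np.searchsorted(nus, nu) - 1, 0, len(nus) - 2))
    tr = (rho - rhos[i]) / (rhos[i + 1] - rhos[i])
    tn = (nu - nus[j]) / (nus[j + 1] - nus[j])
    return ((1 - tr) * (1 - tn) * table[i, j] + tr * (1 - tn) * table[i + 1, j]
            + (1 - tr) * tn * table[i, j + 1] + tr * tn * table[i + 1, j + 1])
```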
5.2 Extraction results
For our RR design measurement data, the algorithm converges to distributions with standard deviations
within 600 iterations. In our case, the solution has relatively large $\rho _i$, hence we can roughly approximate the posterior (25) by a normal distribution: $p_{\hat {\boldsymbol {\mathsf {y}}}_i|\boldsymbol {\mathsf {y}}_i}(\hat {\boldsymbol {y}}_i|\boldsymbol {y}_i) \approx \mathcal {N}(\hat {\boldsymbol {y}}_i; \boldsymbol {m}_i, \mathbf {V}_i)$. Notice that $\boldsymbol {p}_i = \mathbf {S}_i^+(\hat {\boldsymbol {y}}_i - \mathbf {A}_i \boldsymbol {w})$; we thus have a Gaussian posterior for each $\boldsymbol {p}_i$ as well.

We can also generate spatial maps of the variations based on the estimated posterior (32). Figure 5(b) shows an example of the spatial map of the thickness $T$ variation, which is interpolated and smoothed using a Gaussian process model. It should be noted that the values at the edge of the map are results of extrapolation, and can be inaccurate. Since we do not separate different levels of the spatial variations as in Ref. [7], this map also includes random residual errors and could present short spatial correlations that are nonphysical for intra-die systematic variation. Some of the correlation and extreme values in the extraction results in Fig. 5 can also be the result of ignoring the material index variation; these estimates could potentially be improved by including $\Delta n_i$ into the model.
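The spatial smoothing can be sketched with a plain RBF-kernel Gaussian-process (kriging) mean predictor; the kernel choice, length scale, and noise level below are illustrative assumptions, not the hyperparameters behind Fig. 5(b):

```python
import numpy as np

def gp_map(X, y, Xq, length=1.0, noise=1e-2):
    """Posterior mean of a zero-mean GP with squared-exponential kernel:
    smooths scattered per-device variation values X (N x 2 positions),
    y (N values) onto query locations Xq."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-0.5 * d2 / length**2)
    K = k(X, X) + noise * np.eye(len(X))   # noise term regularizes the solve
    return k(Xq, X) @ np.linalg.solve(K, y)
```

Consistent with the caveat above, predictions outside the convex hull of the measured devices revert toward the prior mean and should be treated as extrapolation.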
Finally, the standard deviations of the gradient parameters are 0.36 nm/$\mathrm {\mu }$m, 0.066 nm/$\mathrm {\mu }$m, and 0.14 nm/$\mathrm {\mu }$m for $g_w$, $g_T$, and $g_h$, respectively, which are slightly larger than expected; these numbers suggest a bigger difference between the ring and the coupler than the standard deviations shown in (31). This can be due to the relatively large error in extracting the coupling coefficients $t$, on which these gradient parameters solely depend. It might also suggest that the narrow gap area of the coupler is suffering from larger variation in the etching process.
6. Conclusions
We present a Bayesian approach to infer geometric variation distributions from characterization measurements of spatially non-replicated silicon photonic devices, which incorporates error information into the inference model to reduce the effect of measurement noise. We also provide a reliable approach to estimate the variance of the extracted property errors by combining a nonlinear fitting residual approach and a wavelength dependency approach, serving as an important piece that enables the Bayesian inference.
The results from silicon nitride RR characterization measurements are reported with standard deviations of 28 nm, 0.8 nm, and 3.8 nm for $w$, $T$, and $h$, respectively. The method shows the potential of using such design characterization measurements rather than replicated devices as variation test structures, and thus provides an efficient and accessible approach to study process variations in silicon photonics.
Future work could explore the possibility of further including the simulation error of the sensitivity matrix $\mathbf {S}_i$ and prior knowledge into the Bayesian model. As another avenue, both the auxiliary parameters and our generated spatial maps of variations suggest the possibility of an advanced model that incorporates the spatial information as well. Such future directions might help in achieving reliable variation inferences with even less error.
Disclosures
The authors declare no conflicts of interest.
Data availability
Data underlying the results presented in this paper are available in Ref. [8].
References
1. J. E. Bowers, T. Komljenovic, M. Davenport, J. Hulme, A. Y. Liu, C. T. Santis, A. Spott, S. Srinivasan, E. J. Stanton, and C. Zhang, “Recent advances in silicon photonic integrated circuits,” in Next-Generation Optical Communication: Components, Sub-Systems, and Systems V, vol. 9774 (SPIE, 2016), p. 977402.
2. X. Wang, W. Shi, H. Yun, S. Grist, N. A. Jaeger, and L. Chrostowski, “Narrow-band waveguide Bragg gratings on SOI wafers with CMOS-compatible fabrication process,” Opt. Express 20(14), 15547–15558 (2012). [CrossRef]
3. C. Sun, M. T. Wade, Y. Lee, et al., “Single-chip microprocessor that communicates directly using light,” Nature 528(7583), 534–538 (2015). [CrossRef]
4. W. A. Zortman, M. R. Watts, and D. C. Trotter, “Determination of wafer and process induced resonant frequency variation in silicon microdisk-resonators,” in Adv. in Optical Sciences Cong., (OSA, 2009), p. IMC5.
5. Z. Lu, J. Jhoja, J. Klein, X. Wang, A. Liu, J. Flueckiger, J. Pond, and L. Chrostowski, “Performance prediction for silicon photonics integrated circuits with layout-dependent correlated manufacturing variability,” Opt. Express 25(9), 9712–9733 (2017). [CrossRef]
6. Y. Xing, J. Dong, S. Dwivedi, U. Khan, and W. Bogaerts, “Accurate extraction of fabricated geometry using optical measurement,” Photonics Res. 6(11), 1008–1020 (2018). [CrossRef]
7. Y. Xing, J. Dong, U. Khan, and W. Bogaerts, “Capturing the effects of spatial process variations in silicon photonic circuits,” ACS Photonics 10, 928 (2022). https://doi.org/10.1021/acsphotonics.2c01194.
8. Z. Zhang, “zhx-zhang/Characterization_Inference,” Zenodo, 2023, https://doi.org/10.5281/zenodo.7972288.
9. Q. Wilmart, H. El Dirani, N. Tyler, D. Fowler, S. Malhouitre, S. Garcia, M. Casale, S. Kerdiles, K. Hassan, C. Monat, X. Letartre, A. Kamel, M. Pu, K. Yvind, L. K. Oxenløwe, W. Rabaud, C. Sciancalepore, B. Szelag, and S. Olivier, “A versatile silicon-silicon nitride photonics platform for enhanced functionalities and applications,” Appl. Sci. 9(2), 255 (2019). [CrossRef]
10. W. McKinnon, D.-X. Xu, C. Storey, E. Post, A. Densmore, A. Delâge, P. Waldron, J. Schmid, and S. Janz, “Extracting coupling and loss coefficients from a ring resonator,” Opt. Express 17(21), 18971–18982 (2009). [CrossRef]
11. D. M. Bates and D. G. Watts, Nonlinear Regression Analysis and its Applications (Wiley, 1988), chap. 2, pp. 32–66.
12. F. Dellaert, “The expectation maximization algorithm,” Tech. Rep. GIT-GVU-02–20, Georgia Institute of Technology (2002).
13. X.-L. Meng and D. B. Rubin, “Maximum likelihood estimation via the ECM algorithm: A general framework,” Biometrika 80(2), 267–278 (1993). [CrossRef]