Real-time FPGA prototyping of a 15GBaud SP-16QAM coherent optical receiver with optimal interpolating for clock recovery and equalization

Jingwei Song; Yan Li; Jifang Qiu; Yong Zuo; Wei Li; Xiaobin Hong; Hongxiang Guo; Jian Wu

doi:10.1364/OE.463512

1. Introduction

With the explosion of data traffic in optical networks, optical transceivers with compact size, low power consumption, large capacity and high flexibility are highly desirable [1–4]. One critical technique that enables advanced optical transceivers is digital signal processing (DSP) [5,6]. To evaluate the performance of the application-specific integrated circuit (ASIC) in digital coherent optics (DCO) transceivers, a field programmable gate array (FPGA) is commonly used as the prototype [7]. Therefore, to study the feasibility of algorithms in real-time systems, it is essential to implement the DSP modules in an FPGA.

In the DSP chain of a coherent optical receiver, one crucial block is clock recovery (CR). Due to the clock asynchrony between the transmitter (Tx) side and the receiver (Rx) side in real transmission [8], the receiver needs to adjust the sampling instant of the analog-to-digital converter (ADC) or recover the signals at the optimal instant by interpolation [9]. Another important block is channel equalization. In a high-speed data transmission system, the signal bandwidth is high and can easily be limited by components with low-pass characteristics in the link, leading to inter-symbol interference (ISI). Thus, a channel equalizer is used to suppress ISI and improve signal quality.

At present, in both coherent and IM/DD systems, the most frequently used DSP flow connects the CR module and the channel equalization module in series and lets them work independently [10–13]. According to the reported work, feedback or feedforward CR has been applied in various transmission scenarios [14,15]. Several types of timing error detectors (TED) have been proposed and investigated [16–19]. Linear, polynomial, spline or sinc functions are commonly used as the interpolator [20,21,22]. As for the equalizer, both static and adaptive types have been shown to improve signal quality [23,24]. In addition, previous research has shown that clock recovery and equalization can work jointly to provide better performance in DP-QPSK and PAM4 systems via offline DSP [25,26].

Reviewing the above-mentioned works, we found that the traditional scheme can consume considerable computation resources, since interpolation and equalization are realized with two separate convolution operations. In contrast, the optimal interpolation (OI) technique described in [27] shows great potential to lower the complexity. OI aims to design an interpolator that minimizes the mean-square error of the output signals. This suggests that if OI is applied in CR, the need for an equalizer will be reduced. However, the performance of OI-based CR has not yet been evaluated, and its advantage over the traditional schemes has not been discussed. Therefore, in this paper, we investigate OI-based clock recovery (OI-CR) in more depth: 1) we compare it with traditional schemes, in which CR is followed by an equalizer, in a real 15GBaud 16QAM transmission; 2) we implement the 15GBaud single-polarization (SP-) 16QAM coherent receiver DSP in a single FPGA chip and carry out a real-time back-to-back transmission to test its sensitivity. Since our target is to develop a real-time receiver for systems with limited received power and complexity, such as free-space optical communication systems, where chromatic dispersion and fiber nonlinearity can be ignored, we mainly focus on the receiver’s sensitivity in the experiment. The results show that, compared to the traditional scheme with CR followed by an equalizer, the OI-CR lowers the logic utilization of the FPGA: the numbers of LUT, CARRY8 and DSP48 are reduced by 35%, 50% and 11%, respectively. With a received optical power (ROP) of -33dBm, the bit-error-rate (BER) of OI-CR is 6.9×10⁻⁴, while the BER of the traditional scheme with a 4-tap interpolator and 7-tap equalizer is 1.5×10⁻³. The OI-CR improves the sensitivity by ∼2dB.

The remainder of the paper is organized as follows. First, we present the traditional scheme of CR and equalization. Second, we introduce the OI-CR scheme and explain the design flow of the OI filter. Third, the complexity and performance of several CR and equalization schemes are compared using offline DSP. Fourth, the FPGA implementation of the 60Gbps coherent receiver DSP using different schemes is introduced and the real-time back-to-back transmission experiment is carried out.

2. Principle of the clock recovery and equalization schemes

In the rest of this paper, the DSP scheme with an N-tap interpolator followed by an M-tap adaptive equalizer (AEQ) is named “Traditional N + M”. For example, a Lagrange cubic interpolator followed by a 7-tap AEQ is named “Traditional Lagrange 4 + 7”, and an 11-tap sinc interpolator without AEQ is named “Traditional sinc 11 + 0”. An OI-CR with an N-tap OI filter is named “N-tap OI-CR”.

2.1 Traditional clock recovery and equalizing scheme

The “Traditional N + M” scheme is illustrated in Fig. 1. It is composed of clock recovery with traditional interpolators based on linear, Lagrange polynomial and sinc functions, and an AEQ based on multi-modulus algorithm (MMA). In both “Traditional N + M” scheme and “N-tap OI-CR” the Gardner algorithm is used as the timing error detector (TED), which is implemented in a feedback structure [16].

Fig. 1. Conventional clock recovery algorithm followed by an MMA adaptive equalizer. S(t): the input signal; T_s: the sampling period; MMA: multi-modulus algorithm.

The interpolating process can be described as a convolution operation [9]

(1)$$s^{\prime}(n) = \sum\limits_{k = 1}^{{N_h}} {h(k)s(n - k)} $$

where s′(n) is the recovered signal, h(k) is the impulse response of the interpolator, s(n) is the signal sampled by ADC and N_h is the number of taps of the interpolating FIR filter. When the interpolation is realized by Lagrange polynomials or sinc functions, h(k) is obtained by

(2)$${h_L}(k) = \prod\limits_{i \ne k}^{1 \le i \le N} {\frac{{\Delta T - i{T_s} + \lceil{{N / 2}} \rceil {T_s}}}{{(k - i){T_s}}}}$$

(3)$${h_s}(k) = \textrm{sinc(}k - \lceil{{N / 2}} \rceil + \Delta T/{T_s}\textrm{)}$$

where h_L(k) and h_s(k) are the impulse responses of the Lagrange polynomial interpolator and of the sinc interpolator truncated by a rectangular window, respectively [9]. N is the number of taps of the FIR filter, ΔT is the sampling phase error, T_s is the sampling period and $\lceil \cdot \rceil$ is the ceiling function. The output of the interpolator is used for Gardner timing error detection, which is described as

(4)$$TE(n) = [s^{\prime}(2n + 1) - s^{\prime}(2n - 1)] \cdot s^{\prime}(2n)$$

where TE(n) is the n-th timing error estimated for symbol s′(2n). Then, the estimated timing error will be fed into the phase error calculation module, which updates the sampling phase error by

(5)$$PE(k) = PE(k - 1) + \frac{\mu }{N}\sum\limits_{i = 1}^N {TE(i)}$$

where PE(k) is the k-th sampling phase error, µ is the step size and N is the number of timing errors used for average filtering, which filters out the out-of-band noise of the estimated timing errors. This method is very efficient in FPGA or ASIC implementations, where the processing units are organized in parallel. In a parallel structure, numerous timing errors based on different symbols are calculated in a single clock period. They are summed by parallel adders and the division is carried out by a bit shifter. The following AEQ is realized based on the T-spaced MMA:

(6)$${e_n} = |\mathrm{Re} [x(n)]{|^2} + |{\mathop{\rm Im}\nolimits} [x(n)]{|^2} - R_{ref}^2$$

(7)$${h_{i,k}} = {h_{i - 1,k}} + \mu \sum\limits_{n = 1}^{{N_p}} {{e_n} \times (\mathrm{Re} [y(n)] \times \mathrm{Re} [x(n - k)] + {\mathop{\rm Im}\nolimits} [y(n)] \times {\mathop{\rm Im}\nolimits} [x(n - k)])}$$

where e_n is the cost function of MMA, x(n) is the n-th input of the equalizer, y(n) is the corresponding equalizer output, R_ref is the radius of the reference symbol, h_i,k is the k-th filter coefficient in the i-th iteration, N_p is the number of parallel lanes and µ is the step size.
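The interpolation and timing-recovery steps of Eqs. (2)–(5) can be illustrated with a short NumPy sketch. It is not the FPGA implementation: the function names are hypothetical, ΔT is expressed in units of T_s (`mu`), the sinc in Eq. (3) is assumed to be the normalized sinc, and Eq. (4) is applied to a real-valued sequence (for a complex signal it would be applied to the real and imaginary parts separately, which is an assumption here).

```python
import math
import numpy as np


def lagrange_taps(n_taps, mu):
    """Eq. (2): taps h_L(1..N) of the Lagrange interpolator, mu = dT / T_s."""
    c = math.ceil(n_taps / 2)
    taps = np.ones(n_taps)
    for k in range(1, n_taps + 1):
        for i in range(1, n_taps + 1):
            if i != k:
                taps[k - 1] *= (mu - i + c) / (k - i)
    return taps


def sinc_taps(n_taps, mu):
    """Eq. (3): rectangular-window sinc taps (normalized sinc is assumed)."""
    c = math.ceil(n_taps / 2)
    k = np.arange(1, n_taps + 1)
    return np.sinc(k - c + mu)


def gardner_timing_errors(s):
    """Eq. (4) on real-valued 2-SPS samples s'(0), s'(1), ..."""
    s = np.asarray(s, dtype=float)
    n_max = (len(s) - 2) // 2
    mid = 2 * np.arange(1, n_max + 1)  # indices 2n
    return (s[mid + 1] - s[mid - 1]) * s[mid]


def update_phase_error(pe_prev, timing_errors, step):
    """Eq. (5): add the step-scaled average of a block of timing errors."""
    return pe_prev + step * np.mean(timing_errors)
```

For example, `lagrange_taps(2, 0.25)` gives the linear-interpolation weights 0.75 and 0.25. When the averaging length is a power of two, the division inside `np.mean` corresponds to the bit shift mentioned in the text.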

2.2 Clock recovery and equalizing scheme using optimal interpolation

The block diagram of OI-CR is illustrated in Fig. 2(a). Compared with the traditional scheme in Fig. 1, the OI-CR scheme uses only one convolution block, and the phase error estimated by the Gardner TED is used as the index of a look-up-table that stores the coefficients of a set of OI filters obtained by previous training. The training scheme is explained in Fig. 2(b).

Fig. 2. (a) Clock recovery scheme using optimal interpolation. (b) The detailed view of the training block. (c) The magnitude frequency response (MFR) of the 2047-tap sinc interpolator under different sampling phase errors. S(t): the input signal; T_s: the sampling period; LUT: look-up-table. ${h_{opt,i}}$: the impulse response of the OI filter for the sampling phase error ${T_i}$; n: the number of phase errors for training.

As explained in Fig. 2(b), the main goal of training is to find the optimal coefficients under different sampling phase errors. First, the sampled signals, which are affected by clock mismatch and ISI, are sent to a sampling phase error elimination block to ensure that the signals used for training are free of sampling phase error. To keep the training results accurate, this block should remove only the sampling phase error, while the spectrum of the phase-error-free signals should stay the same as that of the sampled signals. Therefore, the phase error elimination is implemented by another CR block that uses serial processing with a Gardner TED and a 2047-tap sinc interpolator, whose magnitude frequency response (MFR) remains nearly constant under different phase errors (see Fig. 2(c)). After the phase error elimination, the signals are delayed by n different delays ranging from -0.5 to 0.5 times the sampling period to emulate the influence of sampling phase errors. The n sets of delayed signals are sent to n T/2-spaced AEQs. These AEQs are based on the MMA described in Eq. (6) and Eq. (7). To obtain the coefficients of the OI filters, signals with a length of 4.19×10⁶ are collected at an ROP of -19dBm and sent to the training block. After the equalizers converge, their coefficients are used as the OI filters and stored in an LUT. Each OI filter adjusts the sampled signals to their optimal sampling instant and linearly equalizes them under a certain phase error, and it is used in both offline and real-time DSP. It should be added that the MMA can mis-converge because it is fully blind [28]. During training, the 16-QAM signals fed into the MMA-based equalizers were collected at an ROP of -19dBm, where the signal quality is good enough that no mis-convergence occurred.
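The delay step of the training flow can be sketched as follows. The text does not state how the n delays are applied or whether both end points of the range are included, so the FFT phase-ramp delay (which is circular), the half-open grid [-0.5, 0.5) and all function names are assumptions made only for illustration.

```python
import numpy as np


def training_delays(n):
    """n delays in units of T_s covering [-0.5, 0.5) (endpoint choice assumed)."""
    return np.linspace(-0.5, 0.5, n, endpoint=False)


def fractional_delay(x, delay):
    """Circularly delay x by `delay` samples with an FFT phase ramp."""
    x = np.asarray(x, dtype=complex)
    freqs = np.fft.fftfreq(len(x))  # normalized frequency, cycles per sample
    return np.fft.ifft(np.fft.fft(x) * np.exp(-2j * np.pi * freqs * delay))


def delayed_training_sets(x, n):
    """Return the delay grid and one delayed copy of x per delay."""
    delays = training_delays(n)
    return delays, [fractional_delay(x, d) for d in delays]
```

Each delayed copy would then be fed to its own T/2-spaced AEQ, and the converged taps stored at the LUT address that corresponds to the delay.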

2.3 Complexity analysis

In practical implementation, the complexity should be taken into consideration. In this section, only the computation resources used in the FPGA chip are considered. The complexity of the training block in the OI-CR scheme is therefore excluded because 1) the training block works offline on the computer; 2) in a fixed channel with constant filtering characteristics, the training only needs to be performed once, which can be regarded as part of the initialization when the receiver is deployed in a transmission link for the first time. The complexity of the “Traditional N + M” structure is expressed as the number of real-valued multipliers:

(8)$${N_{mult}} = {N_{interp}} + {N_{TED}} + {N_{eq}} + {N_{upd}}$$

Here N_mult, N_interp, N_TED, N_eq and N_upd denote the number of multipliers used by the whole scheme, the convolution in interpolation, the Gardner TED, the convolution in equalization, and the update module based on gradient descent for the AEQ’s coefficients, respectively. Given the values of N and M, they are obtained as listed in Table 1.

Here, N_p is the number of parallel lanes (after down-sampling to 1SPS) in the FPGA implementation. The factor of 2 in N_interp, N_TED and N_eq arises because the real and imaginary parts are processed separately using real-valued multipliers, and another factor of 2 in N_interp is introduced because the signals fed into the interpolator are sampled at 2SPS. N_upd is obtained based on the MMA algorithm, where the coefficients of the FIR filter are updated using Eq. (6) and Eq. (7). The first term of N_upd denotes the complexity of calculating the modulus of the signal in Eq. (6), and the second term gives the number of multipliers for updating the FIR filter in Eq. (7).
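The entries of Table 1 are not reproduced in this text, so the per-module expressions in the sketch below are an inference rather than a quotation. They are chosen to match the factors described above (2 for real and imaginary parts, a further 2 for 2SPS in the interpolator, a modulus term plus a per-tap update term in N_upd) and to reproduce the multiplier totals reported in Section 3.3 for N_p = 128. The function name and the dictionary keys are hypothetical.

```python
def multiplier_count(n_interp_taps, n_eq_taps, n_lanes=128):
    """Estimate real-valued multipliers per Eq. (8); formulas are inferred."""
    counts = {
        # real/imag (x2) and 2 samples per symbol (x2)
        "interp": 2 * 2 * n_interp_taps * n_lanes,
        # one Gardner product each for the real and imaginary parts
        "ted": 2 * n_lanes,
        "eq": 0,
        "upd": 0,
    }
    if n_eq_taps > 0:
        # real/imag (x2) for the equalizer convolution
        counts["eq"] = 2 * n_eq_taps * n_lanes
        # Eq. (6): two squares per lane; Eq. (7): three products per tap and lane
        counts["upd"] = 2 * n_lanes + 3 * n_eq_taps * n_lanes
    counts["total"] = sum(counts.values())
    return counts
```

With these assumed expressions, `multiplier_count(4, 7)["total"]` is 7040 and `multiplier_count(11, 0)["total"]` is 5888, which agrees with the figures quoted for “traditional Lagrange 4 + 7” and “11-tap OI-CR”.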

3. Experiment results based on offline digital signal processing

3.1 Experimental setup

In this paper, all the presented results were obtained with the back-to-back transmission experiment setup shown in Fig. 3. A commercial 4-port laser source with a nominal linewidth of 100 kHz provided two individual optical carriers that were used as the transmitter carrier and the local oscillator (LO) carrier. At the Tx side, a PAM4 PPG provided two electrical 15GBaud PAM4 signals to drive the IQ modulator (with 3 dB bandwidth of 22 GHz) and generate 15GBaud 16QAM signals. The transmitted optical signals were attenuated and sent to an EDFA that boosted the signals to 8dBm. Meanwhile, an OBPF was used to filter out the out-of-band noise. Then, the signals were input into an ICR (with 3 dB bandwidth of 22 GHz and an integrated transimpedance amplifier) and converted to electrical signals. Finally, the electrical signals were sampled by two 6-bit ADCs (the effective number of bits at 14 GHz is 4.3) driven by an individual clock. The clock signal was generated by an external RF generator (with phase noise of -92dBc @ 20 GHz @ 10 kHz offset). The clock frequency was 15 GHz so that the ADCs sampled the received signals at 30GSa/s, i.e., two samples per symbol (SPS), and transferred them to an FPGA. The data were either processed directly by the real-time DSP algorithm in the FPGA or transmitted to a computer for offline DSP.

Fig. 3. Back-to-back transmission experiment setup. PC: polarization controller. PPG: pulse pattern generator. VOA: variable optical attenuator. EDFA: Erbium Doped Fiber Amplifier. OBPF: optical bandpass filter. ICR: integrated coherent receiver. LO: local oscillator. CR: clock recovery. AEQ: adaptive equalizer (optional). CFR: carrier frequency recovery. CPR: carrier phase recovery. BERT: bit error rate test. ILA: integrated logic analyzer used to monitor and collect the internal signals.

The right side of Fig. 3 shows the offline DSP flow. The signals sampled by the two ADCs are transferred from the FPGA to the computer via the ILA IP core for offline DSP, where sampled signals with a length of 4.19×10⁶ are processed by one of two schemes: the traditional scheme uses Gardner CR with a Lagrange polynomial or sinc interpolator followed by an adaptive equalizer (AEQ); the other scheme uses OI in Gardner CR and disables the AEQ. The AEQ is converted into a parallel structure, where the number of quantization bits is 8, the number of parallel lanes is 128 and the feedback delay is set to 16 clock periods. After that, the carrier frequency and phase are recovered and the bit error rate (BER) is measured. The real-time DSP will be introduced in section 4.1.

3.2 Frequency domain analysis of the “Traditional N + M” and “N-tap OI-CR”

Equation (2) and Eq. (3) show that the impulse response of the above-mentioned interpolation methods varies with the estimated sampling phase error ΔT. We present the magnitude frequency response (MFR) of several traditional interpolators and the OI filter with different phase errors in Fig. 4 (see the solid lines). To analyze the T-spaced AEQ and the interpolator with time spacing of T/2 as a whole, the overall MFR was obtained from the convolution of the interpolator’s taps and the up-sampled AEQ’s taps. The up-sampling changes the time spacing of the equalizer from T to T/2 by inserting one zero between two adjacent taps.
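The construction of the overall MFR described above can be sketched in a few lines of NumPy. The function names, the FFT length and the restriction to real-valued taps (as implied by the real-valued update in Eq. (7)) are assumptions for illustration.

```python
import numpy as np


def upsample_taps(taps, factor=2):
    """Insert factor-1 zeros between adjacent taps (T-spaced -> T/2-spaced)."""
    taps = np.asarray(taps, dtype=float)
    out = np.zeros((len(taps) - 1) * factor + 1)
    out[::factor] = taps
    return out


def overall_mfr(interp_taps, aeq_taps, n_fft=512):
    """Combined T/2-spaced taps and their magnitude response.

    The n_fft // 2 + 1 returned bins span 0 to R_s, i.e. half of the
    2-SPS sampling rate.
    """
    combined = np.convolve(interp_taps, upsample_taps(aeq_taps))
    return combined, np.abs(np.fft.rfft(combined, n_fft))
```

Because the FFT length exceeds the length of the combined filter, the result equals the product of the individual magnitude responses of the interpolator and of the zero-stuffed equalizer.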

Fig. 4. Magnitude frequency responses under different sampling phase errors of: (a) 2-tap Lagrange interpolator, i.e., linear interpolation; (b) 4-tap Lagrange interpolation, i.e., cubic interpolation; (c) 11-tap sinc interpolation truncated by a Kaiser window. (d) 11-tap optimal interpolation filter obtained by offline training. T_s: the sampling period.

It can be observed in Fig. 4(a)-(c) that, for Lagrange polynomial interpolation, the passband width of the interpolating filter is reduced as the phase error increases from 0 to 0.5 T_s. For the 11-tap sinc interpolator truncated by a Kaiser window with a shape factor of 2.5, when the phase error changes from 0 to 0.5 T_s, the ripple is slightly enhanced and the bandwidth is also reduced. Furthermore, the MFR of traditional interpolators is independent of the channel characteristics. As a result, an extra equalizer is needed to suppress the ISI caused by the channel and the interpolators. The dashed lines in Fig. 4 present the overall MFR of the interpolator and the 7-tap T-spaced AEQ. In contrast, the MFR of the OI filter in Fig. 4(d) is shaped to adapt to the transmission channel, so that the OI filter can perform channel equalization during the interpolation process. However, limited by the number of taps of the OI filter, the MFR cannot remain completely constant as the sampling phase error changes. Although these ripples can be mitigated by using window functions, the window functions can undesirably change the MFR. Besides, the BER performance with the OI filter remains unchanged under different sampling phase errors. Thus, in this paper, we let the training block determine the coefficients by itself. When the frequency is higher than 0.5 times the baud rate Rs (>π/2), the MFRs of the traditional schemes differ under different sampling phase errors, and the shapes of the overall MFR in the traditional schemes differ from the OI’s MFR. The reason is that the T-spaced equalizer can only cover the frequency range from 0 to Rs/2, while the OI filter operates with a tap spacing of T/2 and covers the frequency range from 0 to Rs [29,30].

It should be pointed out that, to analyze the performance of DSP schemes under different sampling phase errors, we have to use received signals with different sampling phase errors. The method of obtaining signals with arbitrary sampling phase errors has been introduced in section 2.2.

Figure 5 shows the spectrum of the sampled signals and the signals processed by different DSP schemes. To better observe the difference under different sampling phase errors, we only show the spectrum under sampling phase errors of 0 T_s and 0.5 T_s. The spike around zero frequency is induced by the DC offset of the IQ modulator’s bias voltage controller. When the AEQ is disabled, the high-frequency component is severely attenuated due to the low-pass characteristic of the devices in the link (the IQ modulator with 3 dB bandwidth of 22 GHz, the coherent receiver with 3 dB bandwidth of 22 GHz and the ADC with 3 dB bandwidth of 16 GHz), and it can differ under different phase errors when the 2-tap or 4-tap Lagrange interpolator is used. Also, the spectrum of the 11-tap OI-CR is quite different from the spectrum of the traditional schemes because they have distinct MFRs.

Fig. 5. The spectrum (plotted in linear scale) of the sampled signals and the signals processed by: (a) the traditional scheme with 2-tap Lagrange interpolation with or without AEQ; (b) the traditional scheme with 4-tap Lagrange interpolation with or without AEQ; (c) the traditional scheme with 11-tap sinc interpolation with or without AEQ; (d) the 11-tap OI-based scheme.

3.3 Experiment results based on offline DSP

As discussed in section 3.2, the overall performance of the Rx DSP can be influenced by the sampling phase error, as shown in Fig. 6. To obtain these results, the sampling phase error of the sampled signals is removed, and constant sampling phase errors are added to the signal. For the traditional scheme with the AEQ disabled, the performance is unstable under different phase errors using linear (Lagrange 2 + 0) or cubic (Lagrange 4 + 0) interpolators, since the MFRs of these two interpolators vary considerably with the phase error. Without AEQ, the BER floor is ∼2×10⁻³, limited by the channel-induced ISI. When AEQ is enabled, the BER of the linear and cubic interpolators drops below 1×10⁻⁴, while with the 11-tap OI filter, the BER can be as low as ∼3×10⁻⁵. In Fig. 6, we also evaluated the performance of the T/2-spaced AEQ in the “Trad. Lagrange 4 + 7/11” scheme. It turns out that the T/2-spaced AEQ degrades the performance compared to the T-spaced AEQ with the same number of taps.

Fig. 6. BER versus the sampling phase error with different schemes.

We also fed the sampled signals directly into different clock recovery and equalization schemes and tested their BER performance. The results in Fig. 7 show that, when the equalizer is disabled, the BER of pure Lagrange interpolation is always higher than ∼3×10⁻³ due to the ISI induced by the transmission channel. When the AEQ is enabled with different values of N + M, the lowest BER is obtained under N = 4 (5.0×10⁻⁴, 1.3×10⁻⁴ and 7.8×10⁻⁵ for N + M = 7, 11 and 15). As for the optimal interpolator, without the help of AEQ, the BER reaches 8.4×10⁻⁵ and 7.0×10⁻⁵ with the 11-tap and 17-tap optimal interpolator, respectively. The total number of multipliers is estimated based on Table 1 with N_p = 128. The “traditional Lagrange 4 + 7”, “traditional Lagrange 4 + 11”, “11-tap OI-CR” and “17-tap OI-CR” cost 7040, 9600, 5888 and 8960 multipliers, respectively. The results show that the optimal interpolator can provide similar or even better performance than the AEQ-enabled scheme while lowering the demand for multipliers. During offline processing, we found that placing an extra AEQ at the OI-CR’s output can further improve the performance. For example, a 7-tap AEQ can reduce the BER of the 11-tap OI-CR from 8.4×10⁻⁵ to 5.81×10⁻⁵. In this paper, an extra AEQ is not used to further improve the performance of OI-CR because 1) we aim to discuss the feasibility of replacing the traditional CR and AEQ with OI-CR; 2) an extra AEQ would increase the complexity of the real-time system.

Fig. 7. BER versus the number of taps of the interpolator.

Table 1. Number of real-valued multipliers used by different modules

4. Real-time back-to-back transmission experiment

4.1 FPGA implementation

In the previous part of this paper, the “traditional Lagrange N + M” scheme is used to compare with the OI-based clock recovery. According to Table 1, the number of multipliers used by 11-tap OI-CR is 83.6% of the multipliers used by traditional Lagrange 4 + 7. To accurately evaluate the performance and hardware complexity of these two schemes, we implemented and tested the “11-tap OI-CR” and the “traditional Lagrange 4 + 7” scheme in an FPGA chip.

4.1.1 Rx DSP with optimal interpolating clock recovery

To realize the Rx DSP within an FPGA, the signals need to be processed in parallel, since the clock frequency of an FPGA is usually no more than several hundred MHz while the symbol rate of the signals and the sampling rate of the ADC are several tens of GHz. Figure 8 shows the DSP structure of the FPGA-based receiver. The signals were sampled at 30GSa/s with 6-bit resolution and stored in the buffer. The buffer delay was controlled by the estimated phase error to choose the one of the two samples per symbol that was closer to the optimal sampling instant. The buffer also handles the frequency offset between the Tx clock and the Rx clock: the buffer delay is changed by 1 when the sampling phase error exceeds the compensation range. In addition, to avoid loss of data, the sampling clock is slightly faster than the Tx clock. The data sent out from the buffer were convolved with the optimal FIR filter, whose coefficients were previously obtained via offline training and stored in the LUT. The results of the convolution were delivered to the Gardner TED to update the estimated phase error. The phase error was used as the index of the LUT, whose depth was set to 32 in our FPGA implementation, i.e., the resolution of the TED was 1/32 of the sampling period. The output signals of the convolution were transferred to the carrier frequency recovery and carrier phase recovery modules and finally transmitted to a computer through the ILA IP core for further analysis.
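The LUT addressing described above can be sketched as follows. Only the depth of 32 comes from the text. The wrapping range [-0.5, 0.5), the mapping from phase error to address, the sign of the buffer shift and all function names are assumptions made for illustration.

```python
import math
import numpy as np

LUT_DEPTH = 32  # depth reported in the text


def wrap_phase_error(pe):
    """Wrap pe (in sampling periods) to [-0.5, 0.5); return (pe, buffer_shift)."""
    shift = math.floor(pe + 0.5)
    return pe - shift, shift


def lut_index(pe_wrapped, depth=LUT_DEPTH):
    """Quantize a wrapped phase error to a LUT address in 0..depth-1."""
    return min(int((pe_wrapped + 0.5) * depth), depth - 1)


def apply_oi_filter(samples, lut, pe):
    """Select the stored taps for pe and convolve them with the samples."""
    pe_wrapped, shift = wrap_phase_error(pe)
    taps = lut[lut_index(pe_wrapped, len(lut))]
    return np.convolve(samples, taps, mode="valid"), shift
```

A non-zero `shift` stands for the one-sample change of the buffer delay that the text describes when the phase error leaves the compensation range.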

Fig. 8. DSP architecture of the coherent optical receiver with the 11-tap optimal interpolating clock recovery. GT: Gigabit transceiver that performs deserialization. Buffer: Buffering the input signal and controls the delay of channels. CFR: carrier frequency recovery. CPR: carrier phase recovery. ILA: integrated logic analyzer used to monitor and collect the internal signals.

4.1.2 Rx DSP with Lagrange 3rd-order interpolating clock recovery and adaptive equalizer

The same hardware platform was used to realize the “traditional Lagrange 4 + 7” scheme. The FPGA-based DSP structure was modified as shown in Fig. 9, with the differences between the two schemes highlighted in red: the LUT used for storing the previously obtained FIR coefficients was replaced by a Lagrange 3rd-order polynomial generator, and the output of the convolution module was first sent to a 7-tap adaptive blind equalizer based on the MMA algorithm. The remaining modules were kept the same as in 4.1.1. In both the OI-based scheme and the traditional scheme, the bit widths of the multipliers used in interpolation are 6, 8 and 14 for the input signal, the FIR coefficients and the multiplier output, respectively. The bit widths of the multipliers used in equalization are 8, 8 and 16 for the input signals, the equalizer coefficients and the multiplier output, respectively. No accuracy is lost during the summation operation. Besides, in order to meet the timing requirements, the clock frequency is set to 117.1875MHz, which is obtained as 30 GHz/256, where 30 GHz is the sampling rate and 256 is the number of parallel lanes.
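The numbers in this paragraph can be checked with a few lines of arithmetic. The clock, lane and product-width values come from the text. The accumulator widths are not reported by the authors: they follow from the general rule that summing n full-precision products needs ceil(log2(n)) extra bits, and are given here only to illustrate what a lossless summation would require. All names are hypothetical.

```python
import math

SAMPLE_RATE_HZ = 30e9     # ADC sampling rate
PARALLEL_LANES = 256      # samples processed per FPGA clock
SAMPLES_PER_SYMBOL = 2
BITS_PER_SYMBOL = 4       # 16QAM

clock_hz = SAMPLE_RATE_HZ / PARALLEL_LANES
symbols_per_clock = PARALLEL_LANES // SAMPLES_PER_SYMBOL
bit_rate = SAMPLE_RATE_HZ / SAMPLES_PER_SYMBOL * BITS_PER_SYMBOL


def product_bits(a_bits, b_bits):
    """Width that always holds the product of two signed operands."""
    return a_bits + b_bits


def accumulator_bits(prod_bits, n_taps):
    """Width that holds the sum of n_taps products without overflow."""
    return prod_bits + math.ceil(math.log2(n_taps))
```

Here `clock_hz` evaluates to 117187500.0 (117.1875 MHz), `symbols_per_clock` to 128, which matches the number of lanes after down-sampling to 1SPS, and `bit_rate` to 60 Gbps. `product_bits(6, 8)` and `product_bits(8, 8)` give the 14-bit and 16-bit multiplier outputs quoted in the text.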

Fig. 9. DSP architecture of the coherent optical receiver with the “traditional Lagrange 4 + 7” scheme. Lagrange 3rd-order polynomial generator: generates the 4 coefficients of the Lagrange 3rd-order interpolator as described by Eq. (2). AEQ: 7-tap adaptive blind equalizer.

4.1.3 Resources utilization

The resource utilization of the two schemes is listed in Table 2.

Table 2. Resource utilization of the Rx DSP

Here, CR is the Gardner-based clock recovery, AEQ is the MMA-based adaptive equalizer, and Other refers to the carrier recovery, the ADC-FPGA interface and the modules used for debugging. In Table 2, LUT refers to the basic programmable element of the FPGA rather than the look-up-table used in the OI-CR scheme, and DSP48 is used as the dedicated real-valued multiplier. It should be noted that the number of used DSP48 differs from the estimated number of multipliers in Table 1, since some multipliers are implemented using other hardware resources, including LUTs, registers and CARRY8, and some of the multipliers in interpolation are optimized away because some multiplication results are not used in the following modules.

The difference in resource utilization between the two schemes shows that the “traditional Lagrange 4 + 7” scheme increases the demand for most resources. The exception is BRAM, which the OI-CR scheme needs in order to store the coefficients of the optimal interpolator.

4.2 Real-time experiment results

We carried out a real-time experiment in the back-to-back link illustrated in Fig. 3. The received signals were processed by the FPGA with the aforementioned two DSP schemes in real time.

By adjusting the VOA, we obtained the BER performance of these two schemes under different received optical power (ROP). The results are illustrated in Fig. 10. We collected tens to hundreds of sets of recovered constellations for each BER point during about 5 minutes of running on the FPGA. Each constellation is composed of 131072 16-QAM symbols. The final BER was obtained by averaging the BER of each set of recovered constellations.
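The averaging procedure can be written out as a small sketch. The set size of 131072 symbols is taken from the text. The error counts passed to the function are hypothetical inputs, not measured data.

```python
BITS_PER_SYMBOL = 4        # 16-QAM
SYMBOLS_PER_SET = 131072   # symbols per recovered constellation
BITS_PER_SET = BITS_PER_SYMBOL * SYMBOLS_PER_SET


def average_ber(bit_errors_per_set):
    """Average the per-set BER over all collected constellation sets."""
    per_set = [errors / BITS_PER_SET for errors in bit_errors_per_set]
    return sum(per_set) / len(per_set)
```

Since every set contains the same number of bits (524288), averaging the per-set BERs is equivalent to dividing the total number of bit errors by the total number of bits.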

Fig. 10. BER versus received optical power in real-time DSP on FPGA.

Figure 10 shows that both the 11-tap OI-CR and the traditional Lagrange 4 + 7 scheme could work properly on a single FPGA chip. For the 11-tap OI-CR, the lowest BER reached 2.4×10⁻⁴ at -21dBm, and when the ROP dropped to -40dBm, the BER was 2.2×10⁻². For the traditional Lagrange 4 + 7 scheme, the lowest BER reached 4.8×10⁻⁴ at -21dBm, and the BER was 1.3×10⁻² when the ROP dropped to -37dBm. When the ROP was set to -40dBm, the AEQ was severely influenced by the noise and the coefficients failed to converge to the right values, leading to the failure of the following DSP modules. Besides, the BER of these two schemes could not fall below 2×10⁻⁴ as the ROP increased, which was caused by the limited effective number of bits of the ADC, the feedback delay and the quantization noise introduced by fixed-point calculation.

In conclusion, with the 11-tap optimal interpolator, the CR module could achieve clock recovery and channel equalization within a single FIR filter. It outperformed the traditional Lagrange 4 + 7 scheme in a constant transmission channel and lowered the resource utilization in the FPGA implementation.

5. Conclusion

In this paper, for the first time, we investigated the performance of the optimal interpolator for clock recovery and equalization and implemented the OI-CR in an FPGA chip for real-time evaluation. We proposed a training scheme for the optimal interpolator. We performed both offline and FPGA-based real-time back-to-back transmission experiments and compared several different schemes. We built an FPGA prototype for 15GBd SP-16QAM to implement the Rx DSP using an 11-tap optimal interpolator and the Rx DSP with a 4-tap Lagrange cubic interpolator followed by a 7-tap adaptive equalizer. They were evaluated in a back-to-back transmission under different received optical powers. The results showed that the DSP scheme with the optimal interpolator works properly at ROPs above -40dBm. Limited by the computation resources available in this FPGA chip, we did not carry out the experiment over a long-haul fiber transmission system, which would be meaningful for a more practical OI-CR scheme. In conclusion, compared with the traditional scheme, the OI-CR provided better sensitivity and lowered the resource utilization in both the offline experiment and the FPGA implementation. The OI-CR could be a potential solution towards low-complexity real-time DSP systems.

Funding

National Key Research and Development Program of China (2019YFB1803601); National Natural Science Foundation of China (61675034, 61875019, 62021005).

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. Y. Loussouarn, E. Pincemin, M. Pan, G. Miller, A. Gibbemeyer, and B. Mikkelsen, “Multi-rate multi-format CFP/CFP2 digital coherent interfaces for data center interconnects, metro, and long-haul optical communications,” J. Lightwave Technol. 37(2), 538–547 (2019). [CrossRef]

2. Hideyuki Nasu, Kazuya Nagashima, Toshinori Uemura, Atsushi Izawa, and Yozo Ishikawa, “>1.3-Tb/s VCSEL-Based On-Board Parallel-Optical Transceiver Module for High-Density Optical Interconnects,” J. Lightwave Technol. 36(2), 159–167 (2018). [CrossRef]

3. J.-J. Jou, T.-T. Shih, and C.-L. Chiu, “400-Gb/s optical transmitter and receiver modules for on-board interconnects using polymer waveguide arrays,” OSA Continuum 1(2), 658–667 (2018).

4. F. E. Doany, B. G. Lee, D. M. Kuchta, A. V. Rylyakov, C. Baks, C. Jahnes, F. Libsch, and C. L. Schow, “Terabit/Sec VCSEL-Based 48-Channel Optical Module Based on Holey CMOS Transceiver IC,” J. Lightwave Technol. 31(4), 672–680 (2013).

5. H. Sun, M. Torbatian, M. Karimi, et al., “800G DSP ASIC Design Using Probabilistic Shaping and Digital Sub-Carrier Multiplexing,” J. Lightwave Technol. 38(17), 4744–4756 (2020).

6. C. Fougstedt, O. Gustafsson, C. Bae, E. Börjeson, and P. Larsson-Edefors, “ASIC design exploration for DSP and FEC of 400-Gbit/s coherent data-center interconnect receivers,” in Optical Fiber Communication Conference (OFC) (Optica Publishing Group, 2020), paper Th2A.38.

7. N. Kikuchi, T. Yano, and R. Hirai, “FPGA Prototyping of Single-Polarization 112-Gb/s Transceiver for Optical Multilevel Signaling with Intensity and Delay Detection,” J. Lightwave Technol. 34(8), 1762–1769 (2016).

8. K. Kikuchi, “Clock recovering characteristics of adaptive finite-impulse-response filters in digital coherent optical receivers,” Opt. Express 19(6), 5611 (2011).

9. R. W. Schafer and L. R. Rabiner, “A Digital Signal Processing Approach to Interpolation,” Proc. IEEE 61(6), 692–702 (1973).

10. M. Luo, J. Li, T. Zeng, L. Meng, L. Xue, L. Yi, and X. Li, “Real-time coherent UDWDM-PON with dual-polarization transceivers in a field trial,” J. Opt. Commun. Netw. 11(2), A166–A173 (2019).

11. N. Iiyama, M. Fujiwara, T. Kanai, H. Suzuki, J. Kani, and J. Terada, “Clock conversion for burst-mode digital coherent QPSK receivers in a PON upstream transmission with a 100-ppm clock mismatch,” Opt. Express 29(2), 1265 (2021).

12. J. Zhang, X. Xiao, J. Yu, J. S. Wey, X. Huang, and Z. Ma, “Real-time FPGA demonstration of PAM-4 burst-mode all-digital clock and data recovery for single wavelength 50G PON application,” in Optical Fiber Communication Conference (Optica Publishing Group, 2018), paper M1B.7.

13. D. Wang, Z. Su, H. Jiang, G. Liang, Q. Zhan, and Z. Li, “Modified square timing error detector with large chromatic dispersion tolerance for optical coherent receivers,” Opt. Express 29(13), 19759 (2021).

14. D. Schmidt, B. Lankl, J. K. Fischer, J. Hilt, and C. Schubert, “Real-time implementation of a parallelized feedforward timing recovery scheme for receivers in optical access networks,” in European Conference and Exhibition on Optical Communications (2012), pp. 1–3.

15. A. K. M. Delwar Hossain, M. Aurangozeb, and Hossain, “Burst mode optical receiver with 10 ns lock time based on concurrent DC offset and timing recovery technique,” J. Opt. Commun. Netw. 10(2), 65–78 (2018).

16. F. M. Gardner, “A BPSK/QPSK Timing-Error Detector for Sampled Receivers,” IEEE Trans. Commun. 34(5), 423–429 (1986).

17. K. Mueller and M. Müller, “Timing Recovery in Digital Synchronous Data Receivers,” IEEE Trans. Commun. 24(5), 516–531 (1976).

18. A. Josten, B. Baeuerle, E. Dornbierer, J. Boesser, D. Hillerkuss, and J. Leuthold, “Modified Godard timing recovery for non-integer oversampling receivers,” Appl. Sci. 7(7), 655 (2017).

19. L. Huang, D. Wang, A. P. T. Lau, C. Lu, and S. He, “Performance analysis of blind timing phase estimators for digital coherent receivers,” Opt. Express 22(6), 6749 (2014).

20. M. Rice, Digital Communications: A Discrete-Time Approach (Pearson, 2009), Chap. 8.

21. V. Valimaki and A. Haghparast, “Fractional Delay Filter Design Based on Truncated Lagrange Interpolation,” IEEE Signal Process. Lett. 14(11), 816–819 (2007).

22. F. M. Gardner, “Interpolation in Digital Modems—Part I: Fundamentals,” IEEE Trans. Commun. 41(3), 501–507 (1993).

23. J. X. Cai, O. Sinkin, H. Zhang, Y. Sun, A. Pilipetskii, G. Mohs, and N. S. Bergano, “ISI Compensation up to Nyquist Channel Spacing for Strongly Filtered PDM RZ-QPSK using Multi-Tap CMA,” in Optical Fiber Communication Conference (Optica Publishing Group, 2012), paper JW2A.47.

24. J. Wang, C. Xie, and Z. Pan, “Reducing Equalizer Complexity in Coherent Receivers for Nyquist Spectrally Shaped Systems with Matched Filters,” in Optical Fiber Communication Conference/National Fiber Optic Engineers Conference (Optica Publishing Group, 2013), paper OTu2I.3.

25. X. Zhou, X. Chen, W. Zhou, Y. Fan, H. Zhu, and Z. Li, “All-digital timing recovery and adaptive equalization for 112 Gbit/s POLMUX-NRZ-DQPSK optical coherent receivers,” J. Opt. Commun. Netw. 2(11), 984–990 (2010).

26. H. Zhou, Y. Li, D. Lu, L. Yue, C. Gao, Y. Liu, R. Hao, Z. Zhao, W. Li, J. Qiu, X. Hong, H. Guo, Y. Zuo, and J. Wu, “Joint clock recovery and feed-forward equalization for PAM4 transmission,” Opt. Express 27(8), 11385–11395 (2019).

27. D. Kim and M. J. Narasimha, “Design of optimal interpolation filter for symbol timing recovery,” IEEE Trans. Commun. 45(7), 877–884 (1997).

28. F. P. Guiomar, S. B. Amado, A. Carena, G. Bosco, and A. Nespola, “Fully Blind Linear and Nonlinear Equalization for 100G PM-64QAM Optical Systems,” J. Lightwave Technol. 33(7), 1265–1274 (2015).

29. R. D. Gitlin and S. B. Weinstein, “Fractionally-Spaced Equalization: An Improved Digital Transversal Equalizer,” Bell Syst. Tech. J. 60(2), 275–296 (1981).

30. A. Momtaz and M. M. Green, “An 80 mW 40 Gb/s 7-Tap T/2-Spaced Feed-Forward Equalizer in 65 nm CMOS,” IEEE J. Solid-State Circuits 45(3), 629–639 (2010).

Scheme	N_interp	N_TED	N_eq	N_upd	N_mult
Trad. N + M	2×2×N_p×N	2×N_p	2×N_p×M	2×N_p+3×N_p×M	N_p×(4N+5M+4)
N-tap OI-CR	2×2×N_p×N	2×N_p	0	0	N_p×(4N+2)
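
The multiplier counts in the table can be checked with a short script. The sketch below simply transcribes the table's expressions; the function names are hypothetical, and N_p is treated as a free integer parameter (assumed here to be the number of parallel lanes).

```python
def mults_traditional(n_taps, m_taps, n_p):
    """Multiplications for the 'Trad. N + M' scheme, summed from the table columns."""
    n_interp = 2 * 2 * n_p * n_taps      # N-tap interpolator
    n_ted = 2 * n_p                      # timing error detector
    n_eq = 2 * n_p * m_taps              # M-tap adaptive equalizer
    n_upd = 2 * n_p + 3 * n_p * m_taps   # equalizer coefficient update
    return n_interp + n_ted + n_eq + n_upd


def mults_oi_cr(n_taps, n_p):
    """Multiplications for the 'N-tap OI-CR' scheme (no separate equalizer or update)."""
    n_interp = 2 * 2 * n_p * n_taps
    n_ted = 2 * n_p
    return n_interp + n_ted


if __name__ == "__main__":
    n_p = 1  # per unit of N_p; scale by the actual value of N_p
    print("Lagrange 4 + 7:", mults_traditional(4, 7, n_p))
    print("11-tap OI-CR:  ", mults_oi_cr(11, n_p))
```

For the two configurations compared in the prototype, the expressions give 55 multiplications per unit of N_p for "Lagrange 4 + 7" and 46 for the 11-tap OI-CR.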

	DSP with 11-tap optimal interpolator			DSP with “traditional Lagrange 4 + 7”
Resource	CR	Other	Total (utilization)	CR	AEQ	Other	Total (utilization)	Diff. vs. OI-CR
LUT	49K	345K	395K (23%)	35K	227K	345K	607K (35%)	+54%
Reg.	35K	364K	400K (12%)	24K	47K	368K	438K (13%)	+9.8%
CARRY8	3.7K	21.4K	25K (12%)	1.6K	28K	21K	50K (23%)	+101%
BRAM	256	524.5	780.5 (29%)	128	0	524.5	652.5 (24%)	-16%
DSP48	2912	877	3789 (31%)	1064	2304	877	4245 (35%)	+12%
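
The Diff. column appears to be the relative change of the traditional scheme's total with respect to the OI-CR total. The sketch below recomputes it from the rounded totals in the table, so small deviations from the printed values (presumably computed from unrounded counts) are expected; the dictionary and function names are illustrative.

```python
# (OI-CR total, "Lagrange 4 + 7" total), copied from the rounded table entries
totals = {
    "LUT": (395e3, 607e3),
    "Reg.": (400e3, 438e3),
    "CARRY8": (25e3, 50e3),
    "BRAM": (780.5, 652.5),
    "DSP48": (3789, 4245),
}


def relative_diff_percent(oi_cr, traditional):
    """Change of the traditional total relative to the OI-CR total, in percent."""
    return 100.0 * (traditional - oi_cr) / oi_cr


diffs = {name: relative_diff_percent(oi, trad) for name, (oi, trad) in totals.items()}

if __name__ == "__main__":
    for name, diff in diffs.items():
        print(f"{name}: {diff:+.1f}%")
```

A positive value means the traditional scheme uses more of that resource; BRAM is the only resource for which the OI-CR design uses more.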
