Low-margin optical networking at cloud scale [Invited]

Mark Filer; Jamie Gaudette; Yawei Yin; Denizcan Billor; Zahra Bakhtiari; Jeffrey L. Cox

doi:10.1364/JOCN.11.000C94

Journal of Optical Communications and Networking, Vol. 11, Issue 10, pp. C94-C108 (2019)
https://doi.org/10.1364/JOCN.11.000C94


Open Access


About this Article
History
- Original Manuscript: April 19, 2019
- Revised Manuscript: August 13, 2019
- Manuscript Accepted: August 13, 2019
- Published: September 19, 2019
Virtual Issues
Journal of Optical Communications and Networking Low-Margin Optical Networks (2019)

September 20, 2019 Spotlight on Optics

Abstract

Every day, customers across the globe connect to cloud service provider servers with requests for diverse types of data, requiring instantaneous response times and seamless availability. The physical infrastructure that underpins those services is based on optics and optical networks, and this paper focuses on Microsoft’s approach to the optical network. Maintaining a global optical networking infrastructure that meets these customer needs means Microsoft must use solutions that are highly tailored and optimized for the application space they address, with appropriately streamlined solutions for metropolitan data center interconnect and long-haul portions of the network. This paper presents Microsoft’s approach for tackling these challenges at cloud scale, highlighting the low-margin solutions that are employed. We provide a survey of Microsoft’s regional network design and corresponding optical network architectures, and present volumes of real-time polled metrics from the thousands of line systems and tens of thousands of transceivers deployed today. We close by describing our approach to a unified software-defined networking toolset, which ultimately enables the velocity and scale with which we can grow and operate this critical network infrastructure.

1. INTRODUCTION

Within metropolitan areas and across the wide area network (WAN), cloud service providers (CSPs) must offer extremely high bandwidths with near-perfect service availability and appropriate latencies to meet customer demands for diverse types of data, including enterprise cloud applications and email, VOIP, streaming video, IoT, search, and cloud storage. Microsoft believes this can be achieved most effectively by distributing data centers (DCs) throughout a given metropolitan area, interconnecting them with optical transport systems—an application space which has been coined “data center interconnect” (DCI) [1,2]. A grouping of these geographically distributed data centers within a metropolitan area can be generically referred to as a “region,” and Microsoft refers to this design comprehensively as the regional network architecture.

Within a region, separation between DCs must be far enough to avoid multiple site failures during catastrophic events, but the physical network connecting them cannot exceed the round-trip latency requirements of the application layer, typically less than a few milliseconds. All distributed data centers in the regional network are connected to each other over numerous diverse point-to-point DCI systems, carrying hundreds of 100G inter-switch links on each, allowing the region to operate effectively as one mega data center with petabits per second of low-latency inter-DC capacity. Between regions, where fiber resources are more constrained by an order of magnitude or more, the criteria for optimization are different: traffic is less latency sensitive, distances are much greater, and spectral efficiency is critical.

Due to scale, DCI systems are almost always based on some form of dense wavelength division multiplexing (DWDM) system. Traditional DCI systems have leveraged existing coherent transceivers designed for long-haul applications, burdened with a variety of features to optimize reach and performance for thousands of kilometers of fiber transmission. It is common to reuse this hardware within regional networks even though the regional latency requirements limit fiber distance to tens of kilometers [3]. At the scale of Microsoft’s regional network, the additional space and power consumption makes traditional hardware impractical. Further, the complexity of the long-haul features and proprietary “pizza box” form factors, which require special training for installation and troubleshooting and additional software-defined networking (SDN) toolsets, hinder mass global deployment. Several years ago, Microsoft partnered with the industry to create a focused DCI solution based on 100G DWDM four-level pulse amplitude modulation (PAM4) transceivers [4]. Packaged in the industry standard QSFP-28 form factor, these transceivers consume only 4.5 W, and plug directly into data center switches, eliminating external media conversion and associated layer-3 to layer-1 links [5]. To eliminate the complexities associated with dispersion compensation, we use an open line system (OLS) centrally managed by our SDN controller and equipped with automated gain setting and dispersion detection and control.

Fig. 1. Microsoft global DC + WAN footprint.


Similarly for the long haul, Microsoft has made the transition from operating vendor-managed proprietary, closed line systems to OLS-based systems with optical sources residing directly on the routers [5–7]. The line systems are optimized for the Microsoft use case; namely, point-to-point inter-region connectivity capable of transporting foreign (i.e., alien) optical signals over long distances with maximal optical signal-to-noise ratio (OSNR) and spectral efficiency. Due to the decoupling of optical sources from the line system, Microsoft owns the end-to-end link budgeting and performance service level agreements for the deployed technologies and can accordingly operate them with tighter margins than we would get with “off-the-shelf” proprietary line systems and transponders.

In this paper, we seek to highlight how Microsoft operates what would traditionally be viewed as low-margin networks at scale. We begin by providing a survey of Microsoft’s regional network design, emphasizing how we have transitioned from a “mega data center” model to a distributed regional model in recent years. We then discuss some of the implementation specifics of our DCI and long-haul optical systems, calling attention to areas where we diverge from traditional architecture, deployment, and operational models in these application spaces which enable us to meet the cost, scale, and availability requirements we set for customers. Finally, we review performance and reliability requirements of our deployed infrastructure, supported by data captured from the Microsoft private network, including polled metrics from over 50,000 DWDM PAM4 devices and data from our long-haul coherent infrastructure. The sample sets presented have been chosen based on installation date alone, with no pruning or selection. We close with a discussion on deployment velocity and the software-defined control and automation needed to support such low-margin networks at scale.

2. MICROSOFT REGIONAL ARCHITECTURE

In order to provide high-availability, high-performance cloud computing services to customers in all parts of the world, Microsoft must have compute and storage presence as close to the customers as economics allow—which generally means in every major city across the globe (Fig. 1) [8]. Up until the past few years, Microsoft realized this by employing a “mega data center” model of establishing regional presence, whereby large singular campus facilities were constructed in a city to create the region. These facilities were a single campus composed of multiple large data center buildings (labeled “DC” in Fig. 2). These buildings were interconnected to one another via gray optics and bulk fiber through the four core network rooms [“CNR” A through D in Fig. 2(a)], which also served as inter-region ingress and egress points for traffic traversing Microsoft’s private WAN. The campus-based regions generally met the needs at the time, providing capacity to support a given geography with local, low-latency resources that could scale up to the size of a particular campus.

Fig. 2. (a) Legacy “mega DC” architecture and (b) current regional architecture.


However, there were some drawbacks to the singular campus approach which were not fully appreciated at conception. First was that construction times tended to be prolonged with the large monolithic facilities, and the “day 1” cost to establish such large infrastructure may not be justifiable in the early days of a region. Second was that, although the campuses were large, the capacity of a region may need to be scaled beyond a singular campus—multiple campuses would ultimately be required to meet the cloud-scale growth. Lastly, large campus structures are not fully resilient to catastrophic campus-level failures, raising the potential for an entire region to fail and cause customer impact.

A natural evolution was to retain the same fundamental design but allow the DC facilities to be distributed over a wider geographic area within a region and to include multiple campuses in the design, effectively extending the logical topology typically observed inside a data center (referred to as the data center Clos fabric [9,10]) across a metropolitan area. In Microsoft’s case, to interconnect the multiple sites in a region, we identify and designate two diverse facilities as regional network gateways [“RNG” in Fig. 2(b)]. Data centers are then redundantly connected back to each of the RNGs by dark fiber, ensuring that the server-to-server latency (primarily due to the fiber itself) remains low enough to meet the application layer requirements. For Microsoft, the design requires that no single DC-to-RNG path exceeds 60 km.
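As a rough check on how the 60 km path limit relates to the latency budget, the sketch below estimates propagation delay only. The group index of 1.468 is an assumed, typical value for standard single-mode fiber near 1550 nm and is not taken from the paper; switching and queuing delays are ignored, and the function names are illustrative.

```python
# Propagation-only latency estimate for a DC -> RNG -> DC path.
# Assumption (not from the paper): group index ~1.468 for standard
# single-mode fiber near 1550 nm. Switch and queuing delays are ignored.
C_KM_PER_S = 299_792.458  # speed of light in vacuum, km/s
GROUP_INDEX = 1.468       # assumed typical value


def one_way_latency_ms(fiber_km: float, group_index: float = GROUP_INDEX) -> float:
    """One-way propagation delay over fiber_km of fiber, in milliseconds."""
    return fiber_km * group_index / C_KM_PER_S * 1e3


def dc_to_dc_rtt_ms(dc1_to_rng_km: float, rng_to_dc2_km: float) -> float:
    """Round-trip propagation delay between two DCs connected through one RNG."""
    return 2 * one_way_latency_ms(dc1_to_rng_km + rng_to_dc2_km)


# Worst case under the 60 km DC-to-RNG limit: both legs at the maximum length.
worst_case_rtt = dc_to_dc_rtt_ms(60, 60)
print(f"worst-case DC-to-DC round trip: {worst_case_rtt:.2f} ms")
```

Under these assumptions, two maximum-length legs give a round trip slightly above 1 ms from fiber propagation alone, which is why the path length cap matters for an application budget of a few milliseconds.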

This distributed regional architecture addresses some of the shortcomings of the previous singular campus design. Establishing and scaling of a region can now be done much more quickly and efficiently, as individual DCs and RNGs can be procured and constructed much more rapidly than the singular large campus. As DCs do not need to connect directly to one another, but only through the RNGs, scaling a region becomes a simple matter of acquiring additional DC facilities and tying them back into the RNGs via dark fiber paths (assuming RNGs are sized appropriately). Lastly, the geographical diversity of the individual DCs gives rise to the notion of availability zones [11,12] [“AZ set” in Fig. 2(b)], which provides regional resiliency under catastrophic failure conditions within a given DC site [13,14]. The impact of the shift to a regional architecture is evident in the number of deployed 100G DWDM ports over time (Fig. 3), where growth has been exponential year over year. These represent ports supporting both the intra-region fabric and the inter-region WAN, but the numbers are dominated by the fabric connectivity by nearly 2 orders of magnitude.

Fig. 3. Impact of regional architecture on Microsoft deployed 100G DWDM ports.


While there are some similarities in the approach Microsoft has taken in its optical design for intra- and inter-region portions of the network, there are some significant differences in design drivers, and therefore optical network architectures, mainly driven by application requirements and availability of fiber resources. The intra-region network application layer requires that the round-trip latency between any two data centers be of the order of 1–2 ms to support applications where, e.g., storage and compute resources could be served out of DC1 and DC2, respectively. At the same time, the individual data centers (or availability zones) must be geographically separated enough to be tolerant to natural disaster or catastrophic site failure. Given that dark fiber resources are relatively plentiful and inexpensive within a region, it follows that maximum spectral efficiency is not a primary design goal and can be traded off for features such as cost, power efficiency, and deployment velocity.

On the other hand, inter-region long-haul routes are driven by different application requirements (primarily traffic bound for the Internet via edge and gateway sites, or internal replication and storage traffic between regions), and long-haul fibers are much more limited, making us resource constrained in that sense. Data traversing long-haul routes are usually less voluminous and less latency-sensitive than the regional workloads. However, the long-haul systems traverse fiber routes that are much longer, much sparser, and much more expensive as compared to the regional routes. This has led us to design transport systems with spectral efficiency and maximal OSNR as primary metrics.

A. Intra-Region (DCI) Optical Networks

As implied above, regional data center interconnect is essentially an extension of server-to-server connectivity inside a metro-regional network across many short-reach DWDM metro-optical systems [Fig. 2(b)]. As mentioned, in this architecture, we are effectively extending the Clos fabric [9] across a metro-optical network to achieve low-cost geographical redundancy. Connectivity inside a traditional Clos fabric is at least four-way redundant and therefore resilient to multiple path failures. Due to the high redundancy and the extremely high port count, cost, space, power, and operational simplicity dominate the engineering design of these DCI systems. Two DWDM architecture variants have emerged among the hyper-scalers: (1) ultra-low-cost DWDM pluggables and full Clos fabric and (2) high-margin transponders and optical protection with ½-Clos. Microsoft has adopted approach (1) because of the operational simplicity, much lower total cost of ownership (TCO) [15], and the option to leverage the redundant bandwidth during steady state, under non-failure conditions. Looking across our thousands of outside plant fiber pairs, through the majority of the world we observe a mean availability better than 99.9% per fiber. Through the use of much lower cost and power DWDM optics, such as the Inphi ColorZ QSFP-28 pluggable [16], we avoid complex optical protection and light the entire data center fabric day one. In other words, more than 364 days a year we offer at least $2\times$ the data center bandwidth to the servers compared to designs that use higher-margin transponders and layer-1 protection to achieve acceptable cost, space, and power. Using a bandwidth broker, we can apply back-pressure for replication with lower time sensitivity for the one day a year when critical redundancy is lost.
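The "more than 364 days a year" figure follows directly from the 99.9% per-fiber availability quoted above. The short sketch below works through that arithmetic; the independence assumption used for the multi-path case is an illustrative simplification, not a claim from the paper.

```python
# Arithmetic behind "more than 364 days a year" at 99.9% per-fiber availability.
DAYS_PER_YEAR = 365


def expected_up_days(availability: float) -> float:
    """Expected number of days per year a single fiber path is available."""
    return availability * DAYS_PER_YEAR


def all_paths_down_probability(availability: float, n_paths: int) -> float:
    """Probability that n_paths diverse paths are down at the same time.

    Assumes independent failures, which is a simplification for illustration.
    """
    return (1.0 - availability) ** n_paths


up_days = expected_up_days(0.999)
down_hours = (DAYS_PER_YEAR - up_days) * 24
print(f"up: {up_days:.3f} days/year, down: {down_hours:.2f} hours/year")
print(f"two diverse paths down together: {all_paths_down_probability(0.999, 2):.1e}")
```

A single path at 99.9% is therefore unavailable for well under one day per year, which matches the "one day a year" of lost redundancy that the bandwidth broker is meant to absorb.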

The Inphi ColorZ PAM4 module is described in detail in [16] and the line system in [17], but at a high level, the systems are based on 100G direct-detect PAM4 QSFP-28 pluggable modules which plug directly into existing Ethernet switch linecards. The system provides metro DWDM connectivity for up to 100 km and aggregates 4 Tb/s onto a single-mode fiber pair. The transmission distance is limited compared to coherent systems, but it is ideal for our needs since the application layer’s latency requirement already limits the maximum transmission distance. The idea at the outset of the ColorZ development was to design a solution that was just good enough to meet the needs of the application from a cost and TCO perspective; by definition, this means leveraging a network design capable of operating reliably at low margins. Due to the relative abundance of dark fiber connecting DCs, spectral efficiency can be traded off for cost, space, and power. Additionally, end-to-end control of these systems is greatly simplified due to the direct plug-in of the DWDM modules into the router linecards, eliminating the need for the layer-1 to layer-0 conversion between connected switch ports.

The DWDM line system supporting the PAM4 solution is minimally composed of multiplexing/demultiplexing (MDM) optics, high-powered booster and pre-amplifiers, and wideband tunable optical dispersion compensation (TODC) [16,17] (Fig. 4). To deploy these systems at scale, deployment models need to be “cookie-cutter” and a significant amount of layer-0 control plane intelligence needs to be built in from day one, such as the ability to self-identify ideal amplifier gains, target per-channel launch powers, and chromatic dispersion compensation settings. Router port striping and line system configuration standards are uniform across all regions to support this, and the variable optical attenuator (VOA) at the output of the booster amplifier ensures that every deployment looks nearly identical from an OSNR and received power perspective. This common deployment “stamp” (Fig. 4) enables simple troubleshooting whenever there is an issue on an underperforming port and straightforward comparison of global performance metrics across all DWDM PAM4 deployments worldwide. Relevant deployment statistics are presented in Section 3.

Fig. 4. PAM4 DCI OLS line system “stamp.”


B. Inter-Region (Long-Haul) Optical Networks

Contrary to the latency-sensitive application demands in the metro DCI network, our long-haul transmission systems are built with spectral efficiency as a primary criterion in order to maximize the lifetime of limited long-haul fiber infrastructure, and to future-proof the system as much as possible for customers’ bandwidth growth. Microsoft’s line system technologies are described in detail in [5] and [18], but in summary, the systems are composed of colorless, directional flexible grid MDM and wavelength-selective switches, dual-gain erbium-doped fiber amplifier (EDFA) and optional hybrid Raman/EDFA amplification, and are deployed primarily in point-to-point topologies. The system operates with up to $120 \times 34\,\mathrm{Gbaud}$ channels in C-band at 37.5 GHz spacing and can adapt the channel plan to accommodate future transceiver technologies. Moreover, it is an OLS [5], in the sense that the DWDM optical sources are decoupled from the line system itself (Fig. 5) [19,20], providing the freedom to choose from a variety of optical source technologies, such as the cloud-optimized “pizza-box” transponders [5] or router-pluggable analog-coherent-optical- [21] or digital-coherent-optical-based [22] modules. This flexibility allows us to keep up with the pace of cutting-edge industry developments while utilizing the same line system over multiple generations of transceiver technology. In this way, line system costs can be amortized over 10 or more years, saving on TCO capital expenditures, and extending the life of our internally developed SDN tooling.

Fig. 5. Open line system concept.


Fig. 6. Long-haul OLS line system “stamps.”


The sources employed over the long-haul OLS are router-based integrated coherent optics (ICO), which are described in detail in [6,7,21]. The sources are bandwidth variable and allow multiple modulation formats and bit rates to suit each specific link’s fiber characteristics and delivered SNR. Namely, the sources support quadrature phase shift keying (QPSK), 8-ary quadrature-amplitude modulation (8QAM), and 16QAM modes at 100, 150, and 200 Gb/s payload rates, respectively. Having the sources integrated on the routers provides the same benefits in operational and tooling simplicity that we derive in the DCI PAM4 deployments. These sources are treated by the line system as foreign waves, and both the layer-3 devices/sources and line system are unified under a common SDN controller (Fig. 5).
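Combining the channel plan stated earlier (up to 120 channels at 37.5 GHz spacing) with the payload rates per modulation format gives the per-fiber-pair figures sketched below. This is back-of-the-envelope arithmetic that assumes every channel on a route uses the same format; the constant and function names are illustrative.

```python
# Per-fiber-pair capacity and spectral efficiency implied by the channel plan.
# Assumes a fully populated line with one modulation format on every channel.
CHANNELS = 120
SPACING_GHZ = 37.5
PAYLOAD_GBPS = {"QPSK": 100, "8QAM": 150, "16QAM": 200}


def occupied_spectrum_thz() -> float:
    """Total spectrum occupied by the full channel plan, in THz."""
    return CHANNELS * SPACING_GHZ / 1000


def fiber_capacity_tbps(modulation: str) -> float:
    """Payload capacity of a fully filled fiber pair, in Tb/s."""
    return CHANNELS * PAYLOAD_GBPS[modulation] / 1000


def spectral_efficiency(modulation: str) -> float:
    """Payload spectral efficiency in b/s/Hz at the 37.5 GHz grid."""
    return PAYLOAD_GBPS[modulation] / SPACING_GHZ


print(f"occupied spectrum: {occupied_spectrum_thz():.1f} THz")
for fmt in PAYLOAD_GBPS:
    print(f"{fmt}: {fiber_capacity_tbps(fmt):.0f} Tb/s, "
          f"{spectral_efficiency(fmt):.2f} b/s/Hz")
```

The doubling of capacity from QPSK to 16QAM on the same spectrum illustrates why applying the highest-order format a route supports is attractive where long-haul fiber is scarce.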

Similar to the DCI line systems, the long-haul systems have been boiled down to their essential nodal variations, or “stamps,” as seen in Fig. 6. Three nodal variants exist across the private Microsoft global long haul:

(1) Optical line terminal node (OLT)—allows full ingress/egress of router-based DWDM traffic.
(2) Optical line amplifier node (OLA)—provides EDFA-only or hybrid Raman/EDFA gain at each amplifier site.
(3) Optical line reconfigurable optical add-drop multiplexer (ROADM) (OLR)—provides periodic spectral equalization and the ability to add/drop limited traffic for, e.g., edge and peering locations.

Maintaining only a minimal set of nodal variations—OLT, OLA, and OLR—with standard and uniform linecard slotting and directional conventions, allows us to deploy these systems at scale, leveraging the same deployment models and SDN toolset across OLS installations throughout the world.

An additional distinction in how we operate our networks compared to traditional long-haul systems is that we noise-load the unoccupied portion of the spectrum with amplified spontaneous emission (ASE) (Fig. 7) from dedicated broadband noise-generating sources. There are multiple reasons for this, one of which is the operational and tooling simplicity of having “fully filled” line systems across all our deployments. It additionally allows us to exercise all portions of the spectrum day one to ensure there are not any hidden issues awaiting us as we continue to augment our systems with traffic-carrying signals. But a more fundamental reason ties into margins. Having a fully loaded line system from the start helps ensure there are no surprises from non-linear performance degradation as we continue to add channels, because the system effectively operates at near end-of-life performance points from day one [23–26]. Also, with an eye toward future line system technologies including C + L band support, having a fully loaded spectrum greatly simplifies the optical control algorithms for spectral management and equalization, which is of particular importance with the strong Raman scattering effects at play in multi-band systems [27].

Fig. 7. Long-haul OLS optical spectrum, showing fully filled C-band with data-carrying signals in the upper frequencies and ASE noise-loading throughout the remainder.


Table 1. Global Fiber Specifications


The OLS also extends SDN control (configuration, monitoring, data collection, alerting/ticketing) down to the individual optical network elements from just the layer-3 devices where it resided historically, giving end-to-end SDN control from layer-0 up through layer-3. This frees Microsoft from having to rely on vendor-supplied proprietary, manually driven element- or network-management systems, and is necessary functionality to move Microsoft toward developing and operating a zero-touch network [28].

3. OPTICAL PERFORMANCE AND RELIABILITY

With a large, globally deployed metro-DCI and long-haul footprint, we are able to extract metrics across a statistically significant number of devices and infrastructure. This section presents data related to our fiber infrastructure, DCI PAM4 systems, and long-haul coherent systems which demonstrate the actual implementation and operation of low-margin networks at cloud scale.

Fig. 8. DCI fiber quality statistics.


A. Fiber Quality and Infrastructure

A high-performing regional network requires diverse and high-quality dark fiber. Excessive fiber loss degrades optical performance, and can indicate installation negligence, repair negligence, or excessive inline connectors, all resulting in higher probability of systemic path failures during lifetime. To drive repeatable quality, we source to a global standard shown in Table 1, with separate requirements for metro and long-haul fiber. In addition to standardization, we obtain historical availability data to verify path integrity. Once delivered, all fiber pairs are tested end-to-end against the global standard and non-compliances are remediated. Availability and loss are monitored throughout the lifetime of the fiber and compared to design requirements (see Section 4.B). As shown in Fig. 8, we have achieved a nearly 90% success rate against the global standard for DCI fiber (exceptions sometimes must be made in fiber-limited parts of the world). The success rate continues to improve year-over-year as dark fiber providers adjust to the needs of the CSPs.

Fig. 9. (a) Fiber distance and (b) loss distributions of the first $>1500$ deployed DWDM PAM4 line systems.


Fig. 10. (a) Fiber distance, (b) loss, (c) type, and (d) route length distributions across the first 26,000 km of deployed Microsoft long-haul OLS installations. Fiber types: “LC” = large core (large effective area, e.g., $\ge 80\,\mu\mathrm{m}^2$), “SC” = small core (small effective area, e.g., $\le 55\,\mu\mathrm{m}^2$).


Distributions of Microsoft owned or leased DCI fiber distances and losses are shown in Figs. 9(a) and 9(b), respectively. The dataset includes statistics across the first 1500 deployed line systems supporting the intra-region PAM4 DWDM solution to date. The limited ranges of values seen in loss and distance underscore how well-suited the PAM4 solution is to meet the needs of this application space. Distances are effectively limited to 60 km, with the vast majority less than 50 km. Likewise losses are practically limited to 19 dB, with the vast majority less than 15 dB. The overwhelming majority of fiber types in the DCI infrastructure are ITU G.652 (98%) with the remainder being a variant of G.655.

Similar distributions have been gathered for Microsoft’s owned or leased OLS long-haul infrastructure (Fig. 10), which today is primarily in North America and Europe. Similar data was gathered on Microsoft’s North American legacy infrastructure in [5]. This dataset compares favorably despite the physical diversity on most paths: the North American backbone is dominated by G.655 large effective area fiber (labeled “G.655 LC”), while the European backbone is primarily G.652 or equivalent non-dispersion shifted fiber (NDSF) [Fig. 10(c)]. Spans (i.e., the portion of fiber between any two network elements) are generally long (majority $\ge 70\,\mathrm{km}$) and losses high (majority $\ge 16\,\mathrm{dB}$), particularly in North America, many of which require Raman amplification to deliver optimal OSNR. Europe is generally more forgiving in this sense due to geography and lower non-linearity fibers (i.e., NDSFs). Total route lengths, i.e., distance connecting two regions, range from 200 to 2500 km [Fig. 10(d)], with the large majority able to be addressed by 16QAM and 8QAM modulation formats.

Fig. 11. $Q$-factor distribution of the first 50,000 deployed DWDM PAM4 ports.


Fig. 12. BER stability over a 50-day period.


B. Optical Performance Statistics

Instantaneous pre-forward-error-correction (FEC) bit-error ratio (BER) values were polled in real time from our first 50,000 DWDM PAM4 transceivers, which are slotted into five different layer-3 switch variants. The BER values were converted to $Q$-factor, and the results are summarized in Fig. 11. We use a $Q$-factor of 8 dB as our minimum acceptable pre-FEC performance at start-of-life, as indicated by the dotted red line labeled “FEC limit” in Fig. 11. Results show tight but comfortable performance margins for systems of up to 21 dB of fiber loss, with an average $Q$ margin of nearly 3.3 dB, and $>95\%$ of the ports with more than 2.2 dB of $Q$ margin. It is worth noting that at any point in time, we expect there to be some unhealthy ports in the network, particularly in the DC and DCI fabric. The Clos architecture is fundamentally designed to be tolerant to that without any customer impact, and hence we take a statistical approach to performance margining. Looking at Fig. 11, the $Q$ distribution of PAM4 ports shows a long but statistically low tail toward the FEC limit; our automated tooling continuously monitors the health of all deployed devices and flags these unhealthy ports for troubleshooting and replacement (see Section 4). But this is a key distinction from the way traditional carriers treat optical circuits and transport systems, where one optical circuit corresponds to one customer, and therefore even a single bit error is impactful.
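The paper does not spell out the BER-to-$Q$ conversion. A minimal sketch is given below, assuming the commonly used Gaussian-noise relationship $\mathrm{BER} = \tfrac{1}{2}\,\mathrm{erfc}(Q/\sqrt{2})$ with $Q_{\mathrm{dB}} = 20\log_{10} Q$; the function names are illustrative, and the 8 dB default is the start-of-life minimum quoted above.

```python
# BER <-> Q-factor conversion under the Gaussian-noise convention
# BER = 0.5 * erfc(Q / sqrt(2)), i.e. BER = Phi(-Q) for the standard normal CDF.
# Assumption: this is the conversion used; the paper does not state it explicitly.
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # zero mean, unit variance


def ber_to_q_db(ber: float) -> float:
    """Convert a pre-FEC BER (0 < ber < 0.5) to Q-factor in dB."""
    if not 0.0 < ber < 0.5:
        raise ValueError("ber must be strictly between 0 and 0.5")
    q_linear = -_STD_NORMAL.inv_cdf(ber)
    return 20 * math.log10(q_linear)


def q_db_to_ber(q_db: float) -> float:
    """Convert a Q-factor in dB back to the equivalent BER."""
    q_linear = 10 ** (q_db / 20)
    return _STD_NORMAL.cdf(-q_linear)


def q_margin_db(ber: float, q_limit_db: float = 8.0) -> float:
    """Margin of a polled BER above a minimum acceptable Q (default 8 dB)."""
    return ber_to_q_db(ber) - q_limit_db


limit_ber = q_db_to_ber(8.0)  # BER equivalent of the 8 dB minimum
print(f"BER at 8 dB Q: {limit_ber:.2e}")
print(f"margin at BER 1e-4: {q_margin_db(1e-4):.2f} dB")
```

Under this convention, the 8 dB minimum corresponds to a pre-FEC BER of several times $10^{-3}$, and each polled BER maps to a single margin figure that can be compared across all deployed ports.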

The performance of non-coherent PAM4 can vary as time-dependent linear impairments, such as loss and polarization mode dispersion, drift over lifetime. To analyze stability, we reviewed time-sampled BER across five global regions with time samples taken every 6–10 min. Figure 12 shows typical results from randomly selected channels, where BER has been converted to $Q$-factor. The time-sampled data produced a $Q$-factor standard deviation of 0.13 dB and min/max $Q$-factor delta of 0.78 dB. With respect to our global average $Q$-factor of 11.3 dB, we observe good performance margin after factoring time-varying performance deviation.
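The margin statement above can be made concrete with a few lines of arithmetic. The sketch below computes the two stability statistics from a series of $Q$ samples and then subtracts the full min/max delta from the average margin, which is a deliberately conservative reading. The sample series is made up for illustration; only the 11.3 dB, 8 dB, and 0.78 dB figures come from the text.

```python
# Stability statistics for time-sampled Q-factor and a conservative margin check.
from statistics import pstdev


def q_stability(q_samples_db: list[float]) -> dict[str, float]:
    """Standard deviation and min/max delta of a Q-factor time series (dB)."""
    return {
        "std_db": pstdev(q_samples_db),
        "delta_db": max(q_samples_db) - min(q_samples_db),
    }


def worst_case_margin_db(mean_q_db: float, q_limit_db: float, delta_db: float) -> float:
    """Margin left if Q drops by the full observed min/max delta."""
    return mean_q_db - q_limit_db - delta_db


# Hypothetical samples, for illustration only (not measured data).
example = q_stability([11.2, 11.4, 11.3, 11.1, 11.5])
print(example)

# Figures reported in the text: 11.3 dB average Q, 8 dB minimum, 0.78 dB delta.
remaining = worst_case_margin_db(11.3, 8.0, 0.78)
print(f"margin after worst observed drift: {remaining:.2f} dB")
```

Even with the full 0.78 dB excursion subtracted, roughly 2.5 dB of $Q$ margin remains relative to the 8 dB minimum.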

Fig. 13. Temperature distribution over optical module infrastructure.


At 4.5 W per module, an initial concern was heat and its impact on reliability. Figure 13 shows the measured module temperature of our first 15,000 samples. For comparison, we overlay a random sampling of gray optic module temperatures, including active optical cables, PSM4 [29], and CWDM4 [30], in the same switches hosting our DWDM PAM4 sample set. Ambient temperatures in our data centers and RNGs are generally 25°C or lower. As expected, the PAM4 modules run significantly hotter than the gray optics, but within the QSFP-28 maximum specification of 70°C. To understand if there is any measurable impact of the higher operating temperatures, we analyzed the total observed failure rates as the sum of infant failures and in-production failures and compared to failure rates of our gray optics over the past year. Results are summarized in Fig. 14 where we observe that even though the PAM4 modules run significantly hotter than our gray optics, failure rates are similar. Interestingly, nearly all of the PAM4 failures were failed on arrival and discovered during initial turn-up; we have experienced very few in-service failures of our PAM4 modules.

Fig. 14. Failure rates of deployed 100G technologies.


Similar statistics for long-haul deployments are a bit more challenging to present since the deployment routes and topologies are far from uniform, unlike in the DCI. However, because we operate these links on open lines and internally own the optical link budgeting (done through a combination of optical propagation modeling and empirical data taken in Microsoft’s labs) and performance margining, we can operate them as close to the performance limits as we judge possible without putting our customers at risk. As such, we squeeze nearly every bit possible out of our sources by applying the highest-order modulation supported by a given route and line system.

Fig. 15. Long-haul OLS BER statistics across 26,000 km of deployed infrastructure: (a) 16QAM, (b) 8QAM, and (c) QPSK.

A summary of the BER performance (converted to $Q$-factor) across our long-haul OLS coherent infrastructure is shown in Fig. 15, representing hundreds of deployed transceiver ports. The datasets are split by modulation format for clarity: (a) 16QAM, (b) 8QAM, and (c) QPSK. The FEC limit for the transceiver technologies deployed is about 5.2 dB, with the large majority of ports operating with nearly 2.0 dB or more of $Q$ margin. Again, since these systems are fully loaded from day one, we do not expect further degradation of optical performance as we add more channels over time [25,26]. The $Q$ distributions are tightly grouped for the 16QAM and 8QAM routes; this is expected due to the tighter margins and higher BER floors on these modulation formats (when compared to QPSK). The QPSK performance is much more broadly distributed, given that the route distances involved (max 2500 km) are not particularly challenging for QPSK. The few ports with lower margins ($\le 9$ dB $Q$) are on one of our subsea paths.
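The BER-to-$Q$ conversion and the resulting margin against the FEC limit can be illustrated with a short sketch. It assumes the common Gaussian-noise relation BER = ½ erfc(Q/√2), which the text does not state explicitly; the function names are hypothetical, and the 5.2 dB limit is the approximate figure quoted above.

```python
import math
from statistics import NormalDist

FEC_LIMIT_Q_DB = 5.2  # approximate FEC limit quoted in the text


def q_db_from_ber(ber: float) -> float:
    """Convert a pre-FEC BER to Q-factor in dB (assumes Gaussian noise)."""
    if not 0.0 < ber < 0.5:
        raise ValueError("BER must lie strictly between 0 and 0.5")
    # BER = 0.5 * erfc(Q / sqrt(2)) is the standard normal tail Phi(-Q).
    q_linear = -NormalDist().inv_cdf(ber)
    return 20.0 * math.log10(q_linear)


def ber_from_q_db(q_db: float) -> float:
    """Inverse of q_db_from_ber under the same assumption."""
    q_linear = 10.0 ** (q_db / 20.0)
    return NormalDist().cdf(-q_linear)


def q_margin_db(ber: float, fec_limit_q_db: float = FEC_LIMIT_Q_DB) -> float:
    """Margin in dB between the measured Q and the FEC limit."""
    return q_db_from_ber(ber) - fec_limit_q_db


if __name__ == "__main__":
    for ber in (1e-2, 1e-3, 1e-4):
        print(f"BER={ber:.0e}  Q={q_db_from_ber(ber):.2f} dB  "
              f"margin={q_margin_db(ber):.2f} dB")
```

Under this assumption, a port is above the FEC limit whenever `q_margin_db` is positive; the thresholds used operationally are not given in the text.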

4. TOOLING AND AUTOMATION SUPPORTING LOW-MARGIN SYSTEMS AT SCALE

A. Stateful Workflow

It is true that operating lower margin optical systems increases the risk of a customer-impacting failure, even with full-Clos redundancy. In the event that we risk customer outage, we need the ability to show our partner teams at Microsoft as well as our customers exactly where the failure is, attempt immediate mitigation with automation, and communicate precisely when we will restore service. Much research has been carried out in recent years around SDN-enabled techniques for fault management, detection, identification, and healing [31–36]. These types of actions sound simple but involve a rethink of the DWDM management strategy. To keep up with the growing number of devices, software and tooling must auto-remediate where possible and triage notifications with priorities and end-to-end network context. For example, due to built-in redundancies, the vast majority of DWDM failures present no change in customer experience and are therefore automatically triaged as low severity. On the other hand, some failures risk service degradation if not immediately addressed, including the dreaded “gray failure” where packets are partially lost [32,33]. With hundreds of thousands of ports, the ability to effectively distinguish between the two and auto-remediate or prioritize is critical.

With this context, the idea that the DWDM system resides in a silo, and can be managed with vendor specific network management, immediately presents a challenge. Alerting should be thought of as two-fold. The DWDM system must have the ability to clearly deliver the signals required to triangulate the fault and impact at layer-1. At the same time, we simulate customer traffic with “customer-centric probes” distributed inside our network, measuring latency and packet loss in real time and alerting when deviations are observed. These provide effective measurements of how the customers are experiencing network failures. The two alerting sources must be combined to triage and prioritize remediation. Over time, near immediate auto-mitigation of all dangerous failures is a necessity.
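As an illustration of combining the two alerting sources, the following is a minimal sketch of a triage rule. It is not the production logic: the class names, the severity levels, and the `loss_threshold` value are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class Severity(Enum):
    LOW = "low"            # redundancy absorbed the fault; probes see no impact
    HIGH = "high"          # redundancy lost, or probes see complete loss
    CRITICAL = "critical"  # probes see partial ("gray") packet loss


@dataclass(frozen=True)
class Layer1Alert:
    """Hypothetical layer-1 alert from the DWDM system."""
    link_id: str
    redundancy_intact: bool


@dataclass(frozen=True)
class ProbeReading:
    """Hypothetical customer-centric probe measurement."""
    link_id: str
    packet_loss: float  # fraction of probe packets lost, 0.0 to 1.0


def triage(alert: Layer1Alert,
           probes: Iterable[ProbeReading],
           loss_threshold: float = 0.001) -> Severity:
    """Combine a layer-1 alert with probe data for the same link."""
    losses = [p.packet_loss for p in probes if p.link_id == alert.link_id]
    # Partial loss is treated as a gray failure and prioritized highest.
    if any(loss_threshold <= loss < 1.0 for loss in losses):
        return Severity.CRITICAL
    if not alert.redundancy_intact or any(loss >= 1.0 for loss in losses):
        return Severity.HIGH
    return Severity.LOW
```

In this sketch, a fault with intact redundancy and no probe deviation lands in the low-severity bucket, matching the behavior described for the vast majority of DWDM failures.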

In ruling out the use of typical network management [37], a suitable alternative must be employed: enter the stateful workflow. Remediation of failures is separated into two stages. First, we mitigate customer impact; second, we execute complete resolution. In the example of a gray failure, mitigation could be turning off an amplifier to provide clean, complete packet loss, allowing the data center fabric to self-heal at the IP layer. Resolution would involve replacement of the faulty hardware or repair of the DWDM fiber path. Applied across hundreds of thousands of DWDM ports, hundreds of mitigation and remediation “workflows” are running every day. Automating workflows with built-in state management allows long-running workflows to be paused. Adding an effective, self-guiding user interface enables two-way interaction with technicians, who can now self-serve resolution without having to depend on optical expertise (an example of this is shown in Fig. 16, a screenshot from a technician-facing transceiver troubleshooting workflow). State management also allows coordination with the customer-centric alerting, allowing the customer probes to trigger a pause, stop, or rollback on any of the workflows that may have caused impact.

The stateful workflows also allow continual self-improvement, providing data for all device configurations, as well as timelines on mitigating and resolving issues, measuring the effectiveness of our tooling, and providing data for higher layer architecture and design. The workflows are easily reviewable and composable by the operators with the most experience in resolving the issues. The review of the underlying code is abstracted from the review of the workflow process, allowing the most skilled operators to directly contribute to the system that is running the network. This abstraction is achieved by the workflow system’s own domain-specific language, which abstracts the business logic and defines the interaction of the user with the devices. Workflow diagrams can be auto-generated so that anyone can review them without any programming experience. An example of this is seen in Figs. 17 and 18, where an auto-generated workflow diagram is shown (Fig. 17), along with the corresponding YAML-based workflow output of a router linecard refresh (Fig. 18).
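The pause, resume, and rollback behavior described above can be sketched as a small state machine. This is only an illustration under stated assumptions: the `Workflow`, `Step`, and `State` names are hypothetical and do not reflect the domain-specific language or YAML format of the production system.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class State(Enum):
    PENDING = "pending"
    RUNNING = "running"
    PAUSED = "paused"
    DONE = "done"
    ROLLED_BACK = "rolled_back"


def _noop() -> None:
    """Default undo action for steps that need no rollback."""


@dataclass
class Step:
    name: str
    run: Callable[[], None]
    undo: Callable[[], None] = _noop


@dataclass
class Workflow:
    steps: List[Step]
    state: State = State.PENDING
    cursor: int = 0               # index of the next step to run
    pause_requested: bool = False

    def request_pause(self) -> None:
        """Called, for example, when a customer probe reports impact."""
        self.pause_requested = True

    def run(self) -> State:
        """Run (or resume) from the saved cursor until done or paused."""
        if self.state in (State.DONE, State.ROLLED_BACK):
            return self.state
        self.state = State.RUNNING
        while self.cursor < len(self.steps):
            if self.pause_requested:
                self.pause_requested = False
                self.state = State.PAUSED
                return self.state
            self.steps[self.cursor].run()
            self.cursor += 1
        self.state = State.DONE
        return self.state

    def roll_back(self) -> State:
        """Undo completed steps in reverse order."""
        while self.cursor > 0:
            self.cursor -= 1
            self.steps[self.cursor].undo()
        self.state = State.ROLLED_BACK
        return self.state
```

Because the cursor and state live on the workflow object, they could be persisted between steps, which is what would let a long-running workflow be paused and later resumed; the persistence layer itself is omitted here.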

Fig. 16. Typical deployment workflow screenshot.

Fig. 17. Screenshot taken from production tooling showing the workflow for the router linecard refresh process; “wanetmon” is an internal device health checker, “swan” is a specific WAN router variant.

Fig. 18. Production tooling screenshot: YAML underpinning of the linecard deployment workflow shown in Fig. 17.

In summary, modern network management is a collection of stateful workflows, intuitive technician-focused graphical user interfaces, and an “air traffic controller” managing and coordinating the workflows across the globe, with inputs from both the DWDM telemetry and the customer-centric alerting system. The workflow integrates optical provisioning, physical verification, and fault remediation to simplify deployment and operation of DCI and long-haul systems to the greatest degree possible.

B. Configuration, Data Collection, and Alerting

A large amount of time is spent abstracting the low-level device interaction and vendor features and then integrating them into the stateful workflow. While the workflows generally remain unchanged over time, a tremendous amount of effort is expended in developing drivers as we upgrade technology and change suppliers. We regard this time as wasteful: the stateful workflow solves business problems, whereas the hardware interaction often amounts to converting vendor-specific conventions to our standard libraries [37].

When attacking this complexity, we focus on ultra-reliable systems and take a minimalist approach to both the hardware and application programming interfaces (APIs). For example, the PAM4 transceiver aligns with our minimalist approach: it is designed specifically for regional DCI with focused Ethernet and metro-optical feature sets, with APIs standardized in the SFF specifications [38] that require no additional software development for tooling integration beyond standard layer-3 APIs. After router configuration, automated power and dispersion optimization are executed. Once optimized, technicians can add and troubleshoot transceivers in the same manner as LC-duplex gray optics with no special training.

Proactive metric polling [37] for the PAM4 and ICO transceivers in the layer-3 switches is performed using Simple Network Management Protocol with industry standard interface, sensor, and inventory management information bases as standardized in the SFF. Metrics for the DCI and long-haul OLSs are polled via REST API every few minutes and stored in a central database. Alerts and events are Syslog streamed over User Datagram Protocol and stored in the same central database as the performance metrics. All data is converted to a vendor-agnostic schema and classification before arriving in the central database. The combination of metrics and alerts drives our vendor-agnostic, in-house network management system, giving visibility to the entire networking stack. During operation, if a link fails due to cabling or outside-plant fiber loss, real-time power levels, BER performance, and port status highlighting probable cause can be polled at subminute intervals and presented to data center technicians, and troubleshooting can be performed without optical subject matter expertise. Returning to Fig. 16, an example of a problematic link being acted upon can be seen under “task history.” The figure shows, in succession, a ping test through the link in question, a query on physical device metrics (transmit/receive powers, SNR, inter-symbol interference, frequency error), and subsequent engagement with our data center technicians, all within a short time period.
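The conversion of polled data to a vendor-agnostic schema can be sketched as a set of per-vendor normalizers feeding one common record type. Everything named here is hypothetical: the vendors `vendor_a` and `vendor_b`, their field names, and the `PortMetric` schema are invented for illustration and do not describe any real device API or the production schema.

```python
import math
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class PortMetric:
    """Hypothetical vendor-agnostic record stored in the central database."""
    device: str
    port: str
    rx_power_dbm: float
    pre_fec_ber: float


def _from_vendor_a(raw: Dict[str, Any]) -> PortMetric:
    # Hypothetical vendor A already reports receive power in dBm.
    return PortMetric(raw["host"], raw["ifName"],
                      float(raw["rxPowerDbm"]), float(raw["preFecBer"]))


def _from_vendor_b(raw: Dict[str, Any]) -> PortMetric:
    # Hypothetical vendor B reports receive power in milliwatts.
    rx_dbm = 10.0 * math.log10(float(raw["rx_mw"]))
    return PortMetric(raw["node"], raw["port"], rx_dbm, float(raw["ber"]))


NORMALIZERS: Dict[str, Callable[[Dict[str, Any]], PortMetric]] = {
    "vendor_a": _from_vendor_a,
    "vendor_b": _from_vendor_b,
}


def normalize(vendor: str, raw: Dict[str, Any]) -> PortMetric:
    """Convert one vendor-specific sample into the common schema."""
    try:
        convert = NORMALIZERS[vendor]
    except KeyError:
        raise ValueError(f"no normalizer registered for {vendor!r}") from None
    return convert(raw)
```

With this shape, supporting a new supplier means adding one normalizer function, while everything downstream of `normalize` keeps reading the same `PortMetric` fields.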

Proactive polling of metrics enables auto-fault remediation [31–33] for known optical failures, which reduces overall customer outages and decreases the need for urgent network maintenance. Upon receiving a network alert, a workflow is automatically triggered, and the necessary mitigation steps are started. The polled metrics give us much greater resolution and history of a given issue than the information typically available on individual devices. Improvements are being made to enable streaming telemetry of critical metrics from devices to our control plane after initial configuration; that is, streaming telemetry is being embraced to fully facilitate just-in-time diagnostics and to give us even greater resolution of data [28,37].

In practice, we achieve deployment velocity exceeding 1000 DWDM ports per week with only one skilled resource performing final acceptance. To further improve velocity and reliability, it is necessary to further eliminate optical system complexities. In the switching layer, we have driven the development of SONiC [39], which puts an open-sourced operating system with simple APIs and data models on our network devices. The optical layer is further behind in this regard. To this end, we support OpenConfig [28,37,40,41] as an example of heading in the right direction. However, the true success of operating a cloud-scale global network lies in simplicity. Therefore, we will not compromise simplicity on either hardware or software, and we closely examine every feature we require on the devices and their data models. Allowing optical hardware to accept customer-defined data models and APIs through an industry-adopted means, clearly communicating the feature set and the API definition with a Swagger [42] specification, eliminates the ambiguity that is typical in a requirement list. This would allow the same configuration to be applied to hardware from different vendors without intermediary drivers. The ultimate goal is declarative state device modeling along with intent-based software design, where domain-specific complexities are abstracted, and central control defines goal states instead of in-between state configurations.
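To make the goal-state idea concrete, here is a minimal sketch in which central control declares the desired settings and a helper derives the changes needed from the observed device state. The function name and the setting keys (`admin_state`, `frequency_ghz`, `modulation`) are assumptions for illustration, not part of OpenConfig or any production data model.

```python
from typing import Any, Dict, List, Tuple

Change = Tuple[str, Any, Any]  # (setting, observed value, goal value)


def plan_changes(goal: Dict[str, Any], observed: Dict[str, Any]) -> List[Change]:
    """List every setting whose observed value differs from the goal state.

    Settings absent from the goal are left untouched; settings missing on
    the device are reported with an observed value of None.
    """
    changes: List[Change] = []
    for key in sorted(goal):
        if key not in observed or observed[key] != goal[key]:
            changes.append((key, observed.get(key), goal[key]))
    return changes


if __name__ == "__main__":
    goal = {"admin_state": "up", "frequency_ghz": 193100, "modulation": "16QAM"}
    observed = {"admin_state": "up", "frequency_ghz": 193100, "modulation": "8QAM"}
    for setting, current, wanted in plan_changes(goal, observed):
        print(f"{setting}: {current} -> {wanted}")
```

The caller states only the end state; the intermediate steps are derived rather than hand-written, which is the distinction the text draws between goal states and in-between state configurations.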

5. CONCLUSION

We have provided a survey of solutions which underpin Microsoft’s deployed DCI and long-haul infrastructure and have described the methods employed to operate these low-margin networks on a global scale, utilizing a combination of optimized architectures, streamlined optical platforms, and advanced SDN capabilities. For future challenges, we will be transitioning the 100G intra-region ecosystem to 400G, leveraging 400ZR [43] for DCI applications. In similar timeframes, we will be incorporating modernized and advanced line system technologies to support future WAN capacity growth. Across all fronts, we remain focused on driving simplicity of both hardware platforms and software models, APIs, and interfaces.

Acknowledgment

The authors would like to thank the entire optical team at Microsoft Azure, without whom this work would not be possible: Madhavan Subramanian, Jeetesh Jain, Subba Rao Sadineni, Brandon Raciborski, Luis Escobedo, Danny Thornton, Ryan Morgan, Rich Baca, Larry Kemp, Prerna Singhal, Tristan Struthers, Karthik Balasubramanian, Mark Wolf, and Kraig Owen. Thanks to colleagues in Microsoft Research, Francesca Parmigiani and Thomas Karagiannis, for helpful brainstorming sessions and reviewing text, and lastly to Azure Networking senior leadership for supporting this work.

REFERENCES

1. “What is DCI?” https://www.ciena.com/insights/what-is/What-is-DCI.html.

2. IEEE ComSoc, “Data center interconnect,” http://techblog.comsoc.org/category/data-center-interconnect/.

3. AWS re:Invent, “Enterprise fundamentals: design your account and VPC architecture for enterprise operating models,” 2016, https://www.slideshare.net/AmazonWebServices/aws-reinvent-2016-enterprise-fundamentals-design-your-account-and-vpc-architecture-for-enterprise-operating-models-ent203.

4. “Inphi offers 100G PAM4 QSFP28 for 80-km data center interconnect,” https://www.lightwaveonline.com/articles/2016/03/inphi-offers-100g-wavelength-pam4-qsfp28-for-80-km-data-center-interconnect.html.

5. M. Filer, J. Gaudette, M. Ghobadi, R. Mahajan, T. Issenhuth, B. Klinkers, and J. Cox, “Elastic optical networking in the Microsoft cloud [Invited],” J. Opt. Commun. Netw. 8, A45–A54 (2016).

6. M. Filer, H. Chaouch, and X. Wu, “Toward transport ecosystem interoperability enabled by vendor-diverse coherent optical sources over an open line system,” J. Opt. Commun. Netw. 10, A216–A224 (2018).

7. M. Filer and H. Chaouch, “Transmission performance of layer-2/3 modular switch with mQAM coherent ASIC and CFP2-ACOs over flex-grid OLS with 104 channels spaced 37.5 GHz,” in Optical Fiber Communication Conference, OSA Technical Digest (online) (Optical Society of America, 2017), paper Th1D.2.

8. “How Microsoft builds its fast and reliable global network,” https://azure.microsoft.com/en-us/blog/how-microsoft-builds-its-fast-and-reliable-global-network/.

9. Packet Design, “BGP in the data center: part two,” https://www.packetdesign.com/blog/clos-architecture-in-the-data-center/.

10. A. Greenberg, “SDN for the cloud,” keynote talk at SigComm, Aug. 2015.

11. “Overview of availability zones in Azure,” https://docs.microsoft.com/en-us/azure/availability-zones/az-overview.

12. “AWS global infrastructure,” https://aws.amazon.com/about-aws/global-infrastructure/.

13. F. Dikbiyik, M. Tornatore, and B. Mukherjee, “Minimizing the risk from disaster failures in optical backbone networks,” J. Lightwave Technol. 32, 3175–3183 (2014).

14. S. Ferdousi, F. Dikbiyik, M. F. Habib, M. Tornatore, and B. Mukherjee, “Disaster-aware datacenter placement and dynamic content management in cloud networks,” J. Opt. Commun. Netw. 7, 681–694 (2015).

15. ACG Research, “Inphi’s ColorZ-Lite Technology: offering an innovative solution for Nx100 G campus connectivity,” https://www.acgcc.com/inphis-colorz-lite-technology-offering-an-innovative-solution-for-nx100-g-campus-connectivity/.

16. R. Nagarajan, M. Filer, Y. Fu, M. Kato, T. Rope, and J. Stewart, “Silicon photonics-based 100 Gbit/s, PAM4, DWDM data center interconnects,” J. Opt. Commun. Netw. 10, B25–B36 (2018).

17. M. Filer, S. Searcy, Y. Fu, R. Nagarajan, and S. Tibuleac, “Demonstration and performance analysis of 4 Tb/s DWDM metro-DCI system with 100G PAM4 QSFP28 modules,” in Optical Fiber Communication Conference, OSA Technical Digest (online) (Optical Society of America, 2017), paper W4D.4.

18. M. Filer, M. Cantono, A. Ferrari, G. Grammel, G. Galimberti, and V. Curri, “Multi-vendor experimental validation of an open source QoT estimator for optical networks,” J. Lightwave Technol. 36, 3073–3082 (2018).

19. V. Kamalov, V. Dangui, T. Hofmeister, B. Koley, C. Mitchell, M. Newland, J. O’Shea, C. Tomblin, V. Vusirikala, and X. Zhao, “Lessons learned from open line system deployments,” in Optical Fiber Communications Conference and Exhibition (OFC), Los Angeles, California, 2017, pp. 1–3.

20. V. Vusirikala, X. Zhao, T. Hofmeister, V. Kamalov, V. Dangui, and B. Koley, “Scalable and flexible transport networks for inter-datacenter connectivity,” in Optical Fiber Communications Conference and Exhibition (OFC), Los Angeles, California, 2015, pp. 1–3.

21. H. Chaouch, M. Filer, and A. Bechtolsheim, “Lessons learned from CFP2-ACO system integrations, interoperability testing and deployments,” in Optical Fiber Communication Conference, OSA Technical Digest (online) (Optical Society of America, 2017), paper Th1D.4.

22. Y. Loussouarn, E. Pincemin, Y. Pan, G. Miller, A. Gibbemeyer, and B. Mikkelsen, “Silicon photonic multi-rate DCO-CFP2 interface for DCI, metro, and long-haul optical communications,” in Optical Fiber Communication Conference, OSA Technical Digest (online) (Optical Society of America, 2018), paper M1E.5.

23. P. Poggiolini, “The GN model of non-linear propagation in uncompensated coherent optical systems,” J. Lightwave Technol. 30, 3857–3879 (2012).

24. A. Carena, G. Bosco, V. Curri, Y. Jiang, P. Poggiolini, and F. Forghieri, “EGN model of non-linear fiber propagation,” Opt. Express 22, 16335–16362 (2014).

25. D. J. Elson, G. Saavedra, K. Shi, D. Semrau, L. Galdino, R. Killey, B. C. Thomsen, and P. Bayvel, “Investigation of bandwidth loading in optical fibre transmission using amplified spontaneous emission noise,” Opt. Express 25, 19529–19537 (2017).

26. T. Richter, J. Pan, and S. Tibuleac, “Comparison of WDM bandwidth loading using individual transponders, shaped, and flat ASE noise,” in Optical Fiber Communications Conference and Exposition (OFC), San Diego, California, 2018, pp. 1–3.

27. M. Cantono, A. Ferrari, D. Pilori, E. Virgillito, J. L. Augé, and V. Curri, “Physical layer performance of multi-band optical line systems using Raman amplification,” J. Opt. Commun. Netw. 11, A103–A110 (2019).

28. E. Breverman, N. El-Sakkary, T. Hofmeister, S. Ngai, A. Shaikh, and V. Vusirikala, “Optical zero touch networking—a large operator perspective,” in Optical Fiber Communications Conference and Exhibition (OFC), San Diego, California, 2019, pp. 1–3.

29. PSM4 MSA, http://psm4.org/.

30. CWDM4 MSA, http://www.cwdm4-msa.org/.

31. D. Rafique, T. Szyrkowiec, H. Grießer, A. Autenrieth, and J.-P. Elbers, “Cognitive assurance architecture for optical network fault management,” J. Lightwave Technol. 36, 1443–1450 (2018).

32. S. Shahkarami, F. Musumeci, F. Cugini, and M. Tornatore, “Machine-learning-based soft-failure detection and identification in optical networks,” in Optical Fiber Communication Conference (Optical Society of America, 2018), paper M3A.5.

33. A. P. Vela, M. Ruiz, F. Fresi, N. Sambo, F. Cugini, G. Meloni, L. Potì, L. Velasco, and P. Castoldi, “BER degradation detection and failure identification in elastic optical networks,” J. Lightwave Technol. 35, 4595–4604 (2017).

34. Y. Xiong, Y. Li, B. Zhou, R. Wang, and G. N. Rouskas, “SDN enabled restoration with triggered precomputation in elastic optical inter-datacenter networks,” J. Opt. Commun. Netw. 10, 24–34 (2018).

35. M. Dzanko, M. Furdek, G. Zervas, and D. Simeonidou, “Evaluating availability of optical networks based on self-healing network function programmable ROADMs,” J. Opt. Commun. Netw. 6, 974–987 (2014).

36. R. Casellas, R. Martínez, R. Vilalta, and R. Muñoz, “Control, management, and orchestration of optical networks: evolution, trends, and challenges,” J. Lightwave Technol. 36, 1390–1402 (2018).

37. E. Breverman, N. El-Sakkary, T. Hofmeister, A. Shaikh, and V. Vusirikala, “Optical network control & management plane evolution—a large datacenter operator perspective,” in Optical Fiber Communications Conference and Exhibition (OFC), San Diego, California, 2019, pp. 1–4.

38. SNIA, “SFF specifications,” https://www.snia.org/technology-communities/sff/specifications.

39. GitHub, “What is SONiC?” https://azure.github.io/SONiC/.

40. OpenConfig, http://www.openconfig.net/.

41. E. Breverman, N. El-Sakkary, T. Hofmeister, A. Shaikh, and V. Vusirikala, “Data models for optical devices in data center operator networks,” in Optical Fiber Communications Conference and Exhibition (OFC), San Diego, California, 2019, pp. 1–3.

42. Swagger, https://swagger.io/.

43. “OIF 400ZR,” https://www.oiforum.com/technical-work/hot-topics/400zr-2.

Table 1. Global Fiber Specifications

Parameter	DCI Standard	LH Standard	Notes
Start-of-life loss at 1550 nm	$\leq 0.25$ dB/km	$\leq 0.22$ dB/km	As tested end-to-end, inclusive of all splices and connectors
End-of-life loss at 1550 nm	$\leq 0.28$ dB/km	$\leq 0.25$ dB/km	Inclusive of all splices and connectors
Splice loss	$\leq 0.10$ dB	$\leq 0.08$ dB	Per splice, bidirectional average as measured on an optical time-domain reflectometer
Connector loss	$\leq 0.25$ dB each	$\leq 0.25$ dB each	Per connector. No intermediate connectors. Connectors at carrier demarcation only.
Reflections	$\leq -45$ dB	$\leq -45$ dB
Polarization mode dispersion	$\leq 0.2\ \mathrm{ps}/\sqrt{\mathrm{km}}$	$\leq 0.2\ \mathrm{ps}/\sqrt{\mathrm{km}}$
