Survivable resource orchestration for optically interconnected data center networks

Qiong Zhang; Qingya She; Yi Zhu; Xi Wang; Paparao Palacharla; Motoyoshi Sekiya

doi:10.1364/OE.22.000023

1. Introduction

As more applications and workloads move to the cloud, geographically distributed data centers (DCs) are being deployed across optical networks. Cloud applications rely on distributed DCs for improved user experience [1, 2]. However, cloud providers may not own optical network infrastructure and instead rely on network providers to optically interconnect distributed DCs. One example is the combination of IBM SmartCloud and AT&T virtual private networking for global cloud services. Another is the recently announced alliance of Microsoft Azure and AT&T virtual private networking, which provides more secure and reliable connectivity for enterprise customers. Network providers are usually unwilling to expose their full network topology to cloud providers. Hence, it is critical to investigate an overlay framework that enables cloud providers to control cloud network connections and optimize resource orchestration without detailed network information.

Many cloud applications in distributed DCs are arranged in an aggregation communication pattern [3], whereby an aggregation DC (DC_a) collects data processed at distributed DCs and outputs final results to users, as shown in Fig. 1(a). By collecting results from dispersed virtual machines (VMs) at an aggregation DC, cloud applications can make physically dispersed VMs operate logically as one DC. Applications such as cloud search and data backup can allocate VMs close to data stored in distributed DCs and provide results at an aggregation DC for access by users. More complicated communication patterns can be composed by scheduling a sequence of data aggregations [3, 4], as shown in Fig. 1(b).

Fig. 1 Communication patterns among distributed DCs. (a) Aggregation. (b) A sequence of aggregations.


Because cloud applications rely on distributed DCs and aggregation DCs, survivability becomes an important issue. K-connect survivability requires that at least K DCs (out of M original working DCs) remain connected to an aggregation DC after any failure. When a failure occurs, additional VMs can be allocated at the surviving K DCs to maintain resource availability, taking advantage of the mobility of VMs. We consider a single shared risk group (SRG) failure, which can result in multiple failures at DC sites and in networks (due to fiber cuts, power outages, natural disasters, etc.). We assume that DC_a cannot fail; otherwise, a new cloud request should be initiated.

In this paper, we present an overlay framework that interconnects distributed DCs by virtualized optical networks. From a cloud provider's point of view, the overlay network consists of multiple geographically distributed DCs, as well as direct network connections between DCs. The cloud provider does not have complete details of the physical network topology, but obtains some characteristics of the overlay network connections, such as bandwidth, SRGs, and delay, from the network provider. We also propose survivable resource orchestration schemes for cloud providers to allocate resources for a set of cloud requests with dynamic arrival times on overlay networks. Our proposed schemes provision the fewest working DCs to guarantee K-connect survivability for a cloud request. In addition, we propose VM allocation schemes to reduce the total number of VMs required on overlay networks.

Previous research [5] addressed reliable anycast or manycast by disjoint routing on physical network topologies. In [6], the authors investigated resource orchestration on optical networks considering both VM resources and physical optical network topologies in order to improve resource utilization in optical networks. Our resource orchestration schemes differ from existing research in that we provision the fewest working DCs based on overlay network information, where the physical network topology may be unavailable and routing for connections may not be possible. An earlier version of this paper appeared in [7].

2. Framework for resource orchestration

An overlay framework based on optical network virtualization is presented in Fig. 2. The overlay network comprises point-to-point connections between DCs, as well as VMs in DCs. Connection bandwidth can be adjusted by optical network virtualization technologies [2, 8] to accommodate the dynamism of traffic between DCs. The underlying optical network can adopt optical transport network (OTN) technology or flexible optical data planes (e.g., flexible transceivers) to adjust connection bandwidth. The approaches proposed in [6] can be applied to allocate resources on the optical network to overlay networks.

Fig. 2 An overlay framework for distributed DCs.


Cloud providers have a centralized controller that manages DCs interconnected by the overlay network. The controller can obtain connection information, such as bandwidth, SRGs, and delay, and can request connection bandwidth through network application programming interfaces (APIs) provided by software defined networks (SDN) [2, 8]. The controller receives cloud requests and performs resource orchestration for each cloud request. With the arrival of new cloud requests, additional VMs are allocated at the DCs, thereby increasing the required bandwidth of the network connections. The controller can request a bandwidth increase for connections via the network API. Such a framework spares network providers from exposing physical network topologies, while allowing cloud providers to easily set up cloud services, perform resource orchestration, and flexibly adjust connection bandwidth, without considering intermediate network devices along the connection paths.

A basic aggregation request is shown in Fig. 3(a). Complicated requests can be generated using a combination of basic requests [3, 4]. Each request must satisfy K-connect survivability. Figures 3(b) and 3(c) present two approaches that guarantee 2-connect survivability for any failure. By jointly considering failures at network connections and DCs, network resources can be saved. In Fig. 3(b), where s_i indicates SRG i, network connections are blindly protected by providing disjoint paths (dotted lines), and 2-connect survivability is guaranteed by protecting against failures at DCs separately from network connections. In Fig. 3(c), with SRG information on overlay networks, 2-connect survivability is instead guaranteed by finding connections and DCs that can be jointly protected, thereby allowing for significant savings in network resources (a saving of three protection connections compared to Fig. 3(b)).

Fig. 3 Cloud requests. (a) Basic request. (b) 2-connect separate protection. (c) 2-connect by joint protection.


When multiple subsets of DCs satisfy K-connect survivability, the subset with minimum delay can be chosen. The delay of a request is the total delay of the connections between the subset of DCs and the aggregation DC (DC_a). DC_a can be allocated to a DC that is relatively close to users or relatively close to a particular subset of DCs, depending on the application.
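
As a minimal sketch of this selection rule, the Python snippet below computes the delay of a request as the sum of the connection delays to DC_a and picks the minimum-delay subset among candidates that are assumed to satisfy K-connect survivability already. The function names, the dictionary layout, and the delay values are illustrative assumptions and do not come from the paper.

```python
def request_delay(delay_to_agg, subset):
    """Total delay of a request: sum of d_aj over the DCs in the subset."""
    return sum(delay_to_agg[j] for j in subset)


def min_delay_subset(delay_to_agg, feasible_subsets):
    """Pick the feasible subset with the smallest total delay."""
    return min(feasible_subsets, key=lambda s: request_delay(delay_to_agg, s))


# Hypothetical delays d_aj (arbitrary units) from DC_a to DC_2, DC_3, DC_4.
delays = {2: 5.0, 3: 8.0, 4: 6.0}
# Hypothetical subsets assumed to satisfy K-connect survivability already.
candidates = [(2, 3), (2, 4), (3, 4)]

best = min_delay_subset(delays, candidates)
print(best, request_delay(delays, best))  # (2, 4) 11.0
```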

3. K-connect survivability problem description

Given: An overlay network with N DC sites and a set of L SRGs S = {s₁,s₂,…,s_l,…,s_L}. In the overlay network, each connection E_ij between DC_i and DC_j (i ≠ j) has network information including delay, d_ij, and a vector of associated SRGs {α_ij1,α_ij2,…,α_ijl,…,α_ijL}, where α_ijl = 1 indicates that s_l is associated with E_ij; otherwise α_ijl = 0. Similarly, each DC_i is associated with a set of SRGs. We are also given a request that requires K DCs to remain connected to an aggregation DC_a for any failure. Joint protection is considered.

Find: the least number M of working DCs such that

1. min ∑ d_aj, where 1 ≤ j ≤ M, which minimizes the total delay of a request; and
2. at least K DCs remain connected to DC_a for any failure, which guarantees K-connect survivability.

A cloud request with fewer working DCs requires fewer network connections and may have lower operational cost. In addition, finding the least M working DCs that satisfy K-connect survivability can result in the fewest VMs required for a request (discussed in Section 5). We can prove that the K-connect survivability problem is NP-complete by reducing the well-known weighted set cover problem to our problem.
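
The survivability constraint in the problem statement can be checked directly once the joint risk set of each working DC is known. The sketch below is one way to do so in Python, assuming a single SRG failure disconnects every DC whose joint risk set (DC risks merged with the risks of its connection to DC_a) contains that SRG. The function names and the example risk sets are hypothetical.

```python
def survivors(pair_risks, chosen, failed_srg):
    """DCs still connected to DC_a when one SRG fails."""
    return [p for p in chosen if failed_srg not in pair_risks[p]]


def is_k_connect(pair_risks, chosen, all_srgs, K):
    """True if at least K chosen DCs survive every single-SRG failure."""
    return all(len(survivors(pair_risks, chosen, s)) >= K for s in all_srgs)


# Hypothetical joint risk sets: each entry merges the SRGs of DC_j with the
# SRGs of its connection to DC_a (joint protection).
pair_risks = {
    "DC2": {"s1", "s2"},
    "DC3": {"s3"},
    "DC4": {"s2", "s4"},
}
srgs = ["s1", "s2", "s3", "s4"]
chosen = ["DC2", "DC3", "DC4"]

print(is_k_connect(pair_risks, chosen, srgs, K=1))  # True
print(is_k_connect(pair_risks, chosen, srgs, K=2))  # False
```

In this toy instance a failure of s2 disconnects both DC2 and DC4, so only one DC survives and 2-connect survivability does not hold with these three working DCs.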

4. Proposed solutions

Two novel heuristic algorithms are proposed for solving the K-connect survivability problem in optically interconnected distributed DC networks. In both algorithms, a matrix is constructed for each aggregation DC_a. Each DC_j and its corresponding connection to DC_a (notated as p_aj) are paired and have a set of associated risks, which is the union of risks associated with both DC_j and its corresponding connection to DC_a. For each s_l, the matrix records 1 if p_aj is associated with s_l. Table 1 shows a matrix constructed for DC₁ in Fig. 4. Another matrix (#p_l) records the total number of currently chosen DCs that are associated with s_l. For example, in Table 1, when p₁₂ and p₁₄ are chosen, #p_l = {1, 2, 0, 2}.
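
The two matrices can be illustrated with a short Python sketch that uses the entries of Table 1. The data structures and function names are assumptions made for illustration; only the risk associations and the resulting #p_l values come from the text.

```python
SRGS = ["s1", "s2", "s3", "s4"]

# Risk sets of each pair p_1j as listed in Table 1 (DC_1 is the aggregation DC).
PAIR_RISKS = {
    "p12": {"s1", "s2", "s4"},
    "p13": {"s1", "s3"},
    "p14": {"s2", "s4"},
}


def risk_matrix(pair_risks, srgs):
    """Rows are SRGs, columns are pairs; an entry is 1 if the pair has the risk."""
    return [[int(s in pair_risks[p]) for p in pair_risks] for s in srgs]


def chosen_counts(pair_risks, srgs, chosen):
    """#p_l: number of currently chosen pairs associated with each SRG s_l."""
    return [sum(s in pair_risks[p] for p in chosen) for s in srgs]


for srg, row in zip(SRGS, risk_matrix(PAIR_RISKS, SRGS)):
    print(srg, row)
print(chosen_counts(PAIR_RISKS, SRGS, ["p12", "p14"]))  # [1, 2, 0, 2]
```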

Table 1. Matrices for DC₁


Fig. 4 Overlay network.


DelayBased: This scheme selects DCs in increasing order of connection delay. To obtain the least M working DCs, M is incremented from K + 1 (the smallest possible value) to N (the largest possible value). For each M, a DC is selected if and only if, for every risk, the number of selected DCs associated with that risk does not exceed (M - K), which guarantees K-connect survivability. That is, with DC_a as the aggregation DC, #p_l is incremented by 1 whenever a p_aj is chosen and α_ajl = 1, and a p_aj can be chosen if and only if #p_l ≤ (M - K) still holds after the increment for all s_l with α_ajl = 1. Once M DCs are found, M is no longer incremented. The value of M at which the search stops is the fewest number of working DCs that satisfies K-connect survivability.
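
The DelayBased steps can be sketched in Python as follows. This is an illustrative reading of the description rather than the authors' implementation: the function name, the input layout, and the delay values are hypothetical, N is taken to be the number of candidate DCs for the given DC_a, and ties in delay are broken by input order. The risk sets follow Table 1.

```python
def delay_based(pairs, K):
    """pairs maps a pair name to (delay to DC_a, set of associated SRGs).

    Returns the chosen pairs for the smallest M that works, or None if the
    request is blocked.
    """
    order = sorted(pairs, key=lambda p: pairs[p][0])  # increasing delay
    N = len(order)  # assumption: N counts the candidate DCs for this DC_a
    for M in range(K + 1, N + 1):
        counts = {}  # #p_l for the current value of M
        chosen = []
        for p in order:
            risks = pairs[p][1]
            # Choosing p must keep every affected #p_l at or below M - K.
            if all(counts.get(s, 0) + 1 <= M - K for s in risks):
                chosen.append(p)
                for s in risks:
                    counts[s] = counts.get(s, 0) + 1
                if len(chosen) == M:
                    return chosen
    return None


# Risk sets follow Table 1; the delays are hypothetical.
pairs = {
    "p12": (1.0, {"s1", "s2", "s4"}),
    "p13": (2.0, {"s1", "s3"}),
    "p14": (3.0, {"s2", "s4"}),
}
print(delay_based(pairs, K=1))  # ['p12', 'p13', 'p14']
print(delay_based(pairs, K=2))  # None
```

Because no risk is shared by more than M - K of the M chosen DCs, any single SRG failure leaves at least K of them connected. In this toy instance the low-delay pair p12 carries three risks, so the search only succeeds at M = 3 for K = 1.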

RiskBased: In the DelayBased scheme, it is possible that DCs chosen earlier are associated with many risks, so that more working DCs are required. Hence, RiskBased selects DCs in increasing order of the total frequency of their associated risks. The frequency of a risk is defined as the number of DCs that are associated with the risk. The other steps are similar to DelayBased.
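
A hedged Python sketch of the RiskBased ordering is given below. It assumes that the frequency of a risk is counted over the joint risk sets of the candidate pairs p_aj, and that the selection loop is otherwise the incremental one described for DelayBased. All function names are hypothetical, and the risk sets follow Table 1.

```python
def risk_frequency(pair_risks):
    """Frequency of each SRG: how many candidate pairs are associated with it."""
    freq = {}
    for risks in pair_risks.values():
        for s in risks:
            freq[s] = freq.get(s, 0) + 1
    return freq


def risk_based_order(pair_risks):
    """Pairs sorted by increasing total frequency of their associated risks."""
    freq = risk_frequency(pair_risks)
    return sorted(pair_risks, key=lambda p: sum(freq[s] for s in pair_risks[p]))


def select(pair_risks, order, K):
    """Incremental selection over M = K + 1 .. N, applied to a given order."""
    for M in range(K + 1, len(order) + 1):
        counts, chosen = {}, []
        for p in order:
            if all(counts.get(s, 0) + 1 <= M - K for s in pair_risks[p]):
                chosen.append(p)
                for s in pair_risks[p]:
                    counts[s] = counts.get(s, 0) + 1
                if len(chosen) == M:
                    return chosen
    return None


# Risk sets follow Table 1.
pair_risks = {
    "p12": {"s1", "s2", "s4"},
    "p13": {"s1", "s3"},
    "p14": {"s2", "s4"},
}
order = risk_based_order(pair_risks)
print(order)                           # ['p13', 'p14', 'p12']
print(select(pair_risks, order, K=1))  # ['p13', 'p14']
```

On these risk sets, an order that starts with p12 needs all three DCs for K = 1, whereas the risk-frequency order finds a solution with two, because p13 and p14 share no risk.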

5. VM allocation on an overlay network

We now discuss VM allocation after a request satisfies K-connect survivability. To reduce the total number of VMs required on an overlay network, VMs can be shared in two steps. The first step, named intra-request VM sharing, shares VMs between the DCs selected within a request, taking advantage of K-connect survivability. As shown in Figs. 5(a) and 5(b), each request is given a DC_a, K, and V (the number of VMs that must be maintained under any failure). Intra-request VM sharing allocates the same number of VMs (V/K) at each of the M selected working DCs, so that any K surviving DCs together maintain the required total of V VMs. Hence, the total number of VMs required for a request is M·V/K. We can see that finding the least M working DCs results in the fewest VMs required for a request.
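
The arithmetic of intra-request VM sharing can be sketched as below. The request values are hypothetical, and rounding V/K up when K does not divide V is an assumption added here (in the simulations of Section 6, V is a multiple of K).

```python
import math


def intra_request_vms(M, K, V):
    """Return (VMs per working DC, total VMs) under intra-request sharing."""
    per_dc = math.ceil(V / K)  # assumption: round up when K does not divide V
    return per_dc, M * per_dc


# Hypothetical request: V = 30 VMs must survive any failure, K = 2.
for M in (3, 4):
    per_dc, total = intra_request_vms(M, K=2, V=30)
    print(M, per_dc, total)
# 3 15 45
# 4 15 60
```

Any K surviving DCs hold K times the per-DC allocation, which is at least V, and the total grows linearly with M, which is why a smaller M means fewer VMs.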

Fig. 5 (a) Request A. (b) Request B. (c) VM allocation for requests on an overlay network.


The next step, named inter-request VM sharing, shares VMs between cloud requests based on their SRG risks. Figure 5(c) presents a scenario in which Request A and Request B share VMs at DC₁. For any failure at DC₁ and its corresponding connection to DC₄, VM allocation at the surviving DCs is able to satisfy the required number of VMs in each request, since the surviving DCs of Request A are risk-disjoint from the surviving DCs of Request B. Hence, the total VM allocation at DC₁ is the maximum of the VM allocations required for the two requests, which is 15, a saving of 10 VMs compared to the case without inter-request VM sharing.
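
The saving at a shared DC can be reproduced with a small Python sketch. The per-request figures of 15 and 10 VMs are inferred from the stated maximum (15) and saving (10); which request needs which amount is an assumption, as are the function name and the `shared` flag. The check that the sharing condition actually holds at the DC is not modeled here.

```python
def vms_at_dc(per_request_vms, shared):
    """VMs to provision at one DC for the requests placed there.

    shared=True assumes the inter-request sharing condition holds at this DC.
    """
    values = list(per_request_vms.values())
    return max(values) if shared else sum(values)


# Per-request allocations at DC_1, inferred from the text (maximum 15, saving 10).
# The assignment of 15 to Request A and 10 to Request B is an assumption.
at_dc1 = {"A": 15, "B": 10}

with_sharing = vms_at_dc(at_dc1, shared=True)
without_sharing = vms_at_dc(at_dc1, shared=False)
print(with_sharing, without_sharing - with_sharing)  # 15 10
```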

6. Simulation results

We simulated K-connect survivability on the 75-node CORONET [9], with each node having a DC. Overlay networks are generated with randomly chosen DCs in CORONET. The minimum delay paths are used for connections between DCs. Dynamic requests are generated by assigning aggregation DCs to each DC in the generated overlay networks until 10⁵ requests are successfully allocated. Requests stay in overlay networks once they are allocated resources. The number of VMs required for each request is the product of K and a random number between 1 and 100. We assume that an arbitrary amount of bandwidth and VMs can be requested from the underlying optical infrastructure. The total number of risks in the optical network is sixty. Each physical link or DC is associated with R randomly chosen SRG risks (R = 2 or 3).
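
Two of these setup steps, the random SRG assignment and the VM demand of a request, can be sketched in Python as follows. The sketch assumes that the R risks of an element are distinct and that the random multiplier is a uniform integer in [1, 100]; neither detail is stated in the text, and the element names are placeholders rather than CORONET data.

```python
import random


def assign_risks(elements, total_risks=60, R=2, rng=random):
    """Give every physical link or DC R distinct, randomly chosen SRG ids."""
    return {e: set(rng.sample(range(total_risks), R)) for e in elements}


def request_vms(K, rng=random):
    """V = K times a random integer in [1, 100]."""
    return K * rng.randint(1, 100)


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
risks = assign_risks(["link-1", "link-2", "DC-1"], R=3, rng=rng)
V = request_vms(K=4, rng=rng)
print(risks, V)
```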

Figure 6 compares the least M working DCs required and the average delay of requests as K increases. The total number of DCs (N) in an overlay network is 10. Figure 6(a) shows that RiskBased requires up to 12% fewer working DCs than DelayBased. The least number of working DCs increases to satisfy the increasing K-connect constraint. When K = 6 (or 5) and R = 2 (or 3), the least number of working DCs is close to the total number of DCs, N. Figure 6(b) shows that, as K increases, the average request delay increases because more working DCs are required, and the difference in delay between the two schemes shrinks. When K ≤ 5 (or 4) and R = 2 (or 3), RiskBased results in longer delay than DelayBased, even with fewer working DCs, since a connection with lower SRG frequency may have longer delay. When K = 6 (or 5) and R = 2 (or 3), the number of required working DCs is close to N and the choices of working DCs are limited, so both schemes perform similarly. Figure 6(c) shows that RiskBased and DelayBased have the same request blocking ratio, because both iteratively increment M until resource allocation for a request is successful. As K increases, the request blocking ratio rises since the K-connect constraint becomes more restrictive with N fixed at 10. When K is greater than 6 (or 5), the request blocking ratio is 100% for R = 2 (or 3). The request blocking ratio for R = 3 is higher than for R = 2 since higher R results in more failed connections on overlay networks, making it less likely to find a solution satisfying the K-connect constraint.

Fig. 6 Performance as K increases (N = 10). (a) Average of the least M required vs. K. (b) Average delay per request vs. K. (c) Request blocking ratio vs. K.


Figure 7 compares the least M and the average delay of requests as N increases. Here, K = 4. RiskBased requires fewer working DCs than DelayBased across the values of N considered. When R = 3 and N ≤ 12, higher N needs more working DCs since higher N results in more successful requests, each of which requires more working DCs. Both schemes have similar delay due to limited choices of working DCs. When N > 12, the least number of working DCs decreases as N increases since there are more than enough DCs to satisfy the K-connect constraint and solutions with fewer working DCs can be found. RiskBased results in longer delay since a connection with lower risk frequency may have longer delay. Figure 7(c) shows that, as N increases, the request blocking ratio decreases.

Fig. 7 Performance as N increases (K = 4). (a) Average of the least M required vs. N. (b) Average delay per request vs. N. (c) Request blocking ratio vs. N.


Figure 8(a) compares the average number of VMs required per request as K increases for the shared cases, which apply both intra-request and inter-request VM sharing. As K increases, the average number of VMs per request increases since the number of VMs required per request is proportional to K and the least M increases. The number of VMs required in RiskBased-Shared is very close to that in DelayBased-Shared due to the effectiveness of inter-request VM sharing. Figure 8(b) shows the percentage of VM reduction achieved by inter-request VM sharing, compared to the cases without inter-request VM sharing. Intra-request VM sharing is applied to all cases. Inter-request VM sharing reduces the number of VMs by up to 23%. DelayBased-Shared results in a higher percentage of VM reduction than RiskBased-Shared since, without inter-request VM sharing, DelayBased requires many more VMs than RiskBased because more DCs are selected (see Fig. 6(a)).

Fig. 8 VM allocation as K increases (N = 10). (a) Average VMs per request vs. K. (b) % of VM reduction vs. K.


Figure 9(a) compares the average number of VMs required per request as N increases for the shared cases. As N increases, the average number of VMs per request follows a trend similar to that of the least M in Fig. 7(a). In both Figs. 8(b) and 9(b), when there are enough DCs to satisfy the K-connect constraint (K ≤ 3 in Fig. 8(b) and N ≥ 12 in Fig. 9(b)), R = 3 results in a higher percentage of VM reduction than R = 2 because more VMs are required in the cases without inter-request VM sharing. When DCs are limited (K > 3 in Fig. 8(b) and N < 12 in Fig. 9(b)), the VM reduction by inter-request VM sharing for R = 3 is smaller than that for R = 2 since it is less likely to find a better solution with inter-request VM sharing for higher R.

Fig. 9 VM allocation as N increases (K = 4). (a) Average VMs per request vs. N. (b) % of VM reduction vs. N.


7. Conclusion

Resource orchestration schemes are proposed for provisioning the fewest data centers to guarantee K-connect survivability on virtualized optical overlay networks. RiskBased requires fewer working DCs, but longer request delay, compared to DelayBased. DelayBased is suitable for delay-sensitive cloud applications. Based on the K-connect survivability, intra-request and inter-request VM sharing are also proposed for reducing the total number of VMs required for overlay networks.

References and links

1. SCOPE Alliance, “Telecom grade cloud computing,” www.scope-alliance.org (2011).

2. J. He, “Software-defined transport network for cloud computing,” in Optical Fiber Communication Conference/National Fiber Optic Engineers Conference 2013, OSA Technical Digest (online) (Optical Society of America, 2013), paper OTh1H.6.

3. G. Wang, T. S. Eugene Ng, and A. Shaikh, “Programming your network at run-time for big data applications,” in Proceedings of the First Workshop on Hot Topics in Software Defined Networks (HotSDN '12) (ACM, 2012), pp. 103–108.

4. Y. Zhu and M. Ammar, “Algorithms for assigning substrate network resources to virtual network components,” 25th IEEE International Conference on Computer Communications. Proceedings, April 2006.

5. C. Develder, J. Buysse, A. Shaikh, B. Jaumard, M. De Leenheer, and B. Dhoedt, “Survivable optical grid dimensioning: anycast routing with server and network failure protection,” 2011 IEEE International Conference on Communications, 5–9 June 2011.

6. Q. Zhang, W. Xie, Q. She, X. Wang, P. Palacharla, and M. Sekiya, “RWA for network virtualization in optical WDM networks,” in Optical Fiber Communication Conference/National Fiber Optic Engineers Conference 2013, OSA Technical Digest (online) (Optical Society of America, 2013), paper JTh2A.65.

7. Q. Zhang, Q. She, Y. Zhu, X. Wang, P. Palacharla, and M. Sekiya, “Survivable resource orchestration for optically interconnected data center networks,” in 39th European Conference and Exhibition on Optical Communication (ECOC 2013), 22–26 Sept. 2013.

8. D. Simeonidou, R. Nejabati, and M. P. Channegowda, “Software-defined optical networks technology and infrastructure: enabling software-defined optical network operations [Invited],” J. Opt. Commun. Netw. 5(10), A274–A282 (2013).

9. A. L. Chiu, G. Choudhury, G. Clapp, R. Doverspike, J. W. Gannett, J. G. Klincewicz, G. Li, R. A. Skoog, J. Strand, A. Von Lehmen, and D. Xu, “Network design and architectures for highly dynamic next-generation IP-over-optical long distance networks,” J. Lightwave Technol. 27(12), 1878–1890 (2009).


Optics Express

α_ijl	p₁₂	p₁₃	p₁₄	#p_l
s₁	1	1	0	1
s₂	1	0	1	2
s₃	0	1	0	0
s₄	1	0	1	2