research-article

Achieving High Availability in Inter-DC WAN Traffic Engineering

Published: 12 December 2022

Abstract

The Inter-DataCenter Wide Area Network (inter-DC WAN), which connects geographically distributed data centers, is becoming one of the most critical network infrastructures. Due to limited bandwidth and inevitable link failures, it is highly challenging to guarantee network availability for services over the inter-DC WAN, especially those with stringent bandwidth demands. We present $\mathsf{TEDAT}$, a novel Traffic Engineering (TE) framework for Diverse Availability Targets (DAT), in which a Service Level Agreement (SLA) ensures that each bandwidth demand is satisfied with a stipulated probability, subject to the network capacity and possible failures of the inter-DC WAN. $\mathsf{TEDAT}$ has two core components, traffic scheduling and failure recovery, which are formulated as different mathematical models and theoretically analyzed. They are also extensively compared against state-of-the-art TE schemes, using a testbed as well as real-trace-driven simulations across different topologies, traffic matrices, and failure scenarios. Our evaluations show that, compared with the optimal admission strategy, $\mathsf{TEDAT}$ can speed up online admission control by $30\times$ at the expense of less than 4% false rejections. Compared with recent TE schemes such as FFC and TEAVAR, $\mathsf{TEDAT}$ can meet the bandwidth availability SLAs for 23% to 60% more demands under normal loads, and when network failures cause SLA violations, it can retain 10% to 20% more profit under a pricing and refunding model.
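To make the notion of a bandwidth availability SLA concrete, the following minimal Python sketch computes the probability that a single demand stays fully satisfied under link failures and compares it with a target. It is an illustration only, not the model used by the framework: it assumes independent link failures, brute-force enumeration of failure scenarios, and a fixed per-tunnel allocation. The function names and the tunnel layout are hypothetical.

```python
from itertools import product


def demand_availability(demand, tunnels, link_fail_prob):
    """Probability that the surviving tunnels still carry at least `demand`.

    tunnels: list of (allocated_bandwidth, set_of_links) pairs (assumed layout).
    link_fail_prob: dict mapping link name -> independent failure probability.
    """
    links = sorted(link_fail_prob)
    availability = 0.0
    # Enumerate every up/down combination of links (only viable for tiny examples).
    for states in product([True, False], repeat=len(links)):
        up = dict(zip(links, states))
        # Probability of this exact scenario under the independence assumption.
        prob = 1.0
        for link in links:
            p_fail = link_fail_prob[link]
            prob *= (1.0 - p_fail) if up[link] else p_fail
        # A tunnel delivers its allocation only if all of its links are up.
        delivered = sum(bw for bw, path in tunnels if all(up[l] for l in path))
        if delivered >= demand:
            availability += prob
    return availability


def meets_sla(demand, tunnels, link_fail_prob, target):
    """True if the demand is satisfied with probability at least `target`."""
    return demand_availability(demand, tunnels, link_fail_prob) >= target


# Hypothetical example: two links with independent failure probabilities.
fail = {"a": 0.01, "b": 0.02}
serial = [(10, {"a", "b"})]              # one tunnel crossing both links
parallel = [(10, {"a"}), (10, {"b"})]    # two disjoint tunnels, each sufficient

print(meets_sla(10, serial, fail, target=0.999))    # one path: target missed
print(meets_sla(10, parallel, fail, target=0.999))  # disjoint backup: target met
```

In this toy setting the single two-link tunnel falls short of a 99.9% target, while provisioning a second disjoint tunnel meets it, which shows why demands with different availability targets may call for different allocations.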


Published In

IEEE/ACM Transactions on Networking, Volume 31, Issue 6
Dec. 2023
894 pages

Publisher

IEEE Press
