Evolve or die: High-availability design principles drawn from Google's network infrastructure
R Govindan, I Minei, M Kallahalla, B Koley, A Vahdat
Proceedings of the 2016 ACM SIGCOMM Conference, 2016 • dl.acm.org
Maintaining the highest levels of availability for content providers is challenging in the face of scale, network evolution and complexity. Little, however, is known about failures large content providers are susceptible to, and what mechanisms they employ to ensure high availability. From a detailed analysis of over 100 high-impact failure events in a global-scale content provider encompassing several data centers and two WANs, we quantify several dimensions of availability failures. We find that failures are evenly distributed across different network types and planes, but that a large number of failures happen when a management operation is in progress within the network. We discuss some of these failures in detail, and also describe our design principles for high availability motivated by these failures, including using defense in depth, maintaining consistency across planes, failing open on large failures, carefully preventing and avoiding failures, and assessing root cause quickly. Our findings suggest that, as networks become more complicated, failures lurk everywhere, and, counter-intuitively, continuous incremental evolution of the network can, when applied together with our design principles, result in a more robust network.