DOI: 10.1145/3422604.3425949
research-article
Public Access

Towards Verified Self-Driving Infrastructure

Published: 04 November 2020

Abstract

Modern "self-driving" service infrastructures consist of a diverse collection of distributed control components providing a broad spectrum of application- and network-centric functions. The complex and non-deterministic nature of the interactions among these components leads to failures, ranging from subtle gray failures to catastrophic service outages, that are difficult to anticipate and repair.
Our goal is to call attention to the need for a formal understanding of dynamic service infrastructure control. We provide an overview of several incidents reported by large service providers as well as issues in a popular orchestration system, identifying key characteristics of the systems and their failures. We then propose a verification approach in which we treat abstract models of control components and the environment as parametric transition systems and leverage symbolic model checking to verify safety and liveness properties, or to propose safe configuration parameters. Our preliminary experiments show that our approach is effective in analyzing complex failure scenarios with acceptable performance overhead.
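To make the idea of a parametric transition system concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical rolling-update controller with one parameter, `max_unavailable`, composed with an environment that may fail at most one replica. It uses explicit-state breadth-first search as a simple stand-in for symbolic model checking, checks a single safety property (the number of healthy replicas never drops below `MIN_HEALTHY`), and proposes safe parameter values by enumeration. All names, constants, and transition rules are illustrative assumptions; liveness properties are not covered.

```python
from collections import deque
from typing import Dict, Iterator, List, Optional, Tuple

# State: (healthy replicas, replicas still to update, environment faults left).
State = Tuple[int, int, int]

TOTAL = 5        # hypothetical cluster size
MIN_HEALTHY = 3  # safety threshold: never fewer healthy replicas than this


def successors(state: State, max_unavailable: int) -> Iterator[State]:
    """Nondeterministic moves of the controller, recovery, and environment."""
    healthy, pending, faults = state
    down = TOTAL - healthy
    # Controller: take one more replica down for update if the budget allows.
    if pending > 0 and down < max_unavailable:
        yield (healthy - 1, pending - 1, faults)
    # Recovery: one replica that is down comes back up.
    if down > 0:
        yield (healthy + 1, pending, faults)
    # Environment: an unplanned failure, limited by the remaining fault budget.
    if faults > 0 and healthy > 0:
        yield (healthy - 1, pending, faults - 1)


def is_safe(state: State) -> bool:
    return state[0] >= MIN_HEALTHY


def find_counterexample(max_unavailable: int) -> Optional[List[State]]:
    """Return a shortest trace to an unsafe state, or None if none is reachable."""
    init: State = (TOTAL, TOTAL, 1)
    parent: Dict[State, Optional[State]] = {init: None}
    queue = deque([init])
    while queue:
        state = queue.popleft()
        if not is_safe(state):
            # Walk parent links back to the initial state to rebuild the trace.
            trace: List[State] = []
            cur: Optional[State] = state
            while cur is not None:
                trace.append(cur)
                cur = parent[cur]
            return trace[::-1]
        for nxt in successors(state, max_unavailable):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None


def safe_parameters(candidates: range) -> List[int]:
    """Parameter values for which no reachable state violates safety."""
    return [p for p in candidates if find_counterexample(p) is None]
```

In this toy model, `safe_parameters(range(0, 4))` returns `[0, 1]`, and `find_counterexample(2)` returns a four-state trace in which two planned take-downs coincide with one unplanned failure. A symbolic model checker would explore the same kind of state space without enumerating states one by one.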



Published In

HotNets '20: Proceedings of the 19th ACM Workshop on Hot Topics in Networks
November 2020
228 pages
ISBN:9781450381451
DOI:10.1145/3422604
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. parameter synthesis
  2. self-driving infrastructure
  3. service infrastructure control
  4. symbolic model checking
  5. verification

Qualifiers

  • Research-article

Funding Sources

  • Maryland Procurement Office
  • National Science Foundation

Conference

HotNets '20

Acceptance Rates

Overall Acceptance Rate 110 of 460 submissions, 24%
