DOI: 10.1145/3341301.3359657
research-article

Snap: a microkernel approach to host networking

Published: 27 October 2019

Abstract

This paper presents our design and experience with a microkernel-inspired approach to host networking called Snap. Snap is a userspace networking system that supports Google's rapidly evolving needs with flexible modules that implement a range of network functions, including edge packet switching, virtualization for our cloud platform, traffic shaping policy enforcement, and a high-performance reliable messaging and RDMA-like service. Snap has been running in production for over three years, supporting the extensible communication needs of several large and critical systems.
Snap enables fast development and deployment of new networking features, leveraging the benefits of address space isolation and the productivity of userspace software development together with support for transparently upgrading networking services without migrating applications off of a machine. At the same time, Snap achieves compelling performance through a modular architecture that promotes principled synchronization with minimal state sharing, and supports real-time scheduling with dynamic scaling of CPU resources through a novel kernel/userspace CPU scheduler co-design. Our evaluation demonstrates over 3x Gbps/core improvement compared to a kernel networking stack for RPC workloads, software-based RDMA-like performance of up to 5M IOPS/core, and transparent upgrades that are largely imperceptible to user applications. Snap is deployed to over half of our fleet of machines and supports the needs of numerous teams.

Published In

SOSP '19: Proceedings of the 27th ACM Symposium on Operating Systems Principles
October 2019
615 pages
ISBN:9781450368735
DOI:10.1145/3341301
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

In-Cooperation

  • USENIX Association

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. RDMA
  2. datacenter
  3. microkernel
  4. network stack

Qualifiers

  • Research-article

Conference

SOSP '19
SOSP '19: ACM SIGOPS 27th Symposium on Operating Systems Principles
October 27 - 30, 2019
Huntsville, Ontario, Canada

Acceptance Rates

Overall Acceptance Rate 174 of 961 submissions, 18%
