DOI: 10.1145/3126908.3126970
research-article

sPIN: High-performance streaming Processing In the Network

Published: 12 November 2017

Abstract

Optimizing communication performance is imperative for large-scale computing because communication overheads limit the strong scalability of parallel applications. Today's network cards contain rather powerful processors optimized for data movement. However, these devices are limited to fixed functions, such as remote direct memory access. We develop sPIN, a portable programming model to offload simple packet processing functions to the network card. To demonstrate the potential of the model, we design a cycle-accurate simulation environment by combining the network simulator LogGOPSim and the CPU simulator gem5. We implement offloaded message matching, datatype processing, and collective communications and demonstrate transparent full-application speedups. Furthermore, we show how sPIN can be used to accelerate redundant in-memory filesystems and several other use cases. Our work investigates a portable packet-processing network acceleration model similar to compute acceleration with CUDA or OpenCL. We show how such network acceleration enables an ecosystem that can significantly speed up applications and system services.

Published In

SC '17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2017
801 pages
ISBN:9781450351140
DOI:10.1145/3126908
  • General Chair: Bernd Mohr
  • Program Chair: Padma Raghavan
© 2017 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Sponsors

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Conference

SC '17

Acceptance Rates

SC '17 Paper Acceptance Rate 61 of 327 submissions, 19%;
Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
