Open access

SmartIO: Zero-overhead Device Sharing through PCIe Networking

Published: 08 July 2021

Abstract

The large variety of compute-heavy and data-driven applications accelerates the need for a distributed I/O solution that enables cost-effective scaling of resources between networked hosts. For example, in a cluster system, different machines may have various devices available at different times, but moving workloads to remote units over the network is often costly and introduces large overheads compared to accessing local resources. To facilitate I/O disaggregation and device sharing among hosts connected using Peripheral Component Interconnect Express (PCIe) non-transparent bridges, we present SmartIO. NVMe drives, GPUs, network adapters, or any other standard PCIe devices may be borrowed and accessed directly, as if they were local to the remote machines. We provide capabilities beyond existing disaggregation solutions by combining traditional I/O with distributed shared-memory functionality, allowing devices to become part of the same global address space as cluster applications. SmartIO removes software entirely from the data path and enables simultaneous sharing of a device among application processes running on remote hosts. Our experimental results show that I/O devices can be shared with remote hosts at native PCIe performance. Thus, compared to existing device distribution mechanisms, SmartIO provides more efficient, low-cost resource sharing, increasing overall system performance.
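To make the "as if local" idea concrete, the sketch below illustrates the general concept rather than SmartIO's actual API: assuming the lending mechanism has already made the borrowed device appear as a hot-added device in the borrowing host's local PCIe tree, an ordinary local process can map the device's register BAR through sysfs and access it directly, with MMIO traffic crossing the NTB without any software on the lending side in the data path. The PCI address 0000:05:00.0 and the 4 KiB mapping size are hypothetical placeholders.

```c
/*
 * Conceptual sketch, not the SmartIO API: after a remote device has been
 * borrowed over the PCIe NTB fabric, it shows up in the local PCIe tree
 * like any hot-added device, so standard OS mechanisms apply. Here we mmap
 * BAR0 of such a device via sysfs and read one 32-bit register.
 * The BDF address 0000:05:00.0 is a placeholder.
 */
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char *bar0 = "/sys/bus/pci/devices/0000:05:00.0/resource0";

    int fd = open(bar0, O_RDWR | O_SYNC);
    if (fd < 0) {
        perror("open BAR0");
        return EXIT_FAILURE;
    }

    /* Map the first 4 KiB of the device's register BAR into our address space. */
    volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (regs == MAP_FAILED) {
        perror("mmap BAR0");
        close(fd);
        return EXIT_FAILURE;
    }

    /* MMIO accesses now go straight over the NTB to the borrowed device;
     * no software on the lending host is involved in the data path. */
    printf("register 0x0 = 0x%08" PRIx32 "\n", regs[0]);

    munmap((void *)regs, 4096);
    close(fd);
    return EXIT_SUCCESS;
}
```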



Published In

ACM Transactions on Computer Systems, Volume 38, Issue 1-2
May 2020
178 pages
ISSN: 0734-2071
EISSN: 1557-7333
DOI: 10.1145/3474395
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 July 2021
Accepted: 01 April 2021
Revised: 01 February 2021
Received: 01 July 2020
Published in TOCS Volume 38, Issue 1-2


Author Tags

  1. Device Lending
  2. GPU
  3. I/O disaggregation
  4. NTB
  5. NVMe
  6. PCIe
  7. Resource sharing
  8. cluster architecture
  9. composable infrastructure
  10. distributed I/O

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Norges Forskningsråd

