DirectNVM: Hardware-accelerated NVMe SSDs for High-performance Embedded Computing

Published: 10 February 2022

Abstract

With data-intensive artificial intelligence (AI) and machine learning (ML) applications surging rapidly, modern high-performance embedded systems, built on heterogeneous computing resources, critically demand low-latency and high-bandwidth data communication. As such, the emerging NVMe (Non-Volatile Memory Express) protocol, with parallel queuing, access prioritization, and optimized I/O arbitration, is becoming widely adopted as a de facto fast I/O communication interface. However, effectively exploiting the potential of modern NVMe storage proves nontrivial and demands fine-grained control, high processing concurrency, and application-specific optimization. Fortunately, modern FPGA devices, capable of efficient parallel processing and application-specific programmability, readily meet the underlying physical-layer requirements of the NVMe protocol, and therefore provide unprecedented opportunities for implementing a rich-featured NVMe middleware that benefits modern high-performance embedded computing.
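The protocol features named above center on NVMe's per-core submission/completion queue pairs and doorbell registers. As a concrete illustration (not drawn from this article), the C sketch below shows the 64-byte submission queue entry layout and the doorbell write that publishes a new queue tail, following the public NVMe 1.x specification; the helper name and the assumption that BAR0 is mapped at `regs` with a doorbell stride of 0 are ours.

```c
/* Illustrative only: one 64-byte NVMe submission queue entry (SQE) and the
 * tail-doorbell write that hands new commands to the controller, per the
 * public NVMe 1.x specification. */
#include <stdint.h>

struct nvme_sqe {                 /* one 64-byte submission queue entry */
    uint8_t  opcode;              /* e.g., 0x02 = Read, 0x01 = Write (NVM set) */
    uint8_t  flags;
    uint16_t cid;                 /* command ID, echoed back in the completion */
    uint32_t nsid;                /* namespace identifier */
    uint64_t rsvd;
    uint64_t mptr;                /* metadata pointer */
    uint64_t prp1, prp2;          /* physical-region-page data pointers */
    uint32_t cdw10, cdw11, cdw12, cdw13, cdw14, cdw15; /* command-specific words */
};

/* Ring submission queue `qid`'s tail doorbell (SQyTDBL, offset 0x1000,
 * assuming doorbell stride 0 and BAR0 mapped at `regs`). */
static inline void sq_ring_doorbell(volatile uint32_t *regs,
                                    unsigned qid, uint16_t new_tail)
{
    regs[0x1000 / 4 + 2 * qid] = new_tail;
}
```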
In this article, we present how to rethink existing access mechanisms for NVMe storage and devise innovative hardware-assisted solutions that accelerate NVMe data access for high-performance embedded computing systems. Our key idea is to exploit the massively parallel I/O queuing capability of NVMe storage by leveraging FPGAs’ reconfigurability and native hardware computing power, operating transparently to the main processor. Specifically, our DirectNVM system aims to provide effective hardware constructs for high-performance and scalable userspace storage applications by (1) hardening all the essential NVMe driver functionality, thereby avoiding expensive OS syscalls and enabling zero-copy data access from the application, (2) relying on hardware rather than OS-level interrupts for I/O communication control, which significantly reduces both total I/O latency and its variance, and (3) exposing application-specific weighted-round-robin I/O traffic scheduling to the userspace.
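The article does not spell out its userspace API, so the following C sketch is purely illustrative of how the three constructs could surface to an application; all names (dnvm_req, dnvm_submit, dnvm_wait) are hypothetical. The essential pattern: the FPGA owns the NVMe queues, the application posts descriptors pointing at its own DMA-visible buffers (zero-copy), waits by polling a completion flag rather than taking an interrupt, and tags each request with a weighted-round-robin class for the hardware arbiter.

```c
/* Hypothetical userspace view of the three DirectNVM constructs; the names
 * are ours, not the article's. No syscall appears on the I/O path. */
#include <stdint.h>

enum wrr_class { WRR_URGENT, WRR_HIGH, WRR_MEDIUM, WRR_LOW }; /* NVMe WRR priorities */

struct dnvm_req {
    uint64_t lba;            /* starting logical block address */
    uint32_t nblocks;        /* transfer length in logical blocks */
    void    *buf;            /* pinned, DMA-visible application buffer (zero-copy) */
    enum wrr_class cls;      /* class the hardware WRR arbiter schedules by */
    volatile uint32_t done;  /* set by hardware when the transfer completes */
};

/* Post a request into a memory-mapped hardware slot; a single store hands
 * ownership to the FPGA. */
static void dnvm_submit(struct dnvm_req *volatile *hw_slot, struct dnvm_req *r)
{
    r->done = 0;
    *hw_slot = r;
}

/* Completion by polling instead of an OS interrupt: the thread never leaves
 * userspace, which keeps both total latency and its variance low. */
static void dnvm_wait(struct dnvm_req *r)
{
    while (!r->done)
        ;  /* busy-wait; real code would bound this or yield */
}
```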
To validate our design methodology, we developed a complete DirectNVM system on the Xilinx Zynq MPSoC architecture, which incorporates a high-performance application processor (APU) equipped with DDR4 system memory and a hardened configurable PCIe Gen3 block in its programmable logic. We then measured the storage bandwidth and I/O latency of both our DirectNVM system and a conventional OS-based system running the standard FIO benchmark suite [2]. Compared against the PetaLinux built-in kernel driver running on the Zynq MPSoC, DirectNVM achieves up to 18.4× higher throughput and up to 4.5× lower latency. To ensure a fair comparison, we also measured DirectNVM against the Intel SPDK [26], a highly optimized userspace asynchronous NVMe I/O framework, running on an x86 PC. Our experimental results show that DirectNVM, even running on an embedded ARM processor considerably less powerful than the PC’s full-scale AMD processor, achieves up to 2.2× higher throughput and 1.3× lower latency. Furthermore, with a multi-threaded test case, we demonstrate that DirectNVM’s weighted-round-robin scheduling can effectively balance bandwidth allocation between latency-constrained frontend applications and other backend applications in real-time systems. Finally, we develop a theoretical performance model based on classic queueing theory that quantitatively relates a system’s I/O performance to its I/O implementation.
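To give a flavor of the kind of queueing-theoretic model the last sentence refers to (the article builds on finite-capacity M/D/1 results such as [20]), the sketch below evaluates the textbook infinite-capacity M/D/1 mean response time, treating the SSD as a single server with deterministic service time. This is our illustrative stand-in, not the article's exact model, and the numbers are made up.

```c
/* Minimal numeric sketch: mean response time of an M/D/1 queue (Poisson
 * arrivals, deterministic service), T = D + rho*D / (2*(1 - rho)), where D is
 * the fixed per-request service time and rho = lambda*D < 1 is utilization. */
#include <stdio.h>

double md1_response_time(double lambda, double service_time)
{
    double rho = lambda * service_time;      /* server utilization, must be < 1 */
    return service_time + rho * service_time / (2.0 * (1.0 - rho));
}

int main(void)
{
    /* Example: 10 us device service time at 80% load (lambda in req/us)
     * gives a mean response time of 30 us. */
    printf("T = %.2f us\n", md1_response_time(0.08, 10.0));
    return 0;
}
```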

References

[2]
Jens Axboe. 2020. Flexible I/O. Retrieved from https://github.com/axboe/fio.
[3]
Matias Bjørling, Jens Axboe, David Nellans, and Philippe Bonnet. 2013. Linux block IO: introducing multi-queue SSD access on multi-core systems. In Proceedings of the 6th International Systems and Storage Conference. 1–10.
[4]
Opsero Electronic Design. 2020. FPGA Drive FMC. Retrieved from https://opsero.com/product/fpga-drive-fmc-dual/.
[6]
OpenPOWER Accelerator Work Group. 2020. CAPI Storage, Network, and Analytics Programming (SNAP) Framework. Retrieved from https://developer.ibm.com/linuxonpower/capi/snap.
[7]
Shashank Gugnani, Xiaoyi Lu, and Dhabaleswar K. Panda. 2019. Analyzing, modeling, and provisioning QoS for NVMe SSDs. In Proceedings of the 11th IEEE/ACM International Conference on Utility and Cloud Computing. 247–256.
[8]
Jeremy Hsu. 2018. It’s Time to Think Beyond Cloud Computing. Retrieved from https://www.wired.com/story/its-time-to-think-beyond-cloud-computing/.
[11]
Intel. 2020. Open Programmable Accelerator Engine. Retrieved from https://opae.github.io/latest/index.html.
[12]
Yangwook Kang, Yang-suk Kee, Ethan L. Miller, and Chanik Park. 2013. Enabling cost-effective data processing with smart SSD. In Proceedings of the IEEE 29th Symposium on Mass Storage Systems and Technologies (MSST). IEEE, 1–12.
[13]
Hyeong-Jun Kim, Young-Sik Lee, and Jin-Soo Kim. 2016. NVMeDirect: A user-space I/O framework for application-specific optimization on NVMe SSDs. In Proceedings of the 8th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage’16).
[14]
László Lakatos, László Szeidl, and Miklós Telek. 2013. Introduction to Queueing Systems with Telecommunication Applications. Springer.
[15]
Damien Le Moal. 2017. I/O latency optimization with polling. In Proceedings of the Vault Linux Storage and Filesystems Conference.
[16]
Till Miemietz, Hannes Weisbach, Michael Roitzsch, and Hermann Härtig. 2019. K2: Work-constraining scheduling of NVMe-attached storage. In Proceedings of the IEEE Real-time Systems Symposium (RTSS). IEEE, 56–68.
[17]
Arslan Munir, Sanjay Ranka, and Ann Gordon-Ross. 2012. High-performance energy-efficient multicore embedded computing. IEEE Trans. Parallel Distrib. Syst. 23 (May 2012), 684–700.
[18]
Zhenyuan Ruan, Tong He, and Jason Cong. 2019. INSIDER: Designing in-storage computing system for emerging high-performance drive. In Proceedings of the USENIX Annual Technical Conference (USENIXATC’19). 379–394.
[20]
Dong Won Seo. 2014. Explicit formulae for characteristics of finite-capacity M/D/1 queues. ETRI J. 36, 4 (2014), 609–616.
[21]
Athanasios Stratikopoulos. 2019. Low Overhead & Energy Efficient Storage Path for Next Generation Computer Systems. Ph.D. Dissertation. University of Manchester.
[22]
Athanasios Stratikopoulos, Christos Kotselidis, John Goodacre, and Mikel Luján. 2018. FastPath: Towards wire-speed NVMe SSDs. In Proceedings of the 28th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 170–177.
[26]
Ziye Yang, James R. Harris, Benjamin Walker, Daniel Verkamp, Changpeng Liu, Cunyin Chang, Gang Cao, Jonathan Stern, Vishal Verma, and Paul E. Luse. 2017. SPDK: A development kit to build high performance storage applications. In Proceedings of the IEEE International Conference on Cloud Computing Technology and Science (CloudCom). IEEE, 154–161.


Published In

ACM Transactions on Embedded Computing Systems  Volume 21, Issue 1
January 2022
288 pages
ISSN:1539-9087
EISSN:1558-3465
DOI:10.1145/3505211

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 10 February 2022
Accepted: 01 April 2021
Revised: 01 March 2021
Received: 01 December 2020
Published in TECS Volume 21, Issue 1


Author Tags

  1. NVMe
  2. SSDs
  3. FPGA
  4. high-throughput high-performance computing

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • Defense Advanced Research Projects Agency (DARPA)
