
Architectural Support for Sharing, Isolating and Virtualizing FPGA Resources

Published: 21 May 2024

Abstract

FPGAs are increasingly popular in cloud environments for their ability to offer on-demand acceleration and improved compute efficiency. Providers would like to increase utilization by multiplexing customers on a single device, similar to how processing cores and memory are shared. Nonetheless, multi-tenancy still faces major architectural limitations, including: (a) inefficient sharing of memory interfaces across hardware tasks (HT), exacerbated by technological limitations and peculiarities, (b) insufficient solutions for performance and data isolation and high quality of service, and (c) absent or simplistic allocation strategies to effectively distribute external FPGA memory across HT. This article presents a full-stack solution for enabling multi-tenancy on FPGAs. Specifically, our work proposes an intra-fpga virtualization layer to share FPGA interfaces and resources across tenants. To achieve efficient inter-connectivity between virtual FPGAs (vFPGAs) and external interfaces, we employ a compact network-on-chip architecture to optimize resource utilization. Dedicated memory management units implement the concept of virtual memory in FPGAs, providing mechanisms to isolate the address space and enable memory protection. We also introduce a memory segmentation scheme to effectively allocate FPGA address space and enhance isolation through hardware-software support, while preserving the efficacy of memory transactions. We assess our solution on an Alveo U250 Data Center FPGA Card, employing 10 real-world benchmarks from the Rodinia and Rosetta suites. Our framework preserves the performance of HT from a non-virtualized environment, while enhancing the device aggregate throughput through resource sharing: up to 3.96x in isolated and up to 2.31x in highly congested settings, where an external interface is shared across four vFPGAs. Finally, our work ensures high quality of service, with HT achieving up to 0.95x of their native performance, even when resource sharing introduces interference from other accelerators.

1 Introduction

Field Programmable Gate Arrays (FPGAs) have become a popular choice for accelerating diverse workloads, including neural networks [10, 38], data analytics [4, 55], databases [31, 36], quantitative finance [47], linear algebra [46, 50], security [48], and compression [34]. Their ability to be reconfigured at runtime and their high energy efficiency make them an effective alternative to ASICs and GPUs. This has led major cloud providers like Amazon [3] and Alibaba [2] to incorporate FPGAs into their infrastructure and offer on-demand acceleration to their customers.
Modern FPGAs [45] have witnessed a substantial increase in available resources owing to two key factors: (a) the integration of multiple distinct dies into a unified substrate, and (b) the inclusion of multiple memory controllers to enhance parallel access to external FPGA memory. However, hardware designs fall short in harnessing this potential. Large monolithic kernels are affected by placement and routing constraints, especially when spanning multiple dies, while smaller hardware accelerators inefficiently utilize the available memory and reconfigurable resources. As a solution, cloud providers aim to optimize their return on investment by effectively distributing FPGAs among multiple customers, similar to the virtualization and sharing of CPUs and memory resources.
Enabling FPGA resource sharing among multiple tenants requires the concurrent execution of hardware tasks (HT) to accelerate their respective workloads. However, this presents challenges in establishing the necessary architectural support for a multi-tenant environment. First, spatial multiplexing should not compromise the isolation or the native performance of accelerators (challenge 1), thus providing the illusion of exclusivity to tenants. The illusion of exclusivity involves two key aspects: performance isolation covers the interference effects among hardware accelerators caused by shared interconnections, while data isolation concerns the ability of hardware accelerators to compromise sensitive information or access privileged interfaces. Existing environments allow HT to operate without supervision, granting them privileged access to I/O interfaces and FPGA resources (challenge 2). Traditional interconnecting methods fall short on modern FPGAs, as they lack scalability and flexibility; they hinder the ability to attain high aggregate throughput and to provide sufficient simultaneous acceleration to tenants (challenge 3). Finally, efficient allocation schemes are essential to distribute FPGA resources across tenants (challenge 4), as poor allocation may limit the inherent performance of accelerators or compromise the isolation of tenants.
Previous research efforts have yielded valuable insights into enabling multi-tenancy on FPGAs. Nonetheless, they encounter several architectural limitations and only partially address the challenges associated with FPGA sharing. For instance, overlay techniques [11, 27, 28] lack the reconfigurability to map new hardware accelerators onto FPGA devices and limit their native performance. Some works [6, 9, 17, 30] propose static allocation of memory and reconfigurable resources or partitioning of I/O and memory interfaces across HT. These approaches limit performance and the design flexibility of hardware developers. Other works [6, 9, 51, 52] also lack virtualization support to effectively and securely share external memory and I/O interfaces across HT, while others [17, 21] propose a static multiplexing of I/O and memory interfaces, which does not effectively address challenge 3. In addition, allocation schemes on FPGAs are still limited and are only effective either on shared-memory [25] or cache-coherent [21] FPGA platforms, where they are strongly coupled with the host CPU. Therefore, as Table 1 outlines, challenges related to multi-tenancy and efficient architectural support for FPGA sharing have not been fully addressed or remain unresolved.
Table 1.
Research | Performance & Data Isolation (challenge 1) | Intra-FPGA Virtualization (challenge 2) | FPGA Scalability (challenge 3) | FPGA Address Space Management (challenge 4)
Cloud FPGA [6, 9] | support | no support | no support | no support
AmorphOS [17] | support | no support | partial support (static I/O partitioning) | partial support (static segments)
Coyote [21] | support | support | partial support (static I/O multiplexing) | partial support (applicable on cache-coherent FPGAs)
Optimus [25] | support | not applicable | support | partial support (applicable on shared-memory FPGAs)
Vital [51] | partial support (only performance isolation) | no support | support | no support
VenOS [30] | support | no support | support | partial support (static segments)
Overlays [27, 28] | partial support (only performance isolation) | no support | support | no support
This Work | support | support | support | support
Table 1. Comparison of Previous Works on FPGA Sharing
Existing works do not focus on providing efficient architectural support for multiplexing distinct tenants on a single FPGA device. Our work provides the most complete and flexible set of virtualization features.
In this article, we present our full-stack solution to effectively provide architectural support for virtualizing and sharing FPGA resources to simultaneously accelerate workloads from distinct tenants. Our work inserts an intra-fpga virtualization layer to allow efficient sharing of external interfaces among hardware accelerators and provide mechanisms for data isolation and protection of I/O interfaces, thereby offering the illusion of operating in a single-tenant environment (challenge 1). This layer relies on a network-on-chip (NoC) architecture, which offers a flexible and scalable design to share FPGA resources and interfaces across HT. NoCs overcome the limitations of static configurations from previous works, optimizing the aggregate throughput of an FPGA and enhancing its utilization by multiple tenants (challenge 3). Our work deviates from creating an overlay architecture with preconfigured processing elements. Instead, we harness NoCs as an interconnecting method for efficient sharing of I/O interfaces and an effective layer to accommodate tenants.
Each hardware accelerator is coupled with a task memory management unit (TMMU) that virtualizes the external FPGA memories and enables tasks to operate using virtual addresses. This mechanism provides a flexible way to utilize virtual FPGAs (vFPGAs) in a similar manner as virtual machines in conventional processing units. Each hardware task within a vFPGA operates within its own logical address space, while the TMMU takes charge of translating virtual addresses into physical ones and preventing any unauthorized access to I/O interfaces or privileged address regions. Additionally, virtual addresses conceal the physical data location, inserting a higher level of abstraction and memory protection. By combining TMMUs with the intra-fpga virtualization layer, hardware accelerators operate in a non-privileged execution mode (challenge 2), without compromising their inherent performance. Our framework allows HT to operate as distinct processes, enhancing FPGA utilization and aggregate throughput.
Finally, to effectively handle the FPGA address space (challenge 4), we implement a custom memory segmentation scheme. Previous work [21, 25] has shown how internal fragmentation of pages can impact hardware task performance, mainly due to inefficient DMA transactions. This limitation is effectively addressed through segments, as tasks execute data transactions with similar efficiency to single-tenant environments. Our segmentation scheme arranges data either sequentially within FPGA memory or distributes them across a few segments, resembling the concept of huge pages in operating systems. Memory segments make it possible to effectively isolate the address space across tenants and facilitate data isolation through hardware-software support. Concurrently, TMMUs enable HT to perceive the address space as sequential, rendering the segmentation scheme transparent to users.
In summary, our work makes the following contributions:
We present a novel architecture (Section 2) for virtualizing and sharing external I/O interfaces across hardware accelerators. Our approach involves the insertion of an intra-fpga virtualization layer (Section 3), enabling scalable and flexible sharing of FPGA resources through a NoC architecture. This setup allows us to establish effective isolation mechanisms, provide high quality of service, and offer the illusion of exclusivity to tenants.
We propose the adoption of memory management units (Section 3.3) to virtualize external FPGA memories, allowing hardware accelerators to operate using virtual addresses. The primary aim is to confine accelerators to user-level execution, thereby restricting their access to unauthorized I/O interfaces or privileged address spaces. TMMUs enhance isolation between tenants and I/O protection, without compromising the native performance of hardware accelerators.
We propose a novel memory segmentation scheme (Section 4.2) to optimize FPGA address space management. This scheme ensures hardware accelerators can maintain memory transaction efficiency, while isolating the address space for each tenant. Data isolation is facilitated through hardware/software support.
Results indicate that accelerators maintain their inherent performance from a dedicated, non-virtualized setting (Section 5.3). Our architecture achieves aggregate throughput of up to 3.96x in isolated and up to 2.31x in highly congested conditions, employing four HT (Section 5.4). It also provides high quality of service when tasks share FPGA resources, achieving performance levels of ~0.95x of their native performance in an isolated environment (Section 5.5).

2 Foundations

Our framework addresses the challenges of a multi-tenant environment by employing a SW/HW strategy. This approach involves coupling a conventional CPU with an FPGA over a peripheral bus such as PCIe, as shown in Figure 1. This naturally divides any design into hardware components operating within the FPGA device, referred to as the Hardware Stack, and software components executing on the host machine as part of the operating system or support libraries, denoted as the Software Stack. The dynamic configuration of the FPGA device with new HT further divides the Hardware Stack into two distinct categories: (a) the static region, configured during FPGA device boot time, and (b) vFPGAs, configurable at runtime with new hardware processes. This classification provides clarity and applies across FPGA datacenter deployments.
Fig. 1.
Fig. 1. HT are configured within vFPGAs to provide parallel, on-demand acceleration to software workloads. The hardware shell enables access to all FPGA I/O interfaces and bridges the communication with the host device. The intermediate hardware layer enables intra-fpga virtualization through a NoC architecture. To harness the efficiency of FPGA devices, software applications interact with the runtime, which manages the FPGA resources.

2.1 Static Region

The static region of the hardware stack provides the necessary architectural support to enable multi-tenancy on FPGAs and facilitate parallel acceleration of workloads. In our work, the static region includes the Hardware Shell and the Intermediate Hardware Layer. The Hardware Shell integrates the logic and controllers to enable interaction with the interfaces of an FPGA platform, such as memories, Network I/Os, or other external interfaces, according to device specifications. Additionally, it features an internal configuration module (e.g., ICAP) for partially reconfiguring vFPGAs with new HT. It also contains the necessary functionality for communication with the host machine. In a multi-tenant setting, the hardware shell typically incorporates all FPGA interfaces shared by HT during their execution.
Within the static region, our work introduces an Intermediate Hardware Layer to enable intra-fpga virtualization. This layer offers mechanisms for isolating HT and sharing FPGA resources without sacrificing the inherent performance of accelerators (challenge 1). A compact NoC assumes the role of interconnecting vFPGAs and external interfaces. The NoC provides a fair, flexible, and scalable architecture for sharing the interfaces exposed by the hardware shell among all active HT, offering simultaneous on-demand acceleration (challenge 3). Each vFPGA is linked through its corresponding node in the NoC, allowing our framework to oversee its execution and regulate unsupervised access to FPGA interfaces and resources (challenge 2). Unlike conventional FPGA environments, where HT establish direct connections with hardware components and acquire privileged-level execution, our work enforces isolation mechanisms through the NoC and the associated nodes, preventing direct and unauthorized communication with FPGA resources.

2.2 Reconfigurable Region

The remaining available space within the FPGA constitutes the reconfigurable area. Modern FPGAs allow the selective reconfiguration of distinct regions within the fabric at any time. In conventional FPGA systems, this region is typically dedicated to a single hardware task and is reprogrammed solely when a new application is initiated on the host machine. In multi-tenant environments, these regions, referred to as vFPGAs, allow the parallel acceleration of workloads, enhancing FPGA resource utilization. vFPGAs serve as the fundamental mechanism for temporal and spatial multiplexing of FPGA resources. They function as isolated regions within the reconfigurable fabric, assigned to tenants for on-demand acceleration. Each hardware task operates autonomously within its designated vFPGA, preventing any functional interference from other HT. All vFPGAs are interconnected via their respective nodes to the NoC, enabling HT to leverage and share external interfaces.

2.3 Software Components

The software stack is aware of the heterogeneous environment, offering suitable abstractions for user applications on the host machine to interact with the underlying FPGA device. It assumes control over FPGA resource management, dynamically allocating and managing resources as needed, while also monitoring application accesses to the Hardware Stack. Although certain aspects could potentially be implemented in hardware, delegating these responsibilities to software enhances flexibility and frees valuable FPGA space for tenant utilization. The software stack includes the System Runtime Manager, the memory segmentation scheme for distributing the FPGA address space, and all virtual/physical mappings, allowing tenants to operate using virtual addresses.
The runtime manager is built upon the PCIe driver and controls the distribution of FPGA resources across tenants. vFPGAs are multiplexed and configured using a slot-based strategy, preventing their assignment to distinct tenants until the completion of their hosted hardware task. Meanwhile, the memory segmentation scheme handles the allocation of the FPGA address space (challenge 4). This scheme facilitates data isolation through a flexible software-hardware support, simultaneously providing a virtual perspective of the FPGA address space for both software applications and HT. Applications interact with the FPGA device through software routines, ensuring a seamless and familiar deployment within our multi-tenant setting. The routines handle the allocation and configuration of vFPGA slots, allocation or release of FPGA address regions, data transactions between the host machine and the FPGA device, and the initiation of HT. Overall, they provide a comprehensive software programming flow to harness the efficiency of FPGAs.

3 Hardware Stack

Our framework considers FPGA platforms as specialized devices for accelerating software workloads. Each FPGA is equipped with a dedicated memory/network stack and establishes a connection with the host machine via a PCIe interface. Figure 2 shows the architecture of the Hardware Stack, dividing the device into three parts, each with a distinct role in our system: (a) the Hardware Shell includes the built-in logic to access all I/O and memory interfaces; (b) the Intermediate Hardware Layer facilitates intra-fpga virtualization, allowing the sharing of FPGA resources across HT; (c) the vFPGAs serve as hosts for the HT, delivering on-demand acceleration.
Fig. 2.
Fig. 2. Block diagram of the Hardware Stack. The intermediate hardware layer enables intra-fpga virtualization to share and virtualize all I/O interfaces across vFPGAs. It employs a NoC architecture to provide a flexible and scalable sharing of FPGA resources. Memory Nodes facilitate data transactions with the external memories. Gatekeepers control and manage the data flow between HT and the underlying system. TMMU virtualizes the external FPGA memories and forms an isolation layer to prevent access to privileged or unauthorized interfaces and address regions.
The Hardware Shell is essential for managing, controlling, and debugging the platform from the Software Stack. It enables access to I/O interfaces (e.g., DDR memories and the external network) and interaction with HT. The shell exposes a memory-map view of the Hardware Stack, effectively dividing the FPGA address space into privileged and user address regions. The user address region includes the external memories for storing the datasets and output results of HT. HT handle on-chip memories (e.g., BRAM or URAM) identically to single-tenant environments, as these resources are seamlessly integrated into vFPGAs. The privileged address space comprises registers for controlling and monitoring the HT, and memory modules for ensuring their normal operation. Finally, the Hardware Shell features a High-Bandwidth ICAP (HBICAP) Controller for enabling on-demand partial reconfiguration of vFPGAs with the desired bitstreams.
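For illustration, the privileged/user split of this memory map could be captured on the software side by a few base/limit constants. The following is a minimal sketch; every address and size in it is a hypothetical placeholder rather than the actual shell layout.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical memory-map constants illustrating the privileged/user split
// exposed by the Hardware Shell (all addresses and sizes are placeholders,
// not the actual Alveo U250 shell layout).
constexpr uint64_t kPrivBase = 0x00000000;            // control/status registers, HBICAP, NoC configuration
constexpr uint64_t kPrivSize = 0x00100000;            // e.g., 1 MiB of privileged space
constexpr uint64_t kUserBase = kPrivBase + kPrivSize; // user region: external DDR for HT datasets/results
constexpr uint64_t kUserSize = 64ULL << 30;           // e.g., 64 GiB of device DDR

// Only the runtime manager may touch the privileged window; HT requests are
// checked against the user window by the TMMU and Gatekeeper.
constexpr bool is_privileged(uint64_t addr) {
  return addr >= kPrivBase && addr < kPrivBase + kPrivSize;
}

int main() {
  std::printf("0x2000 privileged? %d\n", is_privileged(0x2000));         // register space
  std::printf("0x40000000 privileged? %d\n", is_privileged(0x40000000)); // user DDR region
}
```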
Our intra-fpga virtualization approach centers around the Intermediate Hardware Layer, which interconnects vFPGAs and external interfaces to efficiently share FPGA resources among tenants. It abstracts the underlying architecture, establishes isolation mechanisms, and maintains native accelerator performance, creating the illusion of exclusivity to tenants. The architecture includes a compact NoC to spatially multiplex all tenants, providing flexible, scalable, and fair sharing of external I/O interfaces. Memory Nodes handle the data transactions with memory controllers, serving all incoming requests from HT. Gatekeepers connect HT to the NoC, providing them access to external interfaces for data transactions. Together with the TMMUs, they form an isolation layer that provides data isolation and restricts unsupervised access of HT to FPGA interfaces and resources. The Hardware Task Wrapper simplifies the monitoring process by standardizing the porting and connectivity of HT in our environment, abstracting and hiding architectural complexities.
The remaining resources are organized as vFPGAs, well-isolated regions for mapping and configuring HT for on-demand acceleration. vFPGAs offer a homogeneous view of the device, overcoming challenges associated with managing and partitioning the heterogeneous pool of resources (e.g., BRAMs, FFs, DSPs, and LUTs). This approach also favors smaller accelerators over large monolithic kernels, whose performance is reduced when routing extends across die boundaries. vFPGAs are an efficient alternative solution, facilitating parallel acceleration of multiple workloads from different tenants in a single FPGA device and enhancing resource and memory utilization. They do not share reconfigurable resources or wire connections, enhancing functional isolation and reducing the risk of functional interference within our system. The Software Stack ensures the exclusive allocation of vFPGAs to tenants, guaranteeing that HT run to completion before a region is reconfigured again. Finally, direct communication between vFPGAs or with external interfaces is not permitted. Instead, all interactions occur through the Intermediate Hardware Layer and are closely monitored.

3.1 Gatekeeper

Gatekeepers act as NoC access points for vFPGAs. They handle all memory requests and data transactions involving external interfaces, establishing isolation mechanisms to prevent direct communication with FPGA resources. In conventional FPGAs, accelerators establish direct connections with hardware components and controllers, allowing unrestricted access to I/O and memory interfaces and granting them privileged-level execution. However, privileged execution poses a major challenge in multi-tenant environments. Gatekeepers add a layer of oversight and control, forcing all accesses to external interfaces to be routed through the node and effectively downgrading the execution level of a hardware task to user level. The monitoring process is simplified through the Hardware Task Wrapper, which serves as a sandboxing mechanism for standardizing connectivity with the underlying system.
The Gatekeeper manages and routes incoming and outgoing messages between the NoC and the vFPGA. The router connects the hosted hardware task with the wider system, enabling seamless communication with the Intermediate Hardware Layer. Its design is closely tied to the chosen network topology, and a detailed analysis falls outside the scope of this work. Together with the TMMU, the Gatekeeper forms an isolation layer to prevent unauthorized access to address spaces and external modules during hardware task execution. It safeguards against malicious memory requests or erroneous data, preventing their transmission towards the NoC or the hardware task. Both transmitted and received data are temporarily buffered and converted into raw or network data, based on their direction within the node. The TMMU verifies whether a request is intended for an authorized address region or an external module. If flagged as invalid, all buffers are configured to prohibit the data flow either to the vFPGA or the router, discarding all data in the process. As such, Gatekeepers eliminate any direct access to external interfaces and FPGA resources, establishing mechanisms to oversee all communication carried out within the Hardware Stack.

3.2 Hardware Task Wrapper

The Hardware Task Wrapper provides a collection of interfaces to enhance the portability and configuration of HT across vFPGAs, connecting them with the rest of the Hardware Stack. It is essential for enabling partial reconfiguration, which necessitates a standardized interface due to the locking of boundary signals on the FPGA fabric. Code portability is rare between FPGAs or even revisions of the same model, which forces developers to constantly adapt their accelerators to new hardware shells or development tools for each new FPGA device. Our hardware task wrapper addresses this issue by pushing the portability of FPGA accelerators to the language level. It allows any hardware task, given sufficient resources, to be compiled and configured on any available vFPGA, and even facilitates porting between different device models using our Hardware Stack. This enables HT to be both FPGA and vFPGA agnostic, improving portability and flexibility and offering a higher level of abstraction to tenants.
The Hardware Task Wrapper establishes a standard I/O interface between HT and Gatekeepers. This allows developers to focus on the computational aspect of their hardware task, leaving the management, communication and data transactions to the Intermediate Hardware Layer. The Wrapper abstracts the FPGA fabric, enabling HT to function independently of hardware specifics and data location. It provides streaming interfaces for I/O, an AXI-Lite interface for control and monitoring, and clock/reset signals. The streaming interfaces support a data-driven approach, reducing handshake signals and facilitating compatibility with pipeline and dataflow primitives for accelerating workloads. The wrapper allows integration of HT in both HDL and HLS languages, offering a familiar environment for developing and porting FPGA applications.
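To make the interface concrete, a hardware task ported into the wrapper could expose a Vitis HLS top level of roughly the following shape. This is a sketch under assumptions: the port names, the 512-bit stream width, and the pass-through body are illustrative, and the actual wrapper signals are defined by our Hardware Stack rather than by this example.

```cpp
#include <ap_int.h>
#include <hls_stream.h>

// Hypothetical HLS top-level conforming to the wrapper's standard interface:
// streaming data I/O, an AXI-Lite control/monitoring port, and no direct
// access to memory controllers or other I/O (handled by the Gatekeeper/TMMU).
typedef ap_uint<512> flit_t;  // assumed stream word width

void hw_task(hls::stream<flit_t>& in_stream,
             hls::stream<flit_t>& out_stream,
             ap_uint<32> num_words) {
#pragma HLS INTERFACE axis      port=in_stream
#pragma HLS INTERFACE axis      port=out_stream
#pragma HLS INTERFACE s_axilite port=num_words
#pragma HLS INTERFACE s_axilite port=return

  // Placeholder computation: stream data through, word by word.
  for (ap_uint<32> i = 0; i < num_words; ++i) {
#pragma HLS PIPELINE II=1
    out_stream.write(in_stream.read());
  }
}
```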

3.3 Task Memory Management Unit

In heterogeneous computing, external memories commonly store the vast datasets that HT access during processing. Multi-tenancy raises significant concerns related to data and performance isolation. Unlike software applications, which operate using virtual addresses, HT access external FPGA memories through physical addresses, posing potential security risks. Additionally, conventional FPGA environments allow HT to access any address region or I/O interface without supervision. These vulnerabilities could grant tenants unauthorized access to privileged address spaces, disrupt external controllers, or hinder access to I/O interfaces for other tenants.
To address these concerns, we introduce the TMMU, a dedicated module integrated and paired with each vFPGA. It virtualizes the external FPGA address space, while overseeing and restricting the accesses of HT to external FPGA interfaces. As shown in Figure 3, the TMMU ensures that HT access only allocated address regions, while allowing them to effectively manage their data within a contiguous address space, similar to software applications. Operating in the virtual address domain conceals the physical data location, enforcing memory protection and data isolation among tenants. Two modules achieve the virtualization of external memories: (a) the Task Addressable Memory, which verifies whether memory transactions fall within the allocated address regions, and (b) the Virtual/Physical Segment Translator, which converts virtual segments into physical ones and generates the necessary DMA requests for the memory nodes.
Fig. 3.
Fig. 3. The TMMU allows HT to perceive that data are stored in a contiguous way within the FPGA address space. The Task Addressable Memory verifies whether memory transactions fall within authorized address regions or interfaces. Subsequently, the Virtual/Physical Segment Translator converts virtual segments into physical ones and generates the necessary requests for memory transactions.
Incoming memory requests include the virtual address and its offset, the memory request size, and the memory operation. Initially, the Task Addressable Memory evaluates whether the transaction targets an address region associated with the executing program and falls within the corresponding virtual size. This module is built upon the principles of Content Addressable Memory, a specialized memory architecture for rapid search applications. Its contents are set during hardware task initialization from the Software Stack and are compared against the memory request. If a match is found, the module returns the associated virtual size and verifies whether the request remains within the allocated space, then forwards it for translation. Otherwise, it alerts the Gatekeeper to discard any incoming or outgoing data, preventing erroneous data from reaching the NoC or sensitive information from reaching a malicious tenant.
Figure 3 also demonstrates the ability of the TMMU to enable the seamless operation of HT within a contiguous address space, even when data are distributed across different segments. The Virtual/Physical Segment Translator creates this illusion and handles the conversion of virtual addresses into physical ones. Moreover, the TMMU enables HT to operate unaware of our memory allocation policy, discussed in Section 4.2. The translation process resembles the operation of page tables in operating systems. The virtual address points to an entry in the segment table, retrieving the physical address of the associated segment within FPGA memory. Since data may span multiple physical segments, a virtual segment may point to several addresses. The module translates incoming memory requests from HT and generates the corresponding DMA transactions that are routed towards memory nodes. In cases where requested data are stored across multiple segments, separate DMA transactions are generated. This step is essential, as memory nodes need to configure the DMA engine for each targeted segment to retrieve all requested data.
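The two TMMU stages can be summarized by the following behavioral model. It is a software sketch, not the RTL: the structure layouts, the 1 MB segment size, and the addresses in the usage example are assumptions made purely for illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Behavioral model of the TMMU (illustrative, not the hardware implementation).
constexpr uint64_t kSegSize = 1ULL << 20;  // assumed physical segment size (1 MB)

struct Region {                 // one Task Addressable Memory entry
  uint64_t virt_base;           // virtual base address handed to the HT
  uint64_t virt_size;           // allocated (virtual) size of the region
  std::vector<uint64_t> segs;   // physical base of each backing segment
};

struct DmaRequest { uint64_t phys_addr; uint64_t length; bool is_write; };

// Stage 1: Task Addressable Memory -- is the request inside an allocated region?
const Region* check(const std::vector<Region>& regions,
                    uint64_t virt_addr, uint64_t size) {
  for (const Region& r : regions)
    if (virt_addr >= r.virt_base && virt_addr + size <= r.virt_base + r.virt_size)
      return &r;                 // match: forward for translation
  return nullptr;                // no match: the Gatekeeper discards the transaction
}

// Stage 2: Virtual/Physical Segment Translator -- split a virtual request into
// one DMA transaction per physical segment it touches.
std::vector<DmaRequest> translate(const Region& r, uint64_t virt_addr,
                                  uint64_t size, bool is_write) {
  std::vector<DmaRequest> dmas;
  uint64_t off = virt_addr - r.virt_base;
  while (size > 0) {
    uint64_t seg_idx = off / kSegSize;
    uint64_t seg_off = off % kSegSize;
    uint64_t chunk   = std::min(size, kSegSize - seg_off);
    dmas.push_back({r.segs[seg_idx] + seg_off, chunk, is_write});
    off  += chunk;
    size -= chunk;
  }
  return dmas;
}

int main() {
  // Hypothetical region: 3 MB of virtual space backed by three 1 MB physical segments.
  std::vector<Region> regions = {{0x0, 3 * kSegSize, {0x1000000, 0x4000000, 0x2000000}}};
  if (const Region* m = check(regions, 0xFF000, 8192))   // request crosses a segment boundary
    for (const DmaRequest& d : translate(*m, 0xFF000, 8192, /*is_write=*/false))
      std::printf("DMA: phys 0x%llx, %llu bytes\n",
                  (unsigned long long)d.phys_addr, (unsigned long long)d.length);
  return 0;
}
```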

3.4 Memory Node

Memory Nodes manage memory requests originating from HT and facilitate data routing between the NoC and the memory controller. Each node is equipped with a DMA Engine that handles all memory operations with the external DDR channels. By configuring a DMA engine, a node orchestrates and supervises the entire transaction process with the corresponding memory channel. This design allows memory transactions to operate independently on each node, enabling concurrent access to all available channels. Memory nodes serve to decouple compute processing from memory operations, allowing HT to continue their processing uninterrupted, unless they are waiting for a memory request to be served by an external memory controller.
Incoming packets are classified either as Memory Request Messages or Data Messages. Memory requests are routed to the DMA Hardware Controller, while data messages are buffered, converted, and directed to the DMA engine. Outgoing data follow the reverse process. The DMA Hardware Controller is a lightweight yet highly efficient module within the Hardware Stack that enables DMA transactions independently of the host CPU. It allows HT within vFPGAs to indirectly initiate memory transactions with the FPGA memories, bypassing the need for software or programmer involvement from the host machine. The controller handles all memory operations within the Hardware Stack, while the TMMU maintains data coherency between tenants. Any transactions extending beyond a tenant's allocated FPGA address space are discarded, with the TMMU overseeing the generation of DMA transactions based on the allocated physical segments. This approach reduces programming complexity, with the TMMU abstracting all hardware details related to DMA transactions from the HT. The controller oversees the DMA engine, monitoring its status and interrupts, and configures transaction details such as the starting address and length. Memory nodes thus prevent direct access by HT to external DDR memories, stopping them from disrupting the normal operation of memory controllers.
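The control flow of the DMA Hardware Controller for a single request can be sketched as a short register-programming sequence. The register offsets and the tiny register model below are placeholders (the real offsets depend on the DMA engine used); the sketch only illustrates the program/start/poll sequence described above.

```cpp
#include <cstdint>
#include <cstdio>
#include <map>

// Illustrative register offsets for a memory node's DMA engine; treat these as placeholders.
constexpr uint32_t kRegAddr   = 0x00;  // physical start address
constexpr uint32_t kRegLength = 0x08;  // transfer length in bytes
constexpr uint32_t kRegCtrl   = 0x10;  // bit 0: start
constexpr uint32_t kRegStatus = 0x14;  // bit 0: done

// Tiny fake register file standing in for the memory-mapped engine, which
// "completes" a transfer as soon as it is started (sketch behaviour only).
static std::map<uint32_t, uint64_t> regs;
void reg_write(uint32_t off, uint64_t v) {
  regs[off] = v;
  if (off == kRegCtrl && (v & 0x1)) regs[kRegStatus] = 0x1;
}
uint64_t reg_read(uint32_t off) { return regs.count(off) ? regs[off] : 0; }

// Control flow for one request produced by the TMMU: program address/length,
// start the engine, and wait for completion before serving the next request.
void issue_dma(uint64_t phys_addr, uint64_t length) {
  reg_write(kRegAddr,   phys_addr);
  reg_write(kRegLength, length);
  reg_write(kRegCtrl,   0x1);
  while ((reg_read(kRegStatus) & 0x1) == 0) { /* poll; real hardware may use interrupts */ }
  std::printf("DMA of %llu bytes at 0x%llx completed\n",
              (unsigned long long)length, (unsigned long long)phys_addr);
}

int main() { issue_dma(0x40000000ULL, 1 << 20); }
```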

4 Software Stack

4.1 Programming Model

Our framework provides a set of functions to interact with the underlying multi-tenant environment and to harness the efficiency of FPGAs for workload acceleration, in a manner similar to existing languages for heterogeneous computing, such as OpenCL for Vitis [49]. The pseudocode in Listing 1 serves as an example of a software application running on the host machine, while Figure 4 depicts how each step interacts with the Hardware Stack. The majority of steps in this code resemble those in Vitis, and include the configuration of a vFPGA with a specific bitstream ①, data transfers between the host machine and the FPGA device ③⑥, initialization of hardware task arguments ④, and execution of HT ⑤. The primary distinction between the conventional programming flow and ours lies in the allocation of vFPGAs ① and memory regions ②, and the release of allocated resources ⑦, which are carried out by the System Runtime Manager. Our objective is to establish a familiar environment that aligns well with existing ones, simplifying the development and porting of software applications. All interactions with the Hardware Stack are monitored and executed through the System Runtime Manager, which manages our multi-tenant environment and abstracts its details from tenants.
Listing 1.
Listing 1. Example code of a software application running in our multi-tenant setting. Only steps 2 and 7 deviate from the existing programming flows for FPGAs.
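The body of Listing 1 is not reproduced in this text-only version, so the sketch below reconstructs the described flow with hypothetical runtime calls; the function names and handle types are placeholders for the System Runtime Manager API, stubbed out here so the seven steps can be followed end to end.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

// Hypothetical runtime API mirroring steps 1-7 of Listing 1 (names are
// placeholders, stubbed so the flow compiles and runs on its own).
struct vFPGA     { int slot; };
struct MemRegion { std::vector<unsigned char> bytes; };

vFPGA*     vfpga_alloc_and_configure(const char* bit) { std::printf("configure %s\n", bit); return new vFPGA{0}; } // step 1
MemRegion* fpga_malloc(size_t n)                      { return new MemRegion{std::vector<unsigned char>(n)}; }     // step 2
void fpga_memcpy_to_device(MemRegion* d, const void* s, size_t n) { std::memcpy(d->bytes.data(), s, n); }          // step 3
void task_set_arg(vFPGA*, int, MemRegion*)            {}                                                           // step 4
void task_run_and_wait(vFPGA*)                        { std::printf("task done\n"); }                              // step 5
void fpga_memcpy_from_device(void* d, MemRegion* s, size_t n)     { std::memcpy(d, s->bytes.data(), n); }          // step 6
void fpga_free(MemRegion* r)                          { delete r; }                                                // step 7
void vfpga_release(vFPGA* v)                          { delete v; }

int main() {
  std::vector<float> in(1024, 1.0f), out(1024);
  vFPGA* vf      = vfpga_alloc_and_configure("task.bit");               // 1: allocate a vFPGA slot, load bitstream
  MemRegion* buf = fpga_malloc(in.size() * sizeof(float));              // 2: allocate an FPGA address region
  fpga_memcpy_to_device(buf, in.data(), in.size() * sizeof(float));     // 3: host -> FPGA data transfer
  task_set_arg(vf, 0, buf);                                             // 4: pass the virtual region to the HT
  task_run_and_wait(vf);                                                // 5: start the HT and wait for completion
  fpga_memcpy_from_device(out.data(), buf, out.size() * sizeof(float)); // 6: FPGA -> host data transfer
  fpga_free(buf);                                                       // 7: release the region and the vFPGA slot
  vfpga_release(vf);
  return 0;
}
```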
Fig. 4.
Fig. 4. Overview of interactions of a software application with the Hardware Stack. Each step designates a specific operation initiated by the software application, while the System Runtime Manager oversees the normal operation of the underlying environment.

4.2 System Runtime Manager

The system runtime manager is built upon the PCIe driver to effectively distribute FPGA resources across tenants in a dynamic and on-demand manner. It alleviates tenants from the burden of controlling or managing the underlying architecture, introducing an abstraction layer between the Hardware Stack and software applications. Additionally, it oversees the data transactions between the host and the FPGA device, ensuring that unauthorized accesses to non-privileged regions are prevented. While it does not enable FPGA device virtualization, a feature handled by the built-in QDMA engine in the Hardware Stack, it effectively monitors and manages FPGA resources to optimize their utilization.
As previously stated, the Hardware Stack arranges the reconfigurable resources as vFPGAs, which provide a homogeneous view of the device despite its heterogeneous nature. vFPGAs are allocated exclusively to tenants until the hosted hardware task is completed, ensuring that multiple tenants cannot simultaneously access the same FPGA resources. This approach enhances functional isolation by preventing tenants from interfering with or delaying hardware task execution. The manager maintains mappings of vFPGAs and searches for an available slot for assignment. Subsequently, it configures the HBICAP module to map a hardware accelerator onto the designated vFPGA slot.
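A simplified model of this slot-based policy is shown below; the data structures, the four-slot assumption, and the HBICAP call are stand-ins for the manager's internal bookkeeping, not its actual implementation.

```cpp
#include <array>
#include <cstdio>
#include <string>

// Minimal model of slot-based vFPGA allocation (illustrative only).
struct Slot { bool busy = false; std::string tenant; };

struct RuntimeManager {
  std::array<Slot, 4> slots;  // assumed four vFPGA slots, as in our evaluation

  // Placeholder for programming the HBICAP with a partial bitstream.
  void hbicap_configure(int slot, const std::string& bitstream) {
    std::printf("HBICAP: loading %s into vFPGA %d\n", bitstream.c_str(), slot);
  }

  // Allocate the first free slot to a tenant and configure it; the slot stays
  // exclusively assigned until the hosted hardware task completes.
  int allocate(const std::string& tenant, const std::string& bitstream) {
    for (int i = 0; i < (int)slots.size(); ++i) {
      if (!slots[i].busy) {
        slots[i] = {true, tenant};
        hbicap_configure(i, bitstream);
        return i;
      }
    }
    return -1;  // no slot available: the request waits or is rejected
  }

  void release(int slot) { slots[slot] = {}; }  // called on task completion
};

int main() {
  RuntimeManager mgr;
  int s = mgr.allocate("tenant-A", "task_partial.bit");
  std::printf("tenant-A got vFPGA slot %d\n", s);
  mgr.release(s);
}
```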
Sharing the external FPGA memory space presents considerable challenges, particularly in terms of maintaining data isolation and native accelerator performance. Existing approaches statically partition the available address space and memory bandwidth or adopt the traditional memory pages from operating systems. However, the former approach forces tenants to confine their datasets within a limited space or restricts parallel access to external memories, resulting in significant memory wastage. Moreover, pages introduce extensive internal fragmentation that slows down the built-in DMA engines. Typically, HT access large sequential data blocks to achieve a high level of parallelism, prompting developers to constantly configure their memory management units to handle different page sizes according to their needs (e.g., 4 KB, 2 MB, or 1 GB). To address these limitations, we propose a customized memory segmentation scheme, illustrated in Figure 5(a), that enables efficient sharing of the FPGA address space.
Fig. 5.
Fig. 5. Left: Our memory segmentation scheme allows us to effectively manage the FPGA address space and isolate tenants. It implements a best-fit algorithm to find the most appropriate available segment. Our scheme also combines different physical segments to form a larger virtual segment, akin to pages in operating systems. Right: All virtual/physical mappings are kept in hash tables to quickly search and retrieve the associated entries.
To address excessive external fragmentation, a major limitation of memory segmentation algorithms, the runtime manager merges non-sequential physical blocks, forming a continuous virtual segment that fulfills the allocation request. Tenants, however, perceive their allocated space within the FPGA address space as sequential, unaware that their data may be partitioned among different segments. In addition, to prevent excessive data partitioning across multiple segments, we use a best-fit algorithm that selects the smallest available segment meeting the size requirements. This approach helps to mitigate the space overhead associated with maintaining virtual-physical mappings. Finally, our framework employs hash tables to store the physical addresses that each virtual segment points to. Hash tables provide an effective means for quickly determining whether a virtual segment exists and identifying the corresponding physical segments, as shown in Figure 5(b). This allows our runtime manager to oversee the data transactions between the host and FPGA and prevent access to privileged address regions or segments allocated by other tenants.
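The allocation policy can be illustrated with the following sketch. The segment granularity, virtual address assignment, and table layouts are simplified assumptions intended only to show the best-fit selection, the merging of physical segments into one virtual segment, and the hash-table bookkeeping described above.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <iterator>
#include <map>
#include <unordered_map>
#include <vector>

// Simplified model of the segmentation scheme (illustrative, not the runtime code).
// Free physical segments are kept ordered by size so the best fit (smallest
// segment that satisfies the request) is found quickly.
struct PhysSeg { uint64_t base, size; };

class SegmentAllocator {
  std::multimap<uint64_t, uint64_t> free_by_size_;            // size -> physical base
  std::unordered_map<uint64_t, std::vector<PhysSeg>> vmap_;   // virtual base -> backing segments (hash table)
  uint64_t next_virt_ = 0x1000;                               // hypothetical virtual address counter

 public:
  void add_free(uint64_t base, uint64_t size) { free_by_size_.emplace(size, base); }

  // Allocate 'bytes': prefer a single best-fit segment; if none is large
  // enough, combine several physical segments into one virtual segment.
  uint64_t alloc(uint64_t bytes) {
    std::vector<PhysSeg> segs;
    auto it = free_by_size_.lower_bound(bytes);      // best fit: smallest segment >= bytes
    if (it != free_by_size_.end()) {
      segs.push_back({it->second, bytes});
      if (it->first > bytes)                         // keep the leftover as a new free segment
        free_by_size_.emplace(it->first - bytes, it->second + bytes);
      free_by_size_.erase(it);
    } else {                                         // merge the largest free segments until satisfied
      uint64_t remaining = bytes;
      while (remaining > 0 && !free_by_size_.empty()) {
        auto last = std::prev(free_by_size_.end());
        uint64_t take = std::min(last->first, remaining);
        segs.push_back({last->second, take});
        if (last->first > take)
          free_by_size_.emplace(last->first - take, last->second + take);
        free_by_size_.erase(last);
        remaining -= take;
      }
      if (remaining > 0) return 0;  // out of FPGA memory (a full implementation would roll back)
    }
    uint64_t vbase = next_virt_;
    next_virt_ += bytes;
    vmap_[vbase] = segs;            // hash-table entry: virtual segment -> physical segments
    return vbase;                   // the tenant sees one contiguous virtual segment
  }

  const std::vector<PhysSeg>* lookup(uint64_t vbase) const {
    auto it = vmap_.find(vbase);
    return it == vmap_.end() ? nullptr : &it->second;
  }
};

int main() {
  SegmentAllocator a;
  a.add_free(0x00000000, 1ULL << 20);   // 1 MB free segment
  a.add_free(0x40000000, 2ULL << 20);   // 2 MB free segment
  uint64_t v = a.alloc(3ULL << 20);     // 3 MB request spans both physical segments
  if (const auto* segs = a.lookup(v))
    std::printf("virtual 0x%llx backed by %zu physical segments\n",
                (unsigned long long)v, segs->size());
}
```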

5 Experimental Evaluation

5.1 Experimental Setup

Our experimental work evaluates the feasibility of incorporating multi-tenancy into reconfigurable platforms for improved performance and efficiency. We assess this by employing a collection of HT from the widely-used Rodinia [8] and Rosetta [56] benchmark suites for heterogeneous computing. These benchmarks cover both compute- and memory-intensive applications, representing real-world scenarios. The Intermediate Hardware Layer consists of a configurable number of vFPGA slots and external DDR channels, allowing us to study the impact of FPGA sharing and memory/NoC congestion on hardware task performance. For this reason, we implement four widely-used network topologies (Crossbar, Ring, DoubleRing, and Torus) and analyze the effect of memory node placement on aggregate throughput. The Dancehall configuration (Figure 6(a)) groups all memory nodes together, while the Interleaved configuration (Figure 6(b)) places memory nodes adjacent to Gatekeepers. Crossbar is omitted from this analysis due to its all-to-all connection architecture. Our Hardware Stack is implemented on an Alveo U250 Data Center acceleration card connected to a host computer via a PCIe x16 interface. It is developed using the Vitis HLS 2020.2 and Vivado 2020.2 environments, and it operates at 250 MHz. The Software Stack is built upon the Xilinx PCIe Driver, while the host machine features an Intel Core i7-8700 CPU @ 3.20 GHz and 32 GB of DDR memory.
Fig. 6.
Fig. 6. The Dancehall configuration arranges Memory nodes in contiguous positions, and this pattern is also mirrored for Gatekeepers. In contrast, the Interleaved configuration alternates the placement of Memory nodes and Gatekeepers.

5.2 Resource Overhead and Cost of Ownership

In a multi-tenant environment, resource utilization is a crucial factor, as system overhead reduces the available reconfigurable resources, impacting the number of tenants and the resources available per vFPGA. Table 2 reports the Full System Overhead, which includes the Hardware Shell and the Intermediate Hardware Layer with eight nodes interconnected by a NoC. Results indicate that Crossbar introduces the highest overhead due to its numerous links per node, underlining the importance of adopting a less resource-hungry topology. In contrast, Ring has the lowest resource utilization, requiring only two links to connect with adjacent nodes. Torus and DoubleRing exhibit intermediate resource utilization, with a slight increase for Torus due to its complexity. Despite the observed overhead of our system, we contend that the resource utilization is acceptable, leaving almost 75% or more of the reconfigurable resources available for tenants. Resource utilization on FPGA devices is directly linked to the total on-chip power. In our study, the system demonstrates a total on-chip power of 25.436 Watts when employing the Crossbar configuration, whereas adopting the Ring topology requires roughly one watt less. Similar to the resource overhead, DoubleRing and Torus fall within the power consumption range of the preceding two topologies.
Table 2.
Topology | # of Links | Full System Overhead (BRAM % / FF % / LUT %) | Memory Node (BRAM % / FF % / LUT %) | Gatekeeper (BRAM % / FF % / LUT %) | Power (Watt)
Crossbar | 8 | 27.40 / 10.28 / 15.31 | 3.42 / 1.47 / 2.07 | 2.70 / 0.73 / 0.93 | 25.436
Torus | 4 | 23.70 / 10.23 / 15.18 | 2.95 / 1.44 / 2.01 | 2.26 / 0.67 / 0.85 | 25.015
DoubleRing | 4 | 23.30 / 10.18 / 14.98 | 2.91 / 1.40 / 1.91 | 2.21 / 0.62 / 0.78 | 24.915
Ring | 2 | 21.50 / 9.72 / 14.34 | 2.61 / 1.29 / 1.75 | 2.06 / 0.56 / 0.73 | 24.744
Table 2. The Resource Overhead and Total On-Chip Power using Four Widely-used Network Topologies; Crossbar, Torus, DoubleRing, and Ring
The configuration includes four Memory Nodes with the associated channels, and four Gatekeepers and TMMUs with empty vFPGAs. We further break down the estimated resources to illustrate the overhead of a Memory Node or a Gatekeeper and to explore the scalability of our system.
We further analyze the resource overhead of our Hardware Stack, focusing on memory nodes and gatekeepers. These parameters are crucial for scalability and compatibility with FPGA device specifications. Memory nodes, including DMA Engines and memory controllers, are slightly more expensive than Gatekeepers with their associated TMMUs. The latter contribute a 2% to 2.7% overhead, which is acceptable when multiplexing numerous hardware accelerators in an FPGA. Table 3(a) outlines the estimated resource usage for several essential hardware components in our hardware stack. These components require only a fraction of the resources, allowing the integration of an increased number of hardware accelerators or memory controllers. However, a substantial portion is allocated to interconnecting the nodes through the preferred topology. While topologies like Ring, DoubleRing, and Torus have predictable resource usage, the Crossbar topology introduces higher overhead due to all-to-all connections. This results in a larger portion of the FPGA device being used for interconnecting memory and HT. Therefore, alternative topologies are recommended, especially for FPGAs with a high number of DDR channels and vFPGAs. Future FPGAs are expected to offer a larger pool of reconfigurable resources, leveraging technologies like Xilinx SSI [35] to connect multiple distinct dies. This calls for a scalable architecture that can multiplex an increasing number of tenants without necessitating a complete system rebuild. Our Hardware Stack is designed to fulfill this requirement by enabling seamless integration of additional tenants, simply by adding extra Gatekeepers and TMMUs to the Intermediate Hardware Layer.
Table 3.
Table 3. Table (a) - Left: The Resource Overhead Primarily Arises from the Controllers to Enable the External I/O Interfaces and the NoC Architecture to Support Intra-fpga Virtualization
Finally, Table 3(b) outlines the resource usage for a single instance of each tested workload. Their execution has no impact on the functionality or the operating frequency of our intermediate hardware layer. The sole prerequisite is that the workload must adhere to the spatial constraints of vFPGA regions. Additionally, we present the total on-chip power of the FPGA device when executing our benchmark workloads under the system configuration detailed in Table 2. Our findings confirm that the Ring configuration exhibits lower power consumption than the Crossbar topology, with the total power cost being notably influenced by the spatial demands of the kernel. However, the total cost of ownership (TCO) in terms of power does not exceed 33.5 W, regardless of the kernel or configuration employed.

5.3 Performance of Hardware Tasks

To assess the effectiveness of our framework for on-demand acceleration, we conducted a comparative analysis with Vitis, a widely used FPGA environment designed for single-tenant execution. Figure 7 shows the performance of HT within our framework, normalized against the non-virtualized configuration. Our results indicate that our framework does not introduce any virtualization overhead compared to native execution. The lack of overhead can be attributed to several factors. First, address regions are allocated by our memory segmentation scheme, which ensures contiguous storage of data in physical memory, partitioned only when a physical segment cannot fully serve an allocation request. This approach facilitates efficient burst accesses, improving data transaction efficiency. Additionally, the utilization of DMA Engines and DMA Hardware Controllers further enhances memory transaction efficiency, given their status as hardware-accelerated modules.
Fig. 7.
Fig. 7. Execution time of HT in our multi-tenant setting normalized to Vitis. The results show that our framework does not introduce any performance overhead to hardware accelerators ported to our system. A breakdown analysis is also provided to obtain insights into how the runtime is distributed between internal computation and communication with the rest of the system, indicating that our framework effectively handles both compute-intensive and memory-intensive tasks.
Nevertheless, three kernels exhibit different behaviours. LUD and FD show increased performance within our framework. Their implementation allows the retrieval and storage of their entire dataset through a single DMA transaction, whether read or write. As a result, memory nodes need to configure the DMA engine only once to fulfill their memory operations. The transaction length also mitigates the overhead associated with configuring the DMA engines, concurrently delivering improved performance compared to the memory-mapped transactions utilized by Vitis. Conversely, Hotspot encounters a marginal 8% performance decline. This kernel operates on small data chunks in each iteration, leading to a substantial number of DMA transactions with the FPGA memory to retrieve and store the complete dataset. The cumulative overhead from configuring DMA engines slightly surpasses the memory-mapped procedure in Vitis.
Subsequently, we conducted a detailed analysis of hardware task performance, focusing on the runtime distribution between internal computation and communication with the Intermediate Hardware Layer for data transfers. The results indicate a slight preference of our framework for HT with large burst accesses to memory and few DMA transactions, such as kmeans, leukocyte, and nw. Nonetheless, HT achieve near-native execution performance. Our system effectively handles both compute-intensive and memory-intensive tasks, making it suitable for a wide range of applications. Developers should be aware that, in a multi-tenant setting, any potential performance degradation would arise from memory and I/O congestion, rather than architectural overhead incurred during task porting.

5.4 Aggregate Performance from FPGA Sharing

In this section, we assess the overall performance of our framework when multiple HT are executed in parallel. Our analysis aims to examine the fairness and scalability of our architecture in various FPGA sharing configurations. We investigate how network topologies influence the aggregate performance of the hardware stack and the impact of FPGA sharing on the overall task performance within our benchmark collection. The aggregate throughput serves as a metric to display the overall performance of the FPGA device compared to a non-virtualized and non-shared environment. This analysis aims to provide valuable insights into the effectiveness of our framework under different sharing scenarios.
Initially, we evaluate the performance of our framework by distributing data from tenants across distinct DDR memories. This scenario leads to increased congestion within the NoC as HT perform memory transactions simultaneously. Our results, presented in Figure 8, indicate that while several topologies can deliver optimal aggregate throughput with two memory controllers, this advantage diminishes when scaling to four memory controllers and four HT. Only the interleaved configurations of DoubleRing and Torus demonstrate near-optimal performance, with a geometric mean of 3.96x, which is almost equivalent to the Crossbar. These topologies offer significantly higher bandwidth than Ring, alleviating the congestion within the network. Furthermore, the interleaved configuration positions memory nodes adjacent to HT, enabling a faster and more direct access compared to the Dancehall configuration. This arrangement effectively mitigates pressure on the routers. The results demonstrate the effectiveness of our framework in providing a well-isolated execution environment, effectively minimizing congestion from shared interconnections.
Fig. 8.
Fig. 8. Geometric mean of the HT when data from tenants are distributed on separate memories. The results show the aggregate throughput compared to a non-shared setting, during congestion within the intermediate hardware layer, as all hardware accelerators perform memory transactions in parallel and increase the traffic in the NoC. The results demonstrate that our Hardware Stack can provide a well-isolated execution environment, in the absence of memory congestion, when using Crossbar, Torus, and DoubleRing.
Subsequently, we evaluate the aggregate throughput by grouping data from tenants within the same DDR memory, leading to increased congestion within memory nodes, as HT share the memory interfaces. The outcomes in Figure 9 show that our hardware stack achieves aggregate throughput close to 1.66x and 2.31x when running two and four instances of a hardware task in parallel, respectively. Compute-intensive HT scale efficiently with sufficient vFPGA slots. Conversely, memory-intensive tasks exhibit a performance decline, as our framework assigns only a portion of the available bandwidth for memory transactions. Additionally, the chosen topology has no impact on aggregate throughput, with the Ring topology emerging as a favorable choice for its low cost compared to other network configurations. However, when an extra DDR channel is available, Ring falls short of providing competitive performance, highlighting the importance of adopting a network topology with improved scalability and numerous links per node to sustain parallel memory accesses. Torus emerges as a favored choice, as it competes with Crossbar and offers a more cost-effective solution. By employing Torus, the addition of an extra DDR channel boosts the aggregate throughput of our system to 3.33x. The improvement is attributed to the increased performance of HT dominated by data transfers, as they utilize a larger portion of the available memory bandwidth for their transactions.
Fig. 9.
Fig. 9. Geometric mean of the HT when data from tenants are located within the same memory. The results show the aggregate throughput in memory-congested settings, as tasks simultaneously access the same memory controller. Sharing an interface among active HT results in reduced performance due to interference effects. Moreover, the topology affects the aggregate throughput only when more memory controllers are available for parallel data transactions.
In Figure 10, we extend our analysis to explore the impact of memory congestion and topology on the aggregate throughput of HT, utilizing four vFPGAs. Results indicate that compute-intensive tasks (e.g., lavamd, lud, nw, and fd) exhibit effective scalability with sufficient vFPGA slots. The Ring demonstrates competitive aggregate throughput compared to Crossbar, despite providing significantly lower network bandwidth. This observation holds even when data are distributed across two or four distinct memory channels, where HT can perform memory transactions in parallel. On the other hand, congestion within memory interfaces/nodes significantly affects the aggregate throughput of memory-intensive HT. However, we notice that small burst transactions enhance performance even in highly congested cases, as transactions are effectively interleaved, as happens with knn, leukocyte(d), and pathfinder. Finally, the results highlight the importance of choosing a topology with high network bandwidth, as scaling to two or four memory nodes for parallel memory transactions does not yield any performance benefits with the Ring topology, unlike with the Crossbar.
Fig. 10.
Fig. 10. Aggregate throughput of four vFPGAs on Ring (Dancehall) and Crossbar configurations. Memory-intensive tasks experience declined performance when using Ring during parallel accesses to two and four memory controllers. However, the knn, leukocyte(d), and pathfinder tasks exhibit slightly better scalability, as their transactions are interleaved owing to their small memory requests. Compute-intensive tasks are not affected by either the adopted topology or the presence of more memory controllers.
In brief, our results validate fundamental aspects of implementing a multi-tenant environment in reconfigurable devices. By distributing data across distinct DDR memories, our Hardware Stack attains peak aggregate throughput, providing a completely isolated execution environment to tenants. However, sharing of memory interfaces may impact the performance of HT, given that our system ensures equitable access to external interfaces. Compute-intensive tasks scale effectively given a sufficient supply of vFPGAs, while the rest of our benchmarks depend on factors like network topology or congestion when accessing I/O interfaces to achieve higher performance.

5.5 Quality of Service from Spatial Multiplexing

This section evaluates the quality of service of our intra-fpga virtualization layer, by spatially multiplexing different tenants within the same FPGA device. We conducted an experiment where HT share the same memory channel, allowing us to observe the highest level of interference among active tasks. The outcomes of our study are presented in Figure 11. Our analysis aims to offer valuable insights into the impact of concurrent execution through spatial multiplexing on the performance of HT from the perspective of tenants.
Fig. 11.
Fig. 11. Throughput of HT when running in parallel with every accelerator in our collection, normalized when running in an isolated environment. Data from tenants are stored in the same DDR memory to report the maximum level of interference that can occur between active tasks. The boxplot displays the distribution of the normalized throughput for each hardware task. The red line indicates the median value.
Applications with high computational intensity (e.g., lud, lavamd, nw, and fd) exhibit near-optimal and consistently predictable performance, with minimal deviations observed as outliers in the boxplot. This confirms the effectiveness of our framework for computationally oriented processes, as coexisting HT have little impact on their performance. This characteristic is noteworthy in our multi-tenant setup, where FPGAs excel in accelerating computationally demanding algorithms due to their parallel nature. In contrast, KNN is more susceptible to performance degradation when executed concurrently with another process. This can be attributed to its frequent memory accesses and small burst data transactions, with data transfers significantly contributing to the overall execution time. When KNN operates alone, it fully utilizes available bandwidth, but co-location with another accelerator may cause performance decline, especially if substantial bandwidth is required. Nonetheless, the median performance of KNN converges to 70%, indicating a significant portion of its initial performance is retained.
K-means and backprop demonstrate consistent and high performance, ranging from 0.78x to 0.98x, with a median close to their native performance. This can be attributed to their limited interaction with the Hardware Stack and their large data transfers per transaction (512 KB for backprop and 1 MB for k-means). The remaining HT exhibit similar behaviour, although their performance may be influenced by the presence of memory-intensive tasks; when paired with more computationally demanding algorithms, they approach their optimal performance. These findings highlight the effectiveness of our intra-fpga virtualization layer in ensuring a high quality of service. Most processes achieve near-optimal median performance, even when they interact substantially with memory. Our system prioritizes performance isolation to provide high quality of service and the illusion of exclusivity to tenants. Even in congested scenarios, where memory and I/O interfaces are shared, most hardware accelerators experience only small declines compared to their native performance.

6 Discussion

6.1 Network-on-Chip

Our framework leverages a compact NoC to effectively multiplex vFPGAs and enable simultaneous acceleration of workloads. In this work, we focus on evaluating the efficiency of NoCs as an alternative to traditional interconnection methods for enabling multi-tenancy on FPGA devices. However, our research does not delve into NoC design itself; consequently, we refrain from a comprehensive evaluation of network behaviour under different traffic patterns and other related factors. Nevertheless, our framework can be extended and integrated with existing NoC environments [16] to meet provider requirements (e.g., network bandwidth, topology, and network protocols). Our primary purpose is to isolate tenants and provide exclusivity, demonstrating the efficiency of a NoC architecture in enhancing spatial multiplexing. At the same time, it prevents HT from directly communicating with or accessing external interfaces and other vFPGAs, enhancing overall protection.

6.2 System Scalability

The availability of vFPGAs poses a persistent challenge in multi-tenancy, as a higher number leads to smaller regions with limited resources for configuring HT. During experimentation, we faced this tradeoff due to resource-hungry tasks, which made it infeasible to scale to a greater number of vFPGAs, even though the Hardware Stack requires only a small fraction of resources to enable multi-tenancy. However, future FPGAs with technologies like SSI [35] are expected to offer a greater pool of reconfigurable resources. Prior works [17, 25] have discussed the necessity of adopting a scalable solution for multiplexing an increased number of hardware accelerators. This motivates the adoption of a NoC architecture, which addresses the routing challenges posed by new FPGA devices and offers better scalability than traditional interconnection methods. Our results demonstrate that a NoC architecture achieves near-optimal aggregate throughput when HT perform parallel accesses on memory channels. However, when memory interfaces are shared, the throughput is influenced by whether an accelerator is compute-intensive or data-intensive, confirming that memory sharing is a key issue in multi-tenant environments. Our system manages the distribution of memory bandwidth while assigning address regions, by collocating data from tenants in a shared memory channel. This approach offers a flexible way to distribute and share memory bandwidth across HT through software control, such as pairing a memory-intensive task with a compute-intensive one to reduce interference. This leaves room for further optimizations by experimenting with and analyzing workload characteristics [23]. Nonetheless, selecting the appropriate topology is also crucial, as it impacts the performance of data-intensive workloads. We believe that our experiments effectively illustrate the trends in scaling each HT in both congested and isolated settings; consequently, multiplexing additional vFPGAs, whenever feasible, does not yield additional insights.
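As an illustration only, the following Python sketch shows one possible placement heuristic of the kind a software runtime could apply: memory-intensive HT are spread across channels first, and compute-intensive HT fill the remaining capacity. The function name and task classification are hypothetical and do not reflect the exact policy of our runtime manager.

```python
# Hypothetical placement heuristic: avoid collocating two bandwidth-hungry HT
# on the same memory channel; compute-bound HT tolerate sharing.
def place_tasks(tasks, num_channels):
    """tasks: list of (name, is_memory_intensive); returns {channel: [names]}."""
    channels = {c: [] for c in range(num_channels)}
    mem_bound = [t for t in tasks if t[1]]
    cpu_bound = [t for t in tasks if not t[1]]

    # Round-robin memory-bound tasks so each lands on its own channel if possible.
    for i, (name, _) in enumerate(mem_bound):
        channels[i % num_channels].append(name)

    # Compute-bound tasks are assigned to the least-loaded channel.
    for name, _ in cpu_bound:
        least_loaded = min(channels, key=lambda c: len(channels[c]))
        channels[least_loaded].append(name)
    return channels

print(place_tasks([("knn", True), ("pathfinder", True), ("lud", False), ("nw", False)], 2))
# -> {0: ['knn', 'lud'], 1: ['pathfinder', 'nw']}
```

Richer variants could weigh measured bandwidth demand instead of a binary label, as suggested by workload-characterization approaches such as [23].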

6.3 Multi-FPGA Environment

Our research focuses on sharing the resources of a single FPGA device across several tenants. However, cloud nodes may contain multiple FPGAs for workload acceleration. While the challenges specific to a scale-out environment are left for future work, we offer insight into extending our research to such a context. Typically, in a scale-out setting, FPGAs are connected through a network infrastructure, with each device possessing its own memory address space, memory bandwidth, and I/O interfaces. Consequently, managing each FPGA device as an independent entity simplifies FPGA management, treating vFPGAs and memory resources similarly to our single-FPGA test case. However, the allocation of FPGA resources should be confined within individual devices, as crossing device boundaries requires additional strategies for FPGA-to-FPGA communication and mechanisms to isolate and share the communication interfaces. Moreover, FPGA-to-FPGA communication affects the performance of HT, as data may need to traverse the network. An abstraction layer is therefore required to conceal any inter-FPGA communication, enabling developers to operate unaware of where their data reside or which FPGA accelerates their workload. Furthermore, scale-out environments require the runtime manager to prioritize equitable distribution across all active FPGAs based on the underlying workload demands, to maintain a balanced load and minimize congestion and interference resulting from shared FPGA resources. This model resembles a serverless architecture, where user functions may be triggered within vFPGAs and results are transmitted to tenants via HTTP and a network layer. Such an approach holds the potential for enhanced scalability and efficiency in accelerating workloads across multi-FPGA nodes, while enabling dynamic management of the underlying workload.
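A minimal sketch of the dispatch decision described above, assuming each FPGA is modeled as an independent pool of vFPGA slots and DDR bandwidth; the names (Fpga, dispatch) and the bandwidth bookkeeping are illustrative assumptions, not an implemented component of our framework.

```python
from dataclasses import dataclass, field

@dataclass
class Fpga:
    name: str
    free_vfpgas: int                 # unoccupied vFPGA slots on this device
    free_bandwidth_gbs: float        # rough estimate of unreserved DDR bandwidth
    tasks: list = field(default_factory=list)

def dispatch(task_name, bandwidth_need_gbs, fleet):
    # Candidates must have a free slot and enough spare bandwidth; placement
    # stays local to one device, so data never crosses the inter-FPGA network.
    candidates = [f for f in fleet
                  if f.free_vfpgas > 0 and f.free_bandwidth_gbs >= bandwidth_need_gbs]
    if not candidates:
        return None  # queue the request until a slot frees up
    target = max(candidates, key=lambda f: (f.free_vfpgas, f.free_bandwidth_gbs))
    target.free_vfpgas -= 1
    target.free_bandwidth_gbs -= bandwidth_need_gbs
    target.tasks.append(task_name)
    return target.name

fleet = [Fpga("fpga0", free_vfpgas=1, free_bandwidth_gbs=9.0),
         Fpga("fpga1", free_vfpgas=3, free_bandwidth_gbs=30.0)]
print(dispatch("knn", bandwidth_need_gbs=12.0, fleet=fleet))  # -> 'fpga1'
```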

6.4 Key Takeaways

Our work highlights three key areas for improving architectural support for multi-tenancy. First, our proposed NoC-based architecture effectively multiplexes vFPGAs, enabling parallel workload acceleration while ensuring sufficient scalability. Our results demonstrate the efficacy of NoCs in achieving high aggregate throughput, even during parallel accesses to external FPGA memories. However, we observe that the placement of nodes in certain topologies influences the overall performance, as interleaved configurations outperform dancehall ones. Moreover, factors like topology and congestion within I/O interfaces affect the performance of memory-intensive tasks. Hence, it is essential to adopt topologies with higher bandwidth or employ efficient allocation strategies when handling these accelerators. Second, we emphasize the importance of efficient virtual memory on FPGAs. Existing environments allow hardware accelerators to operate at a privileged level, granting them unrestricted access to I/O interfaces and address regions using physical addresses. Virtual memory abstraction enhances protection and conceals the physical location of data. Together with the intermediate hardware layer, it effectively shifts the execution of tasks to user level, introducing a layer of supervision when accessing I/O interfaces.
Finally, previous works have researched the use of pages to manage the FPGA address space. However, Coyote [21] shows that pages introduce significant overheads on PCIe-based systems, making them more suitable for cache-coherent systems. Similarly, Optimus [25] extends pages to shared-memory FPGAs, but its approach is not compatible with PCIe devices. In contrast, our work proposes managing the FPGA address space through a memory segmentation scheme. Our approach optimizes data transactions with memory by placing data sequentially within the physical address space or distributing them across a few segments (typically 2–3 in our experiments). It effectively reduces the number of entries needed to translate virtual segments into physical ones, while the TMMU allows HT to perceive the address space as contiguous and preserve their original performance.
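To make the segmentation idea concrete, the sketch below models a per-HT segment table as a short list of base/limit entries and translates a virtual address by adding an offset to the matching physical base. This is a conceptual illustration in Python, not the TMMU hardware; the class and field names are hypothetical.

```python
# Illustrative model of segment-based translation: each entry maps a contiguous
# virtual range onto a physical base, so only a few entries are needed per HT.
from dataclasses import dataclass

@dataclass
class Segment:
    virt_base: int   # start of the segment in the HT's virtual address space
    phys_base: int   # where the segment actually resides in FPGA DDR
    length: int      # segment size in bytes

class SegmentTable:
    def __init__(self, segments):
        # Keep segments sorted by virtual base so lookup is a simple scan.
        self.segments = sorted(segments, key=lambda s: s.virt_base)

    def translate(self, vaddr: int) -> int:
        for seg in self.segments:
            if seg.virt_base <= vaddr < seg.virt_base + seg.length:
                return seg.phys_base + (vaddr - seg.virt_base)
        raise MemoryError(f"access fault: 0x{vaddr:x} outside allocated segments")

# The HT sees a contiguous space starting at 0, while its data are split across
# two physical regions of the shared DDR channel.
table = SegmentTable([
    Segment(virt_base=0x0000_0000, phys_base=0x4000_0000, length=64 << 20),
    Segment(virt_base=0x0400_0000, phys_base=0x8000_0000, length=32 << 20),
])
print(hex(table.translate(0x0400_1000)))  # -> 0x80001000
```

Because a handful of entries covers an entire allocation, the lookup can be resolved combinationally in hardware, in contrast to page tables whose walks and TLB misses add latency on PCIe-attached devices.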

7 Related Work

FPGA Abstraction. Chen et al. [9] enable FPGA usage in the cloud through Linux-KVM in a modified OpenStack environment. hCODE [54] introduces a multi-channel shell for managing, creating, and sharing HT via independent PCIe channels. VirtualRC [18] implements a software middleware API as a virtualization layer, converting communication routines for virtual components into API calls for the physical platform. Tarafdar et al. [39] develop an FPGA hypervisor that provides access to all I/O interfaces and programs a partially reconfigurable region with desired bitstreams. Similarly, Catapult [7, 33] virtualizes FPGA resources as a common pool, enabling job scheduling on available accelerators. RACOS [40] offers a user-friendly interface for loading/unloading reconfigurable HT and transparent I/O operations. Finally, FSRF [22] abstracts FPGA I/O at a high level, enabling files to be mapped directly into FPGA virtual memory from the host. Our research integrates multiple elements from prior studies to enhance programming productivity. Specifically, we employ SR-IOV to enable FPGA virtualization, facilitated by the in-built QDMA engine. Additionally, our runtime manager simplifies access to vFPGAs, HT and I/O interfaces, thereby eliminating the need for tenants to possess in-depth knowledge of the underlying platform and hardware.
FPGA Sharing. In [6, 9], FPGA resources are shared through OpenStack using partially reconfigurable regions in both temporal and spatial domains. [43] utilizes an accelerator scheduler to match user requests with a suitable resource pool. [19] implements a hypervisor to manage bitstreams for configuring PRRs and to monitor user access to accelerators. [12] uses a hypervisor on the software stack to communicate with PRRs via a common interface in the static region, handling the configuration and allocation of regions to users. In [41], hardware accelerators are shared among multiple customers in a paravirtualized environment. ViTAL [51] and Hetero-ViTAL [52] maximize per-FPGA area utilization by segmenting designs into smaller bitstreams and mapping them onto fine-tuned slots within an FPGA cluster, supported by an augmented compiler. AmorphOS [17] implements a "low latency" mode to enable the use of vFPGAs called Morphlets. These Morphlets are managed by a user-mode library, which handles I/O interfaces and facilitates application access. Coyote [21] integrates OS abstractions within the FPGA device, making it part of the host operating system. Each hardware task is paired with a custom memory management unit and translation lookaside buffers to unify FPGA and host memory. Optimus [25], acting as a hypervisor, utilizes time-multiplexing to schedule virtual machines on pre-configured accelerators, and employs page table slicing for memory and I/O isolation. Nimblock [26] examines scheduling on shared FPGAs, aiming to improve response times and reduce deadline violations. Feniks [53] incorporates an operating system within the FPGA and includes communication stacks and modules for off-chip memory, the host CPU, the server, and other cloud resources. VenOS [30] employs a NoC architecture for sharing external memories across HT, utilizing static segments to distribute and isolate the FPGA address space among tenants. This work builds upon the principles of VenOS by introducing a virtual view of the FPGA address space and external interfaces for the HT. This approach strengthens tenant isolation by confining hardware accelerators to user-level execution and preventing unauthorized access to FPGA resources, similar to how software applications are confined by the operating system on the host machine. Additionally, memory segments allow HT to operate on virtual addresses without compromising their native performance due to high data fragmentation. Our memory scheme enables more flexible management of the address space, aligned with the requirements of tenants, in contrast to static memory and I/O partitioning. This flexible management of both memory bandwidth and address space preserves the inherent performance of HT, up to 0.95x compared to a dedicated, non-virtualized environment, even in the presence of interference from other accelerators. Overall, this work advances prior research by introducing an intra-fpga virtualization layer to effectively provide architectural support for multi-tenancy. Table 1 presents a comparative analysis, demonstrating that our work provides the most comprehensive feature set for sharing FPGAs among multiple tenants.
Accelerator Libraries. Leading cloud providers, including Amazon [3] and Microsoft [29], now offer pre-compiled hardware accelerators, enabling software applications to harness the efficiency of FPGA devices through simple routines. This approach simplifies programming complexity and enables a Software as a Service (SaaS) model that decouples application development from FPGA design optimization. Similarly, InAccel [14] facilitates large-scale data acceleration across an FPGA cluster using familiar software programming models. While our work is orthogonal to this approach, implementing a SaaS model on top of our framework is a natural fit. In this scenario, pre-compiled kernels can be executed in parallel within our Hardware Stack, effectively accelerating workloads from multiple tenants simultaneously. Adopting a SaaS model can significantly enhance and facilitate the functionality of our framework, as the HT have predefined operations set by the provider.
Overlays. Overlays offer a higher level of abstraction that allows configurations to be architecture-agnostic, ensuring code portability and minimal compilation overheads across different FPGA platforms. Following this approach, the authors of [11] propose a virtual reconfigurable architecture that hides the complexity of fine-grained reconfigurable resources. Similarly, Koch et al. [20] leverage overlays through custom instruction set extensions to utilize FPGA platforms in a more efficient and flexible manner. In [44], the authors extend ZUMA to provide bitstream compatibility between different devices, allowing the integration of the ReconOS programming model to facilitate the extension of software applications to reconfigurable hardware. Finally, recent works [27, 28] leverage overlays to provide multi-tenancy within the reconfigurable fabric and enable communication between software applications and the FPGA device through VirtIO. Nevertheless, overlay architectures often sacrifice the performance of hardware accelerators and introduce significant resource overheads. They also limit the ability to reconfigure devices with new HT, which is a key advantage of FPGAs over other processing units. For this reason, cloud environments encourage the use of native hardware accelerators, providing maximum performance and efficiency in accelerating workloads.
FPGA OSes. BORPH [5, 37] extends the Linux OS to manage HT like software processes, treating their compilation and execution in a similar manner; inter-process communication is facilitated through UNIX pipes. Similarly, ReconOS [24] and Hthreads [32] extend the multi-threaded programming model to FPGAs, providing support for inter-process communication and synchronization. FUSE [15] provides native OS support for transparently integrating HT into FPGA devices. LEAP [1, 13] introduces OS-managed, latency-insensitive channels to enable communication between different hardware modules, along with a partitioning algorithm to share the on-board memory. Finally, Wassi et al. [42] optimize the resource usage of FPGA devices through a real-time operating system and a multi-shape task manager that selects the proper version of a hardware task. While these studies explore the potential of providing native OS support for FPGAs, they do not address the challenges associated with implementing multi-tenancy on reconfigurable devices. Their main focus lies in treating HT similarly to software processes, allowing for inter-process communication and access to the OS and its resources.

8 Conclusion

This article presents an intra-fpga virtualization layer that provides architectural support for multi-tenancy on FPGA devices. This layer comprises a compact NoC to effectively multiplex hardware accelerators and offer a scalable solution for isolating tenants within an FPGA. Special TMMUs allow hardware tasks to operate on virtual addresses, virtualizing the FPGA address space and enhancing memory protection. Finally, we introduce a memory segmentation scheme that reinforces I/O isolation through flexible hardware-software support and improves the efficiency of DMA transactions. Our results demonstrate that our work does not introduce performance overheads on the hosted accelerators, while achieving up to 3.96x and 2.31x aggregate throughput using four HT in isolated and congested settings, respectively. Lastly, our solution provides evidence of high quality of service, as 7 out of 10 tasks exhibit performance close to their optimum when sharing resources with another task.

Acknowledgments

The authors would like to thank the AMD University Program for its generous donation of the Alveo FPGA boards used in this work.

References

[1]
Michael Adler, Kermin E. Fleming, Angshuman Parashar, Michael Pellauer, and Joel Emer. 2011. Leap scratchpads: Automatic memory and cache management for reconfigurable logic. In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays. Association for Computing Machinery, New York, NY, USA, 25–28.
[2]
Alibaba. 2023. Overview - Elastic Compute Service - Alibaba Cloud Documentation Center. Retrieved Feb 19, 2023 from https://www.alibabacloud.com/help/en/elastic-compute-service/latest/compute-optimized-type-family-with-fpga-overview
[3]
Amazon. 2023. Amazon EC2 F1 Instances. Retrieved Feb 19, 2023 from https://aws.amazon.com/ec2/instance-types/f1/
[4]
Osama G. Attia, Tyler Johnson, Kevin Townsend, Philip Jones, and Joseph Zambreno. 2014. CyGraph: A reconfigurable architecture for parallel breadth-first search. In Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops. 228–235.
[5]
Robert Brodersen, Artem Tkachenko, and Hayden Kwok-Hay So. 2006. A unified hardware/software runtime environment for FPGA-based reconfigurable computers using BORPH. In Proceedings of the 4th International Conference on Hardware/Software Codesign and System Synthesis. 259–264.
[6]
Stuart Byma, J. Gregory Steffan, Hadi Bannazadeh, Alberto Leon-Garcia, and Paul Chow. 2014. FPGAs in the cloud: Booting virtualized hardware accelerators with OpenStack. In Proceedings of the 2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines. 109–116.
[7]
Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, Daniel Lo, Todd Massengill, Kalin Ovtcharov, Michael Papamichael, Lisa Woods, Sitaram Lanka, Derek Chiou, and Doug Burger. 2016. A cloud-scale acceleration architecture. In Proceedings of the 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture. 1–13.
[8]
Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, and Kevin Skadron. 2009. Rodinia: A benchmark suite for heterogeneous computing. In Proceedings of the 2009 IEEE International Symposium on Workload Characterization. 44–54.
[9]
Fei Chen, Yi Shan, Yu Zhang, Yu Wang, Hubertus Franke, Xiaotao Chang, and Kun Wang. 2014. Enabling FPGAs in the cloud. In Proceedings of the 11th ACM Conference on Computing Frontiers. Association for Computing Machinery, New York, NY, USA, Article 3, 10 pages.
[10]
Eric Chung, Jeremy Fowers, Kalin Ovtcharov, Michael Papamichael, Adrian Caulfield, Todd Massengill, Ming Liu, Daniel Lo, Shlomi Alkalay, Michael Haselman, Maleen Abeydeera, Logan Adams, Hari Angepat, Christian Boehn, Derek Chiou, Oren Firestein, Alessandro Forin, Kang Su Gatlin, Mahdi Ghandi, Stephen Heil, Kyle Holohan, Ahmad El Husseini, Tamas Juhasz, Kara Kagi, Ratna K. Kovvuri, Sitaram Lanka, Friedel van Megen, Dima Mukhortov, Prerak Patel, Brandon Perez, Amanda Rapsang, Steven Reinhardt, Bita Rouhani, Adam Sapek, Raja Seera, Sangeetha Shekar, Balaji Sridharan, Gabriel Weisz, Lisa Woods, Phillip Yi Xiao, Dan Zhang, Ritchie Zhao, and Doug Burger. 2018. Serving DNNs in real time at datacenter scale with project brainwave. IEEE Micro 38, 2 (2018), 8–20.
[11]
James Coole and Greg Stitt. 2010. Intermediate fabrics: Virtual architectures for circuit portability and fast placement and routing. In Proceedings of the 2010 IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis. 13–22.
[12]
Suhaib A. Fahmy, Kizheppatt Vipin, and Shanker Shreejith. 2015. Virtualized FPGA accelerators for efficient cloud computing. In Proceedings of the 2015 IEEE 7th International Conference on Cloud Computing Technology and Science. 430–435.
[13]
Kermin Fleming, Hsin-Jung Yang, Michael Adler, and Joel Emer. 2014. The LEAP FPGA operating system. In Proceedings of the 2014 24th International Conference on Field Programmable Logic and Applications. 1–8.
[14]
InAccel. 2023. InAccel: Application Acceleration Made Simple. Retrieved Feb 19, 2023 from https://inaccel.com/
[15]
Aws Ismail and Lesley Shannon. 2011. FUSE: Front-end user framework for O/S abstraction of hardware accelerators. In Proceedings of the 2011 IEEE 19th Annual International Symposium on Field-Programmable Custom Computing Machines. 170–177.
[16]
Nachiket Kapre and Jan Gray. 2015. Hoplite: Building austere overlay NoCs for FPGAs. In Proceedings of the 2015 25th International Conference on Field Programmable Logic and Applications. 1–8.
[17]
Ahmed Khawaja, Joshua Landgraf, Rohith Prakash, Michael Wei, Eric Schkufza, and Christopher J. Rossbach. 2018. Sharing, protection, and compatibility for reconfigurable fabric with AmorphOS. In Proceedings of the 13th USENIX Conference on Operating Systems Design and Implementation. USENIX Association, USA, 107–127.
[18]
Robert Kirchgessner, Greg Stitt, Alan George, and Herman Lam. 2012. VirtualRC: A virtual FPGA platform for applications and tools portability. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays. Association for Computing Machinery, New York, NY, USA, 205–208.
[19]
Oliver Knodel, Patrick Lehmann, and Rainer G. Spallek. 2016. RC3E: Reconfigurable accelerators in data centres and their provision by adapted service models. In Proceedings of the 2016 IEEE 9th International Conference on Cloud Computing. 19–26.
[20]
Dirk Koch, Christian Beckhoff, and Guy G. F. Lemieux. 2013. An efficient FPGA overlay for portable custom instruction set extensions. In Proceedings of the 2013 23rd International Conference on Field Programmable Logic and Applications. 1–8.
[21]
Dario Korolija, Timothy Roscoe, and Gustavo Alonso. 2020. Do OS abstractions make sense on FPGAs? In Proceedings of the 14th USENIX Conference on Operating Systems Design and Implementation. USENIX Association, USA, Article 56, 20 pages.
[22]
Joshua Landgraf, Matthew Giordano, Esther Yoon, and Christopher J. Rossbach. 2023. Reconfigurable virtual memory for FPGA-driven I/O. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. Association for Computing Machinery, New York, NY, USA, 556–571.
[23]
David Lo, Liqun Cheng, Rama Govindaraju, Parthasarathy Ranganathan, and Christos Kozyrakis. 2015. Heracles: Improving resource efficiency at scale. In Proceedings of the International Symposium on Computer Architecture. 450–462.
[24]
Enno Lübbers and Marco Platzner. 2009. ReconOS: Multithreaded programming for reconfigurable computers. ACM Transactions on Embedded Computing Systems 9, 1, Article 8 (October 2009), 33 pages.
[25]
Jiacheng Ma, Gefei Zuo, Kevin Loughlin, Xiaohe Cheng, Yanqiang Liu, Abel Mulugeta Eneyew, Zhengwei Qi, and Baris Kasikci. 2020. A hypervisor for shared-memory FPGA platforms. In Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems. Association for Computing Machinery, New York, NY, USA, 827–844.
[26]
Meghna Mandava, Paul Reckamp, and Deming Chen. 2023. Nimblock: Scheduling for fine-grained FPGA sharing through virtualization. In Proceedings of the 50th Annual International Symposium on Computer Architecture. Association for Computing Machinery, New York, NY, USA, Article 60, 13 pages.
[27]
Joel Mbongue, Festus Hategekimana, Danielle Tchuinkou Kwadjo, David Andrews, and Christophe Bobda. 2018. FPGAVirt: A novel virtualization framework for FPGAs in the cloud. In Proceedings of the 2018 IEEE 11th International Conference on Cloud Computing. 862–865.
[28]
Joel Mandebi Mbongue, Danielle Tchuinkou Kwadjo, and Christophe Bobda. 2018. FLexiTASK: A flexible FPGA overlay for efficient multitasking. In Proceedings of the 2018 Great Lakes Symposium on VLSI.
[29]
Microsoft. 2023. Azure Machine Learning. Retrieved Feb 19, 2023 from https://learn.microsoft.com/en-us/azure/machine-learning/
[30]
Panagiotis Miliadis, Dimitris Theodoropoulos, Dionisios N. Pnevmatikatos, and Nectarios Koziris. 2022. VenOS: A virtualization framework for multiple tenant accommodation on reconfigurable platforms. In Applied Reconfigurable Computing. Architectures, Tools, and Applications: 18th International Symposium, ARC 2022, Virtual Event, September 19–20, 2022, Proceedings. Springer-Verlag, Berlin, 181–195.
[31]
Muhsen Owaida, David Sidler, Kaan Kara, and Gustavo Alonso. 2017. Centaur: A framework for hybrid CPU-FPGA databases. In Proceedings of the 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines. 211–218.
[32]
Wesley Peck, Erik Anderson, Jason Agron, Jim Stevens, Fabrice Baijot, and David Andrews. 2006. Hthreads: A computational model for reconfigurable devices. In Proceedings of the 2006 International Conference on Field Programmable Logic and Applications. 1–4.
[33]
Andrew Putnam, Adrian M. Caulfield, Eric S. Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth Gopal, Jan Gray, Michael Haselman, Scott Hauck, Stephen Heil, Amir Hormati, Joo-Young Kim, Sitaram Lanka, James Larus, Eric Peterson, Simon Pope, Aaron Smith, Jason Thong, Phillip Yi Xiao, and Doug Burger. 2015. A reconfigurable fabric for accelerating large-scale datacenter services. IEEE Micro 35, 3 (2015), 10–22.
[34]
Weikang Qiao, Jieqiong Du, Zhenman Fang, Michael Lo, Mau-Chung Frank Chang, and Jason Cong. 2018. High-throughput lossless compression on tightly coupled CPU-FPGA platforms. In Proceedings of the 2018 IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines. 37–44.
[35]
Kirk Saban. 2023. Xilinx Stacked Silicon Interconnect Technology Delivers Breakthrough FPGA Capacity, Bandwidth, and Power Efficiency. Technical Report. Xilinx.
[36]
David Sidler, Zsolt István, Muhsen Owaida, and Gustavo Alonso. 2017. Accelerating pattern matching queries in hybrid CPU-FPGA architectures. In Proceedings of the 2017 ACM International Conference on Management of Data. Association for Computing Machinery, New York, NY, USA, 403–415.
[37]
Hayden Kwok-Hay So and Robert W. Brodersen. 2007. BORPH: An Operating System for FPGA-Based Reconfigurable Computers. Ph.D. Dissertation. EECS Department, University of California, Berkeley. Retrieved from http://www2.eecs.berkeley.edu/Pubs/TechRpts/2007/EECS-2007-92.html
[38]
Naveen Suda, Vikas Chandra, Ganesh Dasika, Abinash Mohanty, Yufei Ma, Sarma Vrudhula, Jae-sun Seo, and Yu Cao. 2016. Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Association for Computing Machinery, New York, NY, USA, 16–25.
[39]
Naif Tarafdar, Nariman Eskandari, Thomas Lin, and Paul Chow. 2018. Designing for FPGAs in the cloud. IEEE Design & Test 35, 1 (2018), 23–29.
[40]
Charalampos Vatsolakis and Dionisios Pnevmatikatos. 2017. RACOS: Transparent access and virtualization of reconfigurable hardware accelerators. In Proceedings of the 2017 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation. 11–19.
[41]
Wei Wang, Miodrag Bolic, and Jonathan Parri. 2013. pvFPGA: Accessing an FPGA-based hardware accelerator in a paravirtualized environment. In Proceedings of the 2013 International Conference on Hardware/Software Codesign and System Synthesis. 1–9.
[42]
Guy Wassi, Mohamed El Amine Benkhelifa, Geoff Lawday, François Verdier, and Samuel Garcia. 2014. Multi-shape tasks scheduling for online multitasking on FPGAs. In Proceedings of the 2014 9th International Symposium on Reconfigurable and Communication-Centric Systems-on-Chip. 1–7.
[43]
Jagath Weerasinghe, Francois Abel, Christoph Hagleitner, and Andreas Herkersdorf. 2015. Enabling FPGAs in hyperscale data centers. In Proceedings of the 2015 IEEE 12th Intl Conf on Ubiquitous Intelligence and Computing and 2015 IEEE 12th Intl Conf on Autonomic and Trusted Computing and 2015 IEEE 15th Intl Conf on Scalable Computing and Communications and Its Associated Workshops. 1078–1086.
[44]
Tobias Wiersema, Arne Bockhorn, and Marco Platzner. 2014. Embedding FPGA overlays into configurable Systems-on-Chip: ReconOS meets ZUMA. In Proceedings of the 2014 International Conference on ReConFigurable Computing and FPGAs. 1–6.
[45]
Xilinx. 2023. Getting Started with Alveo Data Center Accelerator Cards User Guide UG1301. Retrieved Jan 26, 2023 from https://docs.xilinx.com/r/en-US/ug1301-getting-started-guide-alveo-accelerator-cards
[46]
[47]
Xilinx. 2023. Vitis Quantitative Finance Library. Retrieved Feb 19, 2023 from https://github.com/Xilinx/Vitis_Libraries/tree/main/quantitative_finance
[48]
Xilinx. 2023. Vitis Security Library. Retrieved Feb 19, 2023 from https://github.com/Xilinx/Vitis_Libraries/tree/main/security
[49]
Xilinx. 2023. Vitis Software Platform. Retrieved Jul 27, 2023 from https://www.xilinx.com/products/design-tools/vitis/vitis-platform.html
[50]
[51]
Yue Zha and Jing Li. 2020. Virtualizing FPGAs in the cloud. In Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '20). Association for Computing Machinery, New York, NY, USA, 845–858.
[52]
Yue Zha and Jing Li. 2021. Hetero-ViTAL: A virtualization stack for heterogeneous FPGA clusters. In Proceedings of the 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture. 470–483.
[53]
Jiansong Zhang, Yongqiang Xiong, Ningyi Xu, Ran Shu, Bojie Li, Peng Cheng, Guo Chen, and Thomas Moscibroda. 2017. The Feniks FPGA operating system for cloud computing. In Proceedings of the 8th Asia-Pacific Workshop on Systems. Association for Computing Machinery, New York, NY, USA, Article 22, 7 pages.
[54]
Qian Zhao, Motoki Amagasaki, Masahiro Iida, Morihiro Kuga, and Toshinori Sueyoshi. 2018. Enabling FPGA-as-a-service in the cloud with hCODE Platform. IEICE Transactions on Information and Systems E101.D, 2 (2018), 335–343.
[55]
Shijie Zhou and Viktor K. Prasanna. 2017. Accelerating graph analytics on CPU-FPGA heterogeneous platform. In Proceedings of the 2017 29th International Symposium on Computer Architecture and High Performance Computing. 137–144.
[56]
Yuan Zhou, Udit Gupta, Steve Dai, Ritchie Zhao, Nitish Srivastava, Hanchen Jin, Joseph Featherston, Yi-Hsiang Lai, Gai Liu, Gustavo Angarita Velasquez, Wenping Wang, and Zhiru Zhang. 2018. Rosetta: A realistic high-level synthesis benchmark suite for software-programmable FPGAs. In Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (February 2018).
