1 Introduction
Field Programmable Gate Arrays (FPGAs) have become a popular choice for accelerating diverse workloads, including neural networks [10, 38], data analytics [4, 55], databases [31, 36], quantitative finance [47], linear algebra [46, 50], security [48], and compression [34]. Their runtime reconfigurability and high energy efficiency make them an effective alternative to ASICs and GPUs. This has led major cloud providers such as Amazon [3] and Alibaba [2] to incorporate FPGAs into their infrastructure and offer on-demand acceleration to their customers.
Modern FPGAs [45] have witnessed a substantial increase in available resources owing to two key factors: (a) the integration of multiple distinct dies into a unified substrate, and (b) the inclusion of multiple memory controllers to enhance parallel access to external FPGA memory. However, hardware designs fall short of harnessing this potential. Large monolithic kernels suffer from placement and routing constraints, especially when spanning multiple dies, while smaller hardware accelerators underutilize the available memory and reconfigurable resources. As a solution, cloud providers aim to optimize their return on investment by distributing FPGAs among multiple customers, similar to the virtualization and sharing of CPUs and memory resources.
Enabling FPGA resource sharing among multiple tenants requires the concurrent execution of hardware tasks (HT) to accelerate their respective workloads. However, this presents challenges in establishing the necessary architectural support for a multi-tenant environment. First, spatial multiplexing should not compromise the isolation or the native performance of accelerators (challenge 1), thus providing the illusion of exclusivity to tenants. The illusion of exclusivity involves two key aspects: performance isolation covers the interference effects among hardware accelerators caused by shared interconnects, while data isolation concerns the ability of hardware accelerators to compromise sensitive information or access privileged interfaces. Second, existing environments allow HT to operate without supervision, granting them privileged access to I/O interfaces and FPGA resources (challenge 2). Third, traditional interconnect methods fall short on modern FPGAs, as they lack scalability and flexibility; they hinder the ability to attain high aggregate throughput and to provide sufficient simultaneous acceleration to tenants (challenge 3). Finally, efficient allocation schemes are essential to distribute FPGA resources across tenants (challenge 4), as poor allocations may limit the inherent performance of accelerators or compromise the isolation of tenants.
Previous research efforts have yielded valuable insights on enabling multi-tenancy in FPGAs. Nonetheless, they encounter several architectural limitations and only partially address the challenges associated with FPGA sharing. For instance, overlay techniques [11, 27, 28] lack the reconfigurability to map new hardware accelerators onto FPGA devices and limit their native performance. Some works [6, 9, 17, 30] propose static allocation of memory and reconfigurable resources or partitioning of I/O and memory interfaces across HT. These approaches limit performance and the design flexibility of hardware developers. Other works [6, 9, 51, 52] lack virtualization support to effectively and securely share external memory and I/O interfaces across HT, while others [17, 21] propose a static multiplexing of I/O and memory interfaces, which does not effectively address challenge 3. In addition, allocation schemes on FPGAs are still limited and are only effective either on shared-memory [25] or cache-coherent [21] FPGA platforms, where they are strongly coupled with the host CPU. Therefore, as Table 1 outlines, the challenges related to multi-tenancy and efficient architectural support for FPGA sharing have not been fully addressed or remain unresolved.
In this article, we present our full-stack solution to effectively provide architectural support for virtualizing and sharing FPGA resources to simultaneously accelerate workloads from distinct tenants. Our work inserts an intra-FPGA virtualization layer to allow efficient sharing of external interfaces among hardware accelerators and provide mechanisms for data isolation and protection of I/O interfaces, thereby offering the illusion of operating in a single-tenant environment (challenge 1). This layer relies on a network-on-chip (NoC) architecture, which offers a flexible and scalable design to share FPGA resources and interfaces across HT. NoCs overcome the limitations of static configurations from previous works, optimizing the aggregate throughput of an FPGA and enhancing its utilization by multiple tenants (challenge 3). Our work deviates from creating an overlay architecture with preconfigured processing elements. Instead, we harness NoCs as an interconnect for efficient sharing of I/O interfaces and an effective layer to accommodate tenants.
Each hardware accelerator is coupled with a task memory management unit (TMMU) that virtualizes the external FPGA memories and enables tasks to operate using virtual addresses. This mechanism provides a flexible way to utilize virtual FPGAs (vFPGAs), much like virtual machines on conventional processing units. Each hardware task within a vFPGA operates within its own logical address space, while the TMMU translates virtual addresses into physical ones and prevents any unauthorized access to I/O interfaces or privileged address regions. Additionally, virtual addresses conceal the physical data location, introducing a higher level of abstraction and memory protection. By combining TMMUs with the intra-FPGA virtualization layer, hardware accelerators operate in a non-privileged execution mode (challenge 2) without compromising their inherent performance. Our framework allows HT to operate as distinct processes, enhancing FPGA utilization and aggregate throughput.
Finally, to effectively manage the FPGA address space (challenge 4), we implement a custom memory segmentation scheme. Previous work [21, 25] has shown how internal fragmentation of pages can impact hardware task performance, mainly due to inefficient DMA transactions. This limitation is effectively addressed through segments, as tasks execute data transactions with efficiency similar to single-tenant environments. Our segmentation scheme arranges data either sequentially within FPGA memory or distributes them across a few segments, resembling the concept of huge pages in operating systems. Memory segments effectively isolate the address space across tenants and facilitate data isolation through hardware/software support. Concurrently, TMMUs enable HT to perceive the address space as sequential, rendering the segmentation scheme transparent to users.
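To make the scheme concrete, the following is a minimal software-side sketch of such an allocator, assuming fixed-size 256 MiB segments tracked by a free map; all names (SegmentAllocator, Allocation) and the segment size are illustrative choices of ours, not the implemented design.

    #include <cstdint>
    #include <optional>
    #include <vector>

    // Hypothetical software-side model of the segmentation scheme:
    // allocations are served from fixed-size physical segments
    // (analogous to huge pages), preferring one contiguous run.
    struct Allocation {
        std::vector<uint64_t> segment_bases; // physical base of each segment used
        uint64_t virt_base;                  // contiguous virtual base seen by the HT
    };

    class SegmentAllocator {
        static constexpr uint64_t kSegBytes = 256ULL << 20; // assumed 256 MiB segments
        std::vector<bool> free_;   // free map, one flag per physical segment
        uint64_t next_virt_ = 0;   // monotonically growing virtual space

    public:
        explicit SegmentAllocator(size_t num_segments) : free_(num_segments, true) {}

        std::optional<Allocation> alloc(uint64_t bytes) {
            size_t need = (bytes + kSegBytes - 1) / kSegBytes;
            Allocation a{{}, next_virt_};
            // First pass: look for `need` adjacent free segments (sequential data).
            for (size_t i = 0; i + need <= free_.size(); ++i) {
                size_t run = 0;
                while (run < need && free_[i + run]) ++run;
                if (run == need) {
                    for (size_t j = 0; j < need; ++j) {
                        free_[i + j] = false;
                        a.segment_bases.push_back((i + j) * kSegBytes);
                    }
                    next_virt_ += need * kSegBytes;
                    return a;
                }
            }
            // Fallback: gather scattered segments; the TMMU hides the scattering,
            // so the virtual range stays contiguous.
            std::vector<size_t> picks;
            for (size_t i = 0; i < free_.size() && picks.size() < need; ++i)
                if (free_[i]) picks.push_back(i);
            if (picks.size() < need) return std::nullopt; // out of FPGA memory
            for (size_t i : picks) {
                free_[i] = false;
                a.segment_bases.push_back(i * kSegBytes);
            }
            next_virt_ += need * kSegBytes;
            return a;
        }
    };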
In summary, our work makes the following contributions:
— We present a novel architecture (Section 2) for virtualizing and sharing external I/O interfaces across hardware accelerators. Our approach inserts an intra-FPGA virtualization layer (Section 3), enabling scalable and flexible sharing of FPGA resources through a NoC architecture. This setup allows us to establish effective isolation mechanisms, provide a high quality of service, and offer the illusion of exclusivity to tenants.
— We propose the adoption of memory management units (Section 3.3) to virtualize external FPGA memories, allowing hardware accelerators to operate using virtual addresses. The primary aim is to confine accelerators to user-level execution, thereby restricting their access to unauthorized I/O interfaces or privileged address spaces. These units enhance isolation between tenants and I/O protection, without compromising the native performance of hardware accelerators.
— We propose a novel memory segmentation scheme (Section 4.2) to optimize FPGA address space management. This scheme ensures hardware accelerators can maintain memory transaction efficiency, while isolating the address space for each tenant. Data isolation is facilitated through hardware/software support.
— Results indicate that accelerators maintain their inherent performance from a dedicated, non-virtualized setting (Section 5.3). Our architecture achieves an aggregate throughput of up to 3.96x in isolated and up to 2.31x in highly congested conditions, employing four HT (Section 5.4). It also provides a high quality of service when tasks share FPGA resources, achieving performance levels of \(\sim\)0.95x of their native performance in an isolated environment (Section 5.5).
2 Foundations
Our framework addresses the challenges of a multi-tenant environment by employing a SW/HW co-design strategy. This approach couples a conventional CPU with an FPGA over a peripheral bus such as PCIe, as shown in Figure 1. This naturally divides any design into hardware components operating within the FPGA device, referred to as the Hardware Stack, and software components executing on the host machine as part of the operating system or support libraries, denoted as the Software Stack. The dynamic configuration of the FPGA device with new HT further divides the Hardware Stack into two distinct categories: (a) the static region, configured at FPGA device boot time, and (b) vFPGAs, configurable at runtime with new hardware processes. This classification provides clarity and simplifies FPGA datacenter deployments.
2.1 Static Region
The static region of the hardware stack provides the necessary architectural support to enable multi-tenancy on FPGAs and facilitate parallel acceleration of workloads. In our work, the static region includes the Hardware Shell and the Intermediate Hardware Layer. The Hardware Shell integrates the logic and controllers to enable interaction with the interfaces of an FPGA platform, such as memories, Network I/Os, or other external interfaces, according to device specifications. Additionally, it features an internal configuration module (e.g., ICAP) for partially reconfiguring vFPGAs with new HT. It also contains the necessary functionality for communication with the host machine. In a multi-tenant setting, the hardware shell typically incorporates all FPGA interfaces shared by HT during their execution.
Within the static region, our work introduces an Intermediate Hardware Layer to enable intra-FPGA virtualization. This layer offers mechanisms for isolating HT and sharing FPGA resources without sacrificing the inherent performance of accelerators (challenge 1). A compact NoC interconnects vFPGAs and external interfaces. The NoC provides a fair, flexible, and scalable architecture for sharing the interfaces exposed by the hardware shell among all active HT, offering simultaneous on-demand acceleration (challenge 3). Each vFPGA is linked through its corresponding node in the NoC, allowing our framework to oversee its execution and regulate unsupervised access to FPGA interfaces and resources (challenge 2). Unlike conventional FPGA environments, where HT establish direct connections with hardware components and acquire privileged-level execution, our work enforces isolation mechanisms through the NoC and its nodes, preventing direct and unauthorized communication with FPGA resources.
2.2 Reconfigurable Region
The remaining available space within the FPGA constitutes the reconfigurable area. Modern FPGAs allow the selective reconfiguration of distinct regions within the fabric at any time. In conventional FPGA systems, this region is typically dedicated to a single hardware task and is reprogrammed only when a new application is initiated on the host machine. In multi-tenant environments, these regions, referred to as vFPGAs, allow the parallel acceleration of workloads, enhancing FPGA resource utilization. vFPGAs serve as the fundamental mechanism for temporal and spatial multiplexing of FPGA resources. They function as isolated regions within the reconfigurable fabric, assigned to tenants for on-demand acceleration. Each hardware task operates autonomously within its designated vFPGA, preventing any functional interference from other HT. All vFPGAs are interconnected via their respective nodes to the NoC, enabling HT to leverage and share external interfaces.
2.3 Software Components
The software stack is aware of the heterogeneous environment, offering suitable abstractions for user applications on the host machine to interact with the underlying FPGA device. It assumes control over FPGA resource management, dynamically allocating and managing resources as needed, while also monitoring application accesses to the Hardware Stack. Although certain aspects could potentially be implemented in hardware, delegating these responsibilities to software enhances flexibility and frees valuable FPGA space for tenant utilization. The software stack includes the System Runtime Manager and the memory segmentation scheme, which distributes the FPGA address space and maintains all virtual/physical mappings, allowing tenants to operate using virtual addresses.
The runtime manager is built upon the PCIe driver and controls the distribution of FPGA resources across tenants. vFPGAs are multiplexed and configured using a slot-based strategy, preventing their reassignment to other tenants until the hosted hardware task completes. Meanwhile, the memory segmentation scheme handles the allocation of the FPGA address space (challenge 4). This scheme facilitates data isolation through flexible software/hardware support, simultaneously providing a virtual perspective of the FPGA address space for both software applications and HT. Applications interact with the FPGA device through software routines, ensuring a seamless and familiar deployment within our multi-tenant setting. The routines handle the allocation and configuration of vFPGA slots, the allocation and release of FPGA address regions, data transactions between the host machine and the FPGA device, and the initiation of HT. Overall, they provide a comprehensive software programming flow to harness the efficiency of FPGAs.
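As an illustration, the sketch below models this programming flow in host-side C++; every routine name and type is a hypothetical stand-in for the routines described above, stubbed here so the example is self-contained.

    #include <cstdint>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    // Hypothetical host-side flow; all function names below are
    // illustrative stand-ins for the runtime-manager routines, not
    // a documented API.
    struct VfpgaHandle { int slot = 0; };
    struct FpgaBuffer  { std::vector<uint8_t> shadow; }; // stand-in for device memory

    VfpgaHandle vfpga_acquire(const char* bit) { std::printf("configure %s\n", bit); return {}; }
    FpgaBuffer  fpga_malloc(size_t n)          { return {std::vector<uint8_t>(n)}; }
    void fpga_memcpy_to(FpgaBuffer& d, const void* s, size_t n)   { std::memcpy(d.shadow.data(), s, n); }
    void fpga_memcpy_from(void* d, const FpgaBuffer& s, size_t n) { std::memcpy(d, s.shadow.data(), n); }
    void vfpga_run(VfpgaHandle, FpgaBuffer& in, FpgaBuffer& out)  { out.shadow = in.shadow; }
    void vfpga_release(VfpgaHandle) {}

    int main() {
        float in[4] = {1, 2, 3, 4}, out[4] = {};
        VfpgaHandle h   = vfpga_acquire("kmeans.bit");  // slot is ours until completion
        FpgaBuffer din  = fpga_malloc(sizeof in);       // FPGA address region
        FpgaBuffer dout = fpga_malloc(sizeof out);
        fpga_memcpy_to(din, in, sizeof in);             // host-to-FPGA transfer
        vfpga_run(h, din, dout);                        // HT sees only virtual addresses
        fpga_memcpy_from(out, dout, sizeof out);        // FPGA-to-host transfer
        vfpga_release(h);
        return 0;
    }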
3 Hardware Stack
Our framework considers FPGA platforms as specialized devices for accelerating software workloads. Each FPGA is equipped with a dedicated memory/network stack and connects to the host machine via a PCIe interface. Figure 2 shows the architecture of the Hardware Stack, dividing the device into three parts, each with a distinct role in our system: (a) the Hardware Shell includes the built-in logic to access all I/O and memory interfaces; (b) the Intermediate Hardware Layer facilitates intra-FPGA virtualization, allowing the sharing of FPGA resources across HT; and (c) the vFPGAs serve as hosts for the HT, delivering on-demand acceleration.
The Hardware Shell is essential for managing, controlling, and debugging the platform from the Software Stack. It enables access to I/O interfaces (e.g., DDR memories and the external network) and interaction with HT. The shell exposes a memory-mapped view of the Hardware Stack, effectively dividing the FPGA address space into privileged and user address regions. The user address region includes the external memories for storing the datasets and output results of HT. HT handle on-chip memories (e.g., BRAM or URAM) identically to single-tenant environments, as these resources are seamlessly integrated into vFPGAs. The privileged address space comprises registers for controlling and monitoring the HT, and memory modules for ensuring their normal operation. Finally, the Hardware Shell features a High-Bandwidth ICAP (HBICAP) Controller for on-demand partial reconfiguration of vFPGAs with the desired bitstreams.
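For intuition, the privileged/user split can be pictured as a memory map like the following; all base addresses and sizes here are illustrative assumptions of this sketch, not the shell's actual layout.

    #include <cstdint>

    // Illustrative memory-map split; every constant is an assumption
    // for the sketch, not the shell's real layout.
    namespace addr_map {
        // Privileged region: control/status registers and support modules,
        // reachable only from the Software Stack, never from HT.
        constexpr uint64_t kCtrlRegsBase  = 0x0000'0000;
        constexpr uint64_t kCtrlRegsBytes = 64ULL << 10;    // per-vFPGA CSRs
        // User region: external DDR holding HT datasets and results.
        constexpr uint64_t kUserDdrBase   = 0x1'0000'0000;
        constexpr uint64_t kUserDdrBytes  = 64ULL << 30;    // e.g., 4 x 16 GiB channels

        constexpr bool is_privileged(uint64_t a) {
            return a < kUserDdrBase;  // everything below user DDR is privileged
        }
    }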
Our intra-FPGA virtualization approach centers around the Intermediate Hardware Layer, which interconnects vFPGAs and external interfaces to efficiently share FPGA resources among tenants. It abstracts the underlying architecture, establishes isolation mechanisms, and maintains native accelerator performance, creating the illusion of exclusivity for tenants. The architecture includes a compact NoC to spatially multiplex all tenants, providing flexible, scalable, and fair sharing of external I/O interfaces. Memory Nodes handle the data transactions with memory controllers, serving all incoming requests from HT. Gatekeepers connect HT to the NoC, providing them access to external interfaces for data transactions. Together with the TMMUs, they form an isolation layer that provides data isolation and restricts unsupervised access of HT to FPGA interfaces and resources. The Hardware Task Wrapper simplifies monitoring by standardizing the porting process and connectivity of HT in our environment, abstracting and hiding architectural complexities.
The remaining resources are organized as vFPGAs, well-isolated regions for mapping and configuring HT for on-demand acceleration. vFPGAs offer a homogeneous view of the device, overcoming challenges associated with managing and partitioning the heterogeneous pool of resources (e.g., BRAMs, FFs, DSPs, and LUTs). This approach also favors smaller accelerators over large monolithic kernels, whose performance degrades when routing extends across die boundaries. vFPGAs are an efficient alternative, facilitating parallel acceleration of multiple workloads from different tenants in a single FPGA device and enhancing resource and memory utilization. They do not share reconfigurable resources or wire connections, enhancing functional isolation and reducing the risk of functional interference within our system. The Software Stack ensures the exclusive allocation of vFPGAs to tenants, guaranteeing that HT run to completion before a vFPGA is reconfigured. Finally, direct communication between vFPGAs or with external interfaces is not permitted; instead, all interactions occur through the Intermediate Hardware Layer and are closely monitored.
3.1 Gatekeeper
Gatekeepers act as NoC access points for vFPGAs. They handle all memory requests and data transactions involving external interfaces, establishing isolation mechanisms to prevent direct communication with FPGA resources. In conventional FPGAs, accelerators establish direct connections with hardware components and controllers, allowing unrestricted access to I/O and memory interfaces and granting them privileged-level execution. However, privileged execution poses a major challenge in multi-tenant environments. Gatekeepers add a layer of oversight and control, forcing all accesses to external interfaces to be routed through the node and effectively downgrading the execution level of a hardware task to user level. The monitoring process is simplified through the Hardware Task Wrapper, which serves as a sandboxing mechanism for standardizing connectivity with the underlying system.
The Gatekeeper manages and routes incoming and outgoing messages between the NoC and the vFPGA. Its router connects the hosted hardware task with the wider system, enabling seamless communication with the Intermediate Hardware Layer. The router's design is closely tied to the chosen network topology, and a detailed analysis falls outside the scope of this work. Together with the TMMU, the Gatekeeper forms an isolation layer that prevents unauthorized access to address spaces and external modules during hardware task execution. It safeguards against malicious memory requests or erroneous data, preventing their transmission toward the NoC or the hardware task. Both transmitted and received data are temporarily buffered and converted into raw or network data, depending on their direction within the node. The TMMU verifies whether a request targets an authorized address region or external module. If flagged as invalid, all buffers are configured to prohibit the data flow to either the vFPGA or the router, discarding all data in the process. As such, Gatekeepers eliminate any direct access to external interfaces and FPGA resources, establishing mechanisms to oversee all communication carried out within the Hardware Stack.
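The admission logic can be summarized by the following software model; the types, names, and the single-window check are simplifications of ours rather than the actual node implementation.

    #include <cstdint>

    // Minimal software model of the Gatekeeper admission path; all
    // types and names are illustrative, not the actual node interface.
    enum class Op { Read, Write };

    struct MemRequest { uint64_t virt_addr; uint32_t bytes; Op op; };

    // Stand-in for the TMMU check (sketched in Section 3.3): a single
    // allowed window replaces the real segment lookup.
    struct TmmuStub {
        uint64_t base, size;
        bool check(const MemRequest& r, uint64_t& phys) const {
            if (r.virt_addr < base || r.virt_addr + r.bytes > base + size)
                return false;                            // outside allocated region
            phys = 0x8000'0000 + (r.virt_addr - base);   // assumed physical base
            return true;
        }
    };

    // Returns true if the request may proceed toward the NoC; on failure
    // the Gatekeeper drains its buffers so neither the NoC nor the
    // hardware task ever observes the offending transaction.
    bool gatekeeper_admit(const TmmuStub& tmmu, const MemRequest& r, uint64_t& phys) {
        if (!tmmu.check(r, phys)) {
            // In hardware: buffers are flagged to discard in-flight data
            // in both directions (toward the router and toward the vFPGA).
            return false;
        }
        return true;
    }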
3.2 Hardware Task Wrapper
The Hardware Task Wrapper provides a collection of interfaces to enhance the portability and configuration of HT across vFPGAs, connecting them with the rest of the Hardware Stack. It is essential for enabling partial reconfiguration, which necessitates a standardized interface due to the locking of boundary signals on the FPGA fabric. Code portability between FPGAs, or even between revisions of the same model, is rare, forcing developers to constantly adapt their accelerators to new hardware shells or development tools. Our hardware task wrapper addresses this issue by raising the portability of FPGA accelerators to the language level. It allows any hardware task, given sufficient resources, to be compiled and configured on any available vFPGA, and even facilitates porting between different device models using our Hardware Stack. This renders HT both FPGA and vFPGA agnostic, improving portability and flexibility and offering a higher level of abstraction to tenants.
The Hardware Task Wrapper establishes a standard I/O interface between HT and Gatekeepers. This allows developers to focus on the computational aspect of their hardware task, leaving the management, communication and data transactions to the Intermediate Hardware Layer. The Wrapper abstracts the FPGA fabric, enabling HT to function independently of hardware specifics and data location. It provides streaming interfaces for I/O, an AXI-Lite interface for control and monitoring, and clock/reset signals. The streaming interfaces support a data-driven approach, reducing handshake signals and facilitating compatibility with pipeline and dataflow primitives for accelerating workloads. The wrapper allows integration of HT in both HDL and HLS languages, offering a familiar environment for developing and porting FPGA applications.
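For intuition, a task conforming to the wrapper's interfaces might look as follows in HLS C++; the 512-bit word width, port names, and the trivial computation are assumptions of this sketch, not the wrapper's specification.

    // Illustrative Vitis HLS task matching the wrapper's standard
    // interface; widths, names, and the computation are assumptions.
    #include "hls_stream.h"
    #include "ap_int.h"

    void hardware_task(hls::stream<ap_uint<512> >& in,   // data-driven input stream
                       hls::stream<ap_uint<512> >& out,  // data-driven output stream
                       ap_uint<32> config) {             // AXI-Lite parameter
    #pragma HLS INTERFACE axis      port=in
    #pragma HLS INTERFACE axis      port=out
    #pragma HLS INTERFACE s_axilite port=config
    #pragma HLS INTERFACE s_axilite port=return          // control/monitor registers
        // The task is oblivious to addresses, DMA, and data placement:
        // it simply consumes and produces stream words.
        ap_uint<512> word = in.read();
        out.write(word + config);
    }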
3.3 Task Memory Management Unit
In heterogeneous computing, external memories commonly store vast datasets that HT access during processing. Multi-tenancy raises significant concerns related to data and performance isolation. Unlike software applications, which operate using virtual addresses, HT access external FPGA memories through physical addresses, posing potential security risks. Additionally, conventional FPGA environments allow HT to access any address region or I/O interface without supervision. These vulnerabilities could grant tenants unauthorized access to privileged address spaces, disrupt external controllers, or hinder access to I/O interfaces for other tenants.
To address these concerns, we introduce the TMMU, a dedicated module integrated with each vFPGA. It virtualizes the external FPGA address space while overseeing and restricting the accesses of HT to external FPGA interfaces. As shown in Figure 3, the TMMU ensures that HT access only allocated address regions, while allowing them to manage their data within a contiguous address space, similar to software applications. Operating in the virtual address domain conceals the physical data location, enforcing memory protection and data isolation among tenants. Two modules achieve the virtualization of external memories: (a) the Task Addressable Memory, which verifies whether memory transactions fall within the allocated address regions, and (b) the Virtual/Physical Segment Translator, which converts virtual segments into physical ones and generates the necessary DMA requests for the memory nodes.
Incoming memory requests include the virtual address and its offset, the request size, and the memory operation. Initially, the Task Addressable Memory evaluates whether the transaction targets an address region associated with the executing program and falls within the corresponding virtual size. This module is built upon the principles of Content Addressable Memory (CAM), a specialized memory architecture for rapid search applications. Its contents are set by the Software Stack during hardware task initialization and are compared against each memory request. If a match is found, the module returns the associated virtual size and verifies whether the request remains within the allocated space, then forwards it for translation. Otherwise, it alerts the Gatekeeper to discard any incoming or outgoing data, preventing erroneous data from reaching the NoC or sensitive information from reaching a malicious tenant.
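A software model of this lookup is sketched below; the entry count and field names are assumptions, and the sequential loop stands in for the parallel compare performed by the CAM in hardware.

    #include <array>
    #include <cstdint>

    // Software model of the Task Addressable Memory: a small fully
    // associative table of allocated regions; in hardware the entries
    // are matched in parallel.
    struct TamEntry {
        bool     valid = false;
        uint64_t virt_base = 0;   // start of an allocated virtual region
        uint64_t virt_size = 0;   // bytes allocated to this region
    };

    struct TaskAddressableMemory {
        std::array<TamEntry, 8> entries; // populated at task initialization

        // Returns the matching entry index, or -1 to signal the Gatekeeper
        // to discard the transaction.
        int lookup(uint64_t virt_addr, uint64_t bytes) const {
            for (int i = 0; i < (int)entries.size(); ++i) { // parallel compare in HW
                const TamEntry& e = entries[i];
                if (e.valid && virt_addr >= e.virt_base &&
                    virt_addr + bytes <= e.virt_base + e.virt_size)
                    return i;  // request fits inside the allocated region
            }
            return -1;
        }
    };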
Figure 3 also demonstrates how the TMMU enables the seamless operation of HT within a contiguous address space, even when data are distributed across different segments. The Virtual/Physical Segment Translator creates this illusion and handles the conversion of virtual addresses into physical ones. Moreover, the TMMU enables HT to operate unaware of our memory allocation policy, discussed in Section 4.2. The translation process resembles the operation of page tables in operating systems: the virtual address points to an entry in the segment table, which holds the physical address of the associated segment within FPGA memory. Since data may span multiple physical segments, a virtual segment may point to several physical addresses. The module translates incoming memory requests from HT and generates the corresponding DMA transactions, which are routed toward memory nodes. When the requested data are stored across multiple segments, separate DMA transactions are generated, as memory nodes must configure the DMA engine for each targeted segment to retrieve all requested data.
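The following sketch models the translation step, assuming fixed-size segments and a flat segment table; the names and the 256 MiB segment size are illustrative.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Software model of the Virtual/Physical Segment Translator: a request
    // spanning several physical segments is split into one DMA descriptor
    // per segment. Segment size and table shape are assumptions.
    constexpr uint64_t kSegBytes = 256ULL << 20;

    struct DmaDescriptor { uint64_t phys_addr; uint64_t bytes; };

    // segment_table[i] holds the physical base of the i-th virtual segment.
    std::vector<DmaDescriptor> translate(uint64_t virt_addr, uint64_t bytes,
                                         const std::vector<uint64_t>& segment_table) {
        std::vector<DmaDescriptor> dmas;
        while (bytes > 0) {
            uint64_t seg_idx = virt_addr / kSegBytes;   // segment-table entry
            uint64_t seg_off = virt_addr % kSegBytes;   // offset inside the segment
            uint64_t chunk   = std::min(bytes, kSegBytes - seg_off);
            dmas.push_back({segment_table[seg_idx] + seg_off, chunk});
            virt_addr += chunk;                         // advance virtual cursor
            bytes     -= chunk;
        }
        return dmas;  // one DMA configuration per touched segment
    }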
3.4 Memory Node
Memory Nodes manage memory requests originating from HT and facilitate data routing between the NoC and the memory controllers. Each node is equipped with a DMA Engine that handles all memory operations with the external DDR channels. By configuring a DMA engine, a node orchestrates and supervises the entire transaction process with the corresponding memory channel. This design allows memory transactions to proceed independently on each node, enabling concurrent access to all available channels. Memory nodes decouple compute processing from memory operations, allowing HT to continue processing uninterrupted unless they are waiting for a memory request to be served by an external memory controller.
Incoming packets are classified either as Memory Request Messages or Data Messages. Memory requests are routed to the DMA Hardware Controller, while data messages are buffered, converted, and directed to the DMA engine. Outgoing data follows the reverse process. The DMA Hardware Controller is a lightweight yet highly efficient module within the Hardware Stack that enables DMA transactions independently of the host CPU. It allows HT within vFPGAs to indirectly initiate memory transactions with the FPGA memories, bypassing the need for software or programmer involvement from the host machine. The controller handles all memory operations within the Hardware Stack, while the TMMU maintains data coherency between tenants. Any transactions extending beyond a tenant's allocated FPGA address space are discarded, with the TMMU overseeing the generation of DMA transactions based on the allocated physical segments. This approach reduces programming complexity, with the TMMU abstracting all hardware details related to DMA transactions from the HT. The controller oversees the DMA engine, monitoring its status and interrupts, and configures transaction details such as the starting address and length. Memory nodes thus prevent direct access by HT to external DDR memories, preventing them from disrupting the normal operation of memory controllers.
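An abstract model of the controller's request loop is shown below; the DmaEngine interface is an assumption of this sketch, not the register map of an actual DMA IP.

    #include <cstdint>
    #include <queue>

    // Abstract model of a Memory Node serving translated requests.
    struct DmaDescriptor { uint64_t phys_addr; uint64_t bytes; bool write; };

    struct DmaEngine {
        void configure(const DmaDescriptor& d) { (void)d; /* program addr/len regs */ }
        void wait_done() { /* poll status register or take an interrupt */ }
    };

    // The DMA Hardware Controller drains request messages and drives the
    // engine without host-CPU involvement; one outstanding transfer per
    // memory channel in this simplified model.
    void serve(std::queue<DmaDescriptor>& requests, DmaEngine& engine) {
        while (!requests.empty()) {
            engine.configure(requests.front()); // start the transfer
            engine.wait_done();                 // completion before the next request
            requests.pop();
        }
    }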
5 Experimental Evaluation
5.1 Experimental Setup
Our experimental work evaluates the feasibility of incorporating multi-tenancy into reconfigurable platforms for improved performance and efficiency. We assess this by employing a collection of HT from the widely used Rodinia [8] and Rosetta [56] benchmark suites for heterogeneous computing. These benchmarks cover both compute- and memory-intensive applications, representing real-world scenarios. The Intermediate Hardware Layer consists of a configurable number of vFPGA slots and external DDR channels, allowing us to study the impact of FPGA sharing and memory/NoC congestion on hardware task performance. For this reason, we implement four widely used network topologies (Crossbar, Ring, DoubleRing, and Torus) and analyze the effect of memory node placement on aggregate throughput. The Dancehall configuration (Figure 6(a)) groups all memory nodes together, while the Interleaved configuration (Figure 6(b)) places memory nodes adjacent to Gatekeepers. Crossbar is omitted from this analysis due to its all-to-all connection architecture. Our Hardware Stack is implemented on an Alveo U250 Data Center accelerator card connected to a host computer via a PCIe x16 interface. It is developed using the Vitis HLS 2020.2 and Vivado 2020.2 environments and operates at 250 MHz. The Software Stack is built upon the Xilinx PCIe Driver, while the host machine features an Intel Core i7-8700 CPU @ 3.20 GHz and 32 GB of DDR memory.
5.2 Resource Overhead and Cost of Ownership
In a multi-tenant environment, resource utilization is a crucial factor, as system overhead reduces the available reconfigurable resources, impacting the number of tenants and the resources available per vFPGA. Table 2 reports the Full System Overhead, which includes the Hardware Shell and the Intermediate Hardware Layer with eight nodes interconnected by a NoC. Results indicate that Crossbar introduces the highest overhead due to its numerous links per node, underscoring the importance of adopting a less resource-intensive topology. In contrast, Ring has the lowest resource utilization, requiring only two links to connect with adjacent nodes. Torus and DoubleRing exhibit intermediate resource utilization, with a slight increase for Torus due to its complexity. Despite the observed overhead of our system, we contend that the resource utilization is acceptable, leaving roughly 75% or more of the reconfigurable resources available for tenants. Resource utilization on FPGA devices is directly linked to total on-chip power. In our study, the system demonstrates a total on-chip power of 25.436 W when employing the Crossbar configuration, whereas the Ring topology requires about one watt less. As with the resource overhead, DoubleRing and Torus fall within the power consumption range of the preceding two topologies.
We further analyze the resource overhead of our Hardware Stack, focusing on memory nodes and gatekeepers. These parameters are crucial for scalability and compatibility with FPGA device specifications. Memory nodes, including DMA Engines and memory controllers, are slightly more expensive than Gatekeepers with their associated TMMUs. The latter contribute a 2% to 2.7% overhead, which is acceptable when multiplexing numerous hardware accelerators in an FPGA. Table 3a outlines the estimated resource usage for several essential hardware components in our hardware stack. These components require only a fraction of the resources, allowing the integration of an increased number of hardware accelerators or memory controllers. However, a substantial portion is allocated to interconnecting the nodes through the preferred topology. While topologies like Ring, DoubleRing, and Torus have predictable resource usage, the Crossbar topology introduces higher overhead due to its all-to-all connections, so a larger portion of the FPGA device is used for interconnecting memory and HT. Therefore, alternative topologies are recommended, especially for FPGAs with a high number of DDR channels and vFPGAs. Future FPGAs are expected to offer a larger pool of reconfigurable resources, leveraging technologies like Xilinx SSI [35] to connect multiple distinct dies. This calls for a scalable architecture that can multiplex an increasing number of tenants without necessitating a complete system rebuild. Our Hardware Stack is designed to fulfill this requirement by enabling seamless integration of additional tenants, simply by adding extra Gatekeepers and TMMUs to the Intermediate Hardware Layer.
Finally, Table 3b outlines the resource usage for a single instance of each tested workload. Their execution has no impact on the functionality or the operating frequency of our intermediate hardware layer. The sole prerequisite is that the workload must adhere to the spatial constraints of the vFPGA regions. Additionally, we present the total on-chip power of the FPGA device when executing our benchmark workloads under the system configuration detailed in Table 2. Our findings confirm that the Ring configuration exhibits lower power consumption than the Crossbar topology, with the total power cost being notably influenced by the spatial demands of the kernel. However, the total cost of ownership (TCO) in terms of power does not exceed 33.5 W, regardless of the kernel or configuration employed.
5.3 Performance of Hardware Tasks
To assess the effectiveness of our framework for on-demand acceleration, we conducted a comparative analysis with Vitis, a widely used FPGA environment designed for single-tenant execution. Figure 7 shows the performance of HT within our framework, normalized against the non-virtualized configuration. Our results indicate that our framework does not introduce any virtualization overhead compared to native execution. The lack of overhead can be attributed to several factors. First, address regions are allocated by our memory segmentation scheme, which ensures contiguous storage of data in physical memory, partitioning data only when a physical segment cannot fully serve an allocation request. This approach facilitates efficient burst accesses, improving data transaction efficiency. Additionally, the DMA Engines and DMA Hardware Controllers further enhance memory transaction efficiency, given their status as hardware-accelerated modules.
Nevertheless, three kernels exhibit different behaviour. LUD and FD show increased performance within our framework. Their implementation allows the retrieval and storage of their entire dataset through a single DMA transaction, whether read or write. As a result, memory nodes need to configure the DMA engine only once to fulfill their memory operations. The transaction length also amortizes the overhead associated with configuring the DMA engines, delivering improved performance compared to the memory-mapped transactions utilized by Vitis. Conversely, Hotspot encounters a marginal 8% performance decline. This kernel operates on small data chunks in each iteration, leading to a substantial number of DMA transactions with the FPGA memory to retrieve and store the complete dataset. The cumulative overhead from configuring DMA engines slightly surpasses the memory-mapped procedure in Vitis.
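This behaviour is captured by a simple cost model (a back-of-the-envelope sketch with assumed constants, not measured values): the data-movement time of a task is roughly \(T_{mem} \approx N_{dma} \cdot t_{config} + B / BW\), where \(N_{dma}\) is the number of DMA transactions, \(t_{config}\) the fixed cost of configuring the DMA engine, \(B\) the total bytes transferred, and \(BW\) the effective memory bandwidth. For LUD and FD, \(N_{dma} = 1\), so the configuration term effectively vanishes; for Hotspot, small per-iteration chunks inflate \(N_{dma}\) until the \(N_{dma} \cdot t_{config}\) term outweighs the savings over Vitis' memory-mapped transfers.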
Subsequently, we conducted a detailed analysis of hardware task performance, focusing on the runtime distribution between internal computation and communication with the Intermediate Hardware Layer for data transfers. The results indicate a slight preference of our framework for HT with large memory burst accesses and few DMA transactions, such as kmeans, leukocyte, and nw. Nonetheless, all HT achieve near-native execution performance. Our system effectively handles both compute-intensive and memory-intensive tasks, making it suitable for a wide range of applications. Developers should be aware that in a multi-tenant setting, any potential performance degradation arises from memory and I/O congestion, rather than from architectural overhead incurred during task porting.
5.4 Aggregate Performance from FPGA Sharing
In this section, we assess the overall performance of our framework when multiple HT execute in parallel. Our analysis examines the fairness and scalability of our architecture in various FPGA sharing configurations. We investigate how network topologies influence the aggregate performance of the hardware stack and the impact of FPGA sharing on the overall task performance within our benchmark collection. The aggregate throughput serves as a metric for the overall performance of the FPGA device compared to a non-virtualized, non-shared environment. This analysis aims to provide valuable insights into the effectiveness of our framework under different sharing scenarios.
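Throughout this section, aggregate throughput can be read as the sum of per-task normalized throughputs, \(T_{agg} = \sum_{i=1}^{N} T_i^{shared} / T_i^{native}\); this is one plausible formalization consistent with the reported figures, under which \(N\) fully isolated HT yield an ideal value of \(N\)x (e.g., 3.96x with four HT).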
Initially, we evaluate the performance of our framework by distributing data from tenants across distinct DDR memories. This scenario leads to increased congestion within the NoC, as HT perform memory transactions simultaneously. Our results, presented in Figure 8, indicate that while several topologies can deliver optimal aggregate throughput with two memory controllers, this advantage diminishes when scaling to four memory controllers and four HT. Only the interleaved configurations of DoubleRing and Torus demonstrate near-optimal performance, with a geometric mean of 3.96x, which is almost equivalent to the Crossbar. These topologies offer significantly higher bandwidth than Ring, alleviating congestion within the network. Furthermore, the interleaved configuration positions memory nodes adjacent to HT, enabling faster and more direct access compared to the Dancehall configuration. This arrangement effectively mitigates pressure on the routers. The results demonstrate the effectiveness of our framework in providing a well-isolated execution environment, effectively minimizing congestion from shared interconnections.
Subsequently, we evaluate the aggregate throughput by grouping data from tenants within the same DDR memory, leading to increased congestion within memory nodes, as HT share the memory interfaces. The outcomes in Figure 9 show that our hardware stack achieves aggregate throughput close to 1.66x and 2.31x when running two and four instances of a hardware task in parallel, respectively. Compute-intensive HT scale efficiently with sufficient vFPGA slots. Conversely, memory-intensive tasks exhibit performance decline, as our framework assigns only a portion of the available bandwidth for memory transactions. Additionally, the chosen topology has no impact on aggregate throughput in this scenario, with the Ring topology emerging as a favorable choice for its low cost compared to other network configurations. However, when an extra DDR channel is available, Ring falls short of providing competitive performance, highlighting the importance of adopting a network topology with improved scalability and numerous links per node to support parallel memory accesses. Torus emerges as a favored choice, as it competes with Crossbar while offering a more cost-effective solution. By employing Torus, the addition of an extra DDR channel boosts the aggregate throughput of our system to 3.33x. The improvement is attributed to the increased performance of HT dominated by data transfers, as they utilize a larger portion of the available memory bandwidth for their transactions.
In Figure 10, we extend our analysis to explore the impact of memory congestion and topology on the aggregate throughput of HT, utilizing four vFPGAs. Results indicate that compute-intensive tasks (e.g., lavamd, lud, nw, and fd) scale effectively with sufficient vFPGA slots. The Ring demonstrates competitive aggregate throughput compared to Crossbar, despite providing significantly lower network bandwidth. This observation holds even when data are distributed across two or four distinct memory channels, where HT can perform memory transactions in parallel. On the other hand, congestion within memory interfaces/nodes significantly affects the aggregate throughput of memory-intensive HT. However, we notice that small burst transactions enhance performance even in highly congested cases, as transactions are effectively interleaved, as happens with knn, leukocyte(d), and pathfinder. Finally, the results highlight the importance of choosing a topology with high network bandwidth, as scaling to two or four memory nodes for parallel memory transactions does not yield any performance benefits with the Ring topology, unlike the Crossbar.
In brief, our results validate fundamental aspects of implementing a multi-tenant environment in reconfigurable devices. By distributing data across distinct DDR memories, our Hardware Stack attains peak aggregate throughput, providing a completely isolated execution environment to tenants. However, sharing of memory interfaces may impact the performance of HT, given that our system ensures equitable access to external interfaces. Compute-intensive tasks scale effectively given a sufficient supply of vFPGAs, while the rest of our benchmarks depend on factors like network topology or congestion when accessing I/O interfaces to achieve higher performance.
5.5 Quality of Service from Spatial Multiplexing
This section evaluates the quality of service of our intra-FPGA virtualization layer by spatially multiplexing different tenants within the same FPGA device. We conducted an experiment where HT share the same memory channel, allowing us to observe the highest level of interference among active tasks. The outcomes of our study are presented in Figure 11. Our analysis offers insights into the impact of concurrent execution through spatial multiplexing on the performance of HT from the perspective of tenants.
Applications with high computational intensity (e.g., lud, lavamd, nw, and fd) exhibit near-optimal and consistently predictable performance, with minimal deviations observed as outliers in the boxplot. This confirms the effectiveness of our framework for computationally oriented processes, as coexisting HT have little impact on their performance. This characteristic is noteworthy in our multi-tenant setup, where FPGAs excel at accelerating computationally demanding algorithms due to their parallel nature. In contrast, KNN is more susceptible to performance degradation when executed concurrently with another process. This can be attributed to its frequent memory accesses and small burst data transactions, with data transfers significantly contributing to the overall execution time. When KNN operates alone, it fully utilizes the available bandwidth, but co-location with another accelerator may cause a performance decline, especially if substantial bandwidth is required. Nonetheless, the median performance of KNN converges to 0.70x, indicating that a significant portion of its native performance is retained.
K-means and backprop demonstrate consistent and high performance, ranging from 0.78x to 0.98x, with median values close to their native performance. This can be attributed to their limited interaction with the Hardware Stack and their large data transfer per transaction (512 KB for backprop and 1 MB for k-means). The remaining HT exhibit similar behaviour, although their performance may be influenced by the presence of memory-intensive tasks. When paired with more computationally demanding algorithms, they approach their optimal performance. These findings highlight the effectiveness of our intra-FPGA virtualization layer in ensuring a high quality of service. Most processes achieve near-optimal median performance, even when they involve substantial interaction with memory. Our system prioritizes performance isolation to provide a high quality of service and the illusion of exclusivity to tenants. Even in congested scenarios, where memory and I/O interfaces are shared, most hardware accelerators experience only small declines compared to their native performance.
7 Related Work
FPGA Abstraction. Chen et al. [9] enable FPGA usage in the cloud through Linux-KVM in a modified OpenStack environment. hCODE [54] introduces a multi-channel shell for managing, creating, and sharing HT via independent PCIe channels. VirtualRC [18] implements a software middleware API as a virtualization layer, converting communication routines for virtual components into API calls for the physical platform. Tarafdar et al. [39] develop an FPGA hypervisor that provides access to all I/O interfaces and programs a partially reconfigurable region with desired bitstreams. Similarly, Catapult [7, 33] virtualizes FPGA resources as a common pool, enabling job scheduling on available accelerators. RACOS [40] offers a user-friendly interface for loading/unloading reconfigurable HT and transparent I/O operations. Finally, FSRF [22] abstracts FPGA I/O at a high level, enabling files to be mapped directly into FPGA virtual memory from the host. Our research integrates multiple elements from prior studies to enhance programming productivity. Specifically, we employ SR-IOV to enable FPGA virtualization, facilitated by the in-built QDMA engine. Additionally, our runtime manager simplifies access to vFPGAs, HT and I/O interfaces, thereby eliminating the need for tenants to possess in-depth knowledge of the underlying platform and hardware.
FPGA Sharing. In [6, 9], FPGA resources are shared through OpenStack using partially reconfigurable regions in both temporal and spatial domains. [43] utilizes an accelerator scheduler to match user requests with a suitable resource pool. [19] implements a hypervisor to manage bitstreams for configuring PRRs and monitoring user access to accelerators. [12] uses a hypervisor on the software stack to communicate with PRRs via a common interface in the static region, handling configuration and allocation of regions to users. In [41], hardware accelerators are shared among multiple customers in a paravirtualized environment. Vital [51] and Hetero-Vital [52] maximize per-FPGA area utilization by segmenting designs into smaller bitstreams and mapping them onto fine-tuned slots within an FPGA cluster, supported by an augmented compiler. AmorphOS [17] implements a "low latency" mode to enable the use of vFPGAs called Morphlets. These Morphlets are managed by a user-mode library, which handles I/O interfaces and facilitates application access. Coyote [21] integrates OS abstractions within the FPGA device, making it part of the host operating system. Each hardware task is paired with a custom memory management unit and translation lookaside buffers to unify FPGA and host memory. Optimus [25], acting as a hypervisor, utilizes time-multiplexing to schedule virtual machines on pre-configured accelerators and employs page table slicing for memory and I/O isolation. Nimblock [26] examines scheduling on shared FPGAs, aiming to enhance response times and reduce deadline violations. Feniks [53] incorporates an operating system within the FPGA and includes communication stacks and modules for off-chip memory, the host CPU, servers, and other cloud resources. VenOS [30] employs a NoC architecture for sharing external memories across HT, utilizing static segments to distribute and isolate the FPGA address space among tenants. This work builds upon the principles of VenOS by introducing a virtual view of the FPGA address space and external interfaces for the HT. This approach strengthens tenant isolation by confining hardware accelerators to user-level execution and preventing any unauthorized access to FPGA resources, similar to software applications under an operating system on the host machine. Additionally, memory segments allow HT to operate on virtual addresses without compromising their native performance due to high data fragmentation. Our memory scheme enables a more flexible management of the address space, aligned with the requirements of tenants, in contrast to static memory and I/O partitioning. The flexible management of both memory bandwidth and address space upholds the inherent performance of HT, even in the presence of interference from other accelerators, achieving up to 0.95x compared to a dedicated, non-virtualized environment. Overall, this work advances prior research by introducing an intra-FPGA virtualization layer that effectively provides architectural support for multi-tenancy. Table 1 presents a comparative analysis, demonstrating that our work provides the most comprehensive feature set for sharing FPGAs among multiple tenants.
Accelerator Libraries. Leading cloud providers, including Amazon [2] and Microsoft [29], now offer pre-compiled hardware accelerators, enabling software applications to harness the efficiency of FPGA devices through simple routines. This approach simplifies programming complexity and enables a Software as a Service (SaaS) model that decouples application development from FPGA design optimization. Similarly, InAccel [14] facilitates large-scale data acceleration across an FPGA cluster using familiar software programming models. While our work is orthogonal to this approach, implementing a SaaS model on top of our framework is a natural fit. In this scenario, pre-compiled kernels can be executed in parallel within our Hardware Stack, effectively accelerating workloads from multiple tenants simultaneously. Adopting a SaaS model can significantly enhance the functionality of our framework, as the HT have predefined operations set by the provider.
Overlays. Overlays offer a higher level of abstraction that allows configurations to be architecture-agnostic, ensuring code portability and minimal compilation overhead across different FPGA platforms. Following this approach, the authors in [11] propose a virtual reconfigurable architecture that hides the complexity of fine-grained reconfigurable resources. Similarly, Koch et al. [20] leverage overlays through custom instruction set extensions to utilize FPGA platforms in a more efficient and flexible manner. In [44], the authors extend ZUMA to provide bitstream compatibility between different devices, allowing the integration of the ReconOS programming model to facilitate the extension of software applications to reconfigurable hardware. Finally, recent works [27, 28] leverage overlays to provide multi-tenancy within the reconfigurable fabric and enable communication between software applications and the FPGA device through VirtIO. Nevertheless, overlay architectures often sacrifice the performance of hardware accelerators and introduce significant resource overheads. They also reduce the ability to reconfigure devices with new HT, which is a key advantage of FPGAs over other processing units. For this reason, cloud environments encourage the use of native hardware accelerators, providing maximum performance and efficiency in accelerating workloads.
FPGA OSes. BORPH [5, 37] extends the Linux OS to manage HT like software processes, treating their compilation and execution in a similar manner; inter-process communication is facilitated through UNIX pipes. Similarly, ReconOS [24] and Hthread [32] extend the multi-threaded programming model to FPGAs, providing support for inter-process communication and synchronization. FUSE [15] provides native OS support for integrating HT into FPGA devices transparently. Leap [1, 13] introduces OS-managed latency-insensitive channels to enable communication between different hardware modules, along with a partitioning algorithm to share the on-board memory. Finally, Wassi et al. [42] optimize the resource usage of FPGA devices through a real-time operating system and a multi-shape task manager that selects the proper version of a hardware task. While these studies explore the potential of native OS support for FPGAs, they do not address the challenges associated with implementing multi-tenancy on reconfigurable devices. Their main focus lies in treating HT like software processes, allowing for inter-process communication and access to the OS and its resources.