1 Introduction
Field Programmable Gate Arrays (FPGAs) have become a popular choice for accelerating diverse workloads, including neural networks [10, 38], data analytics [4, 55], databases [31, 36], quantitative finance [47], linear algebra [46, 50], security [48], and compression [34]. Their runtime reconfigurability and high energy efficiency make them an effective alternative to ASICs and GPUs. This has led major cloud providers such as Amazon [3] and Alibaba [2] to incorporate FPGAs into their infrastructure and offer on-demand acceleration to their customers.
Modern FPGAs [45] have witnessed a substantial increase in available resources owing to two key factors: (a) the integration of multiple distinct dies into a unified substrate, and (b) the inclusion of multiple memory controllers to enhance parallel access to external FPGA memory. However, hardware designs fall short of harnessing this potential. Large monolithic kernels suffer from placement and routing constraints, especially when spanning multiple dies, while smaller hardware accelerators underutilize the available memory and reconfigurable resources. As a solution, cloud providers aim to optimize their return on investment by distributing FPGAs among multiple customers, similar to the virtualization and sharing of CPUs and memory resources.
Enabling FPGA resource sharing among multiple tenants requires the concurrent execution of hardware tasks (HT) to accelerate their respective workloads. However, this presents challenges in establishing the necessary architectural support for a multi-tenant environment. First, spatial multiplexing should not compromise the isolation or the native performance of accelerators (challenge 1), thus providing the illusion of exclusivity to tenants. The illusion of exclusivity involves two key aspects: performance isolation covers the interference effects among hardware accelerators caused by shared interconnects, while data isolation concerns the ability of hardware accelerators to compromise sensitive information or access privileged interfaces. Second, existing environments allow HT to operate without supervision, granting them privileged access to I/O interfaces and FPGA resources (challenge 2). Third, traditional interconnect methods fall short on modern FPGAs, as they lack scalability and flexibility; they hinder the ability to attain high aggregate throughput and to provide sufficient simultaneous acceleration to tenants (challenge 3). Finally, efficient allocation schemes are essential to distribute FPGA resources across tenants (challenge 4), as poor allocations may limit the inherent performance of accelerators or compromise the isolation of tenants.
Previous research efforts have yielded valuable insights on enabling multi-tenancy in FPGAs. Nonetheless, they encounter several architectural limitations and only partially address the challenges associated with FPGA sharing. For instance, overlay techniques [11, 27, 28] lack the reconfigurability to map new hardware accelerators onto FPGA devices and limit their native performance. Some works [6, 9, 17, 30] propose static allocation of memory and reconfigurable resources or partitioning of I/O and memory interfaces across HT. These approaches limit performance and the design flexibility of hardware developers. Other works [6, 9, 51, 52] lack virtualization support to effectively and securely share external memory and I/O interfaces across HT, while others [17, 21] propose a static multiplexing of I/O and memory interfaces, which does not effectively address challenge 3. In addition, allocation schemes on FPGAs are still limited and are only effective either on shared-memory [25] or cache-coherent [21] FPGA platforms, where they are strongly coupled with the host CPU. Therefore, as Table 1 outlines, the challenges related to multi-tenancy and efficient architectural support for FPGA sharing have not been fully addressed or remain unresolved.
In this article, we present our full-stack solution to effectively provide architectural support for virtualizing and sharing FPGA resources to simultaneously accelerate workloads from distinct tenants. Our work inserts an intra-FPGA virtualization layer to allow efficient sharing of external interfaces among hardware accelerators and provide mechanisms for data isolation and protection of I/O interfaces, thereby offering the illusion of operating in a single-tenant environment (challenge 1). This layer relies on a network-on-chip (NoC) architecture, which offers a flexible and scalable design to share FPGA resources and interfaces across HT. NoCs overcome the limitations of static configurations from previous works, optimizing the aggregate throughput of an FPGA and enhancing its utilization by multiple tenants (challenge 3). Our work deviates from creating an overlay architecture with preconfigured processing elements. Instead, we harness NoCs as an interconnect for efficient sharing of I/O interfaces and an effective layer to accommodate tenants.
Each hardware accelerator is coupled with a task memory management unit (TMMU) that virtualizes the external FPGA memories and enables tasks to operate using virtual addresses. This mechanism provides a flexible way to utilize virtual FPGAs (vFPGAs), much like virtual machines on conventional processing units. Each hardware task within a vFPGA operates within its own logical address space, while the TMMU translates virtual addresses into physical ones and prevents any unauthorized access to I/O interfaces or privileged address regions. Additionally, virtual addresses conceal the physical data location, introducing a higher level of abstraction and memory protection. By combining TMMUs with the intra-FPGA virtualization layer, hardware accelerators operate in a non-privileged execution mode (challenge 2) without compromising their inherent performance. Our framework allows HT to operate as distinct processes, enhancing FPGA utilization and aggregate throughput.
Finally, to effectively manage the FPGA address space (challenge 4), we implement a custom memory segmentation scheme. Previous work [21, 25] has shown how internal fragmentation of pages can impact hardware task performance, mainly due to inefficient DMA transactions. This limitation is effectively addressed through segments, as tasks execute data transactions with efficiency similar to single-tenant environments. Our segmentation scheme arranges data either sequentially within FPGA memory or distributes them across a few segments, resembling the concept of huge pages in operating systems. Memory segments effectively isolate the address space across tenants and facilitate data isolation through hardware/software support. Concurrently, TMMUs enable HT to perceive the address space as sequential, rendering the segmentation scheme transparent to users.
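To make the scheme concrete, the following is a minimal software-side sketch of such an allocator, assuming fixed-size 256 MiB segments tracked by a free map; all names (SegmentAllocator, Allocation) and the segment size are illustrative choices of ours, not the implemented design.

    #include <cstdint>
    #include <optional>
    #include <vector>

    // Hypothetical software-side model of the segmentation scheme:
    // allocations are served from fixed-size physical segments
    // (analogous to huge pages), preferring one contiguous run.
    struct Allocation {
        std::vector<uint64_t> segment_bases; // physical base of each segment used
        uint64_t virt_base;                  // contiguous virtual base seen by the HT
    };

    class SegmentAllocator {
        static constexpr uint64_t kSegBytes = 256ULL << 20; // assumed 256 MiB segments
        std::vector<bool> free_;   // free map, one flag per physical segment
        uint64_t next_virt_ = 0;   // monotonically growing virtual space

    public:
        explicit SegmentAllocator(size_t num_segments) : free_(num_segments, true) {}

        std::optional<Allocation> alloc(uint64_t bytes) {
            size_t need = (bytes + kSegBytes - 1) / kSegBytes;
            Allocation a{{}, next_virt_};
            // First pass: look for `need` adjacent free segments (sequential data).
            for (size_t i = 0; i + need <= free_.size(); ++i) {
                size_t run = 0;
                while (run < need && free_[i + run]) ++run;
                if (run == need) {
                    for (size_t j = 0; j < need; ++j) {
                        free_[i + j] = false;
                        a.segment_bases.push_back((i + j) * kSegBytes);
                    }
                    next_virt_ += need * kSegBytes;
                    return a;
                }
            }
            // Fallback: gather scattered segments; the TMMU hides the scattering,
            // so the virtual range stays contiguous.
            std::vector<size_t> picks;
            for (size_t i = 0; i < free_.size() && picks.size() < need; ++i)
                if (free_[i]) picks.push_back(i);
            if (picks.size() < need) return std::nullopt; // out of FPGA memory
            for (size_t i : picks) {
                free_[i] = false;
                a.segment_bases.push_back(i * kSegBytes);
            }
            next_virt_ += need * kSegBytes;
            return a;
        }
    };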
In summary, our work makes the following contributions:
— We present a novel architecture (Section 2) for virtualizing and sharing external I/O interfaces across hardware accelerators. Our approach inserts an intra-FPGA virtualization layer (Section 3), enabling scalable and flexible sharing of FPGA resources through a NoC architecture. This setup allows us to establish effective isolation mechanisms, provide a high quality of service, and offer the illusion of exclusivity to tenants.
— We propose the adoption of memory management units (Section 3.3) to virtualize external FPGA memories, allowing hardware accelerators to operate using virtual addresses. The primary aim is to confine accelerators to user-level execution, thereby restricting their access to unauthorized I/O interfaces or privileged address spaces. These units enhance isolation between tenants and I/O protection, without compromising the native performance of hardware accelerators.
— We propose a novel memory segmentation scheme (Section 4.2) to optimize FPGA address space management. This scheme ensures hardware accelerators can maintain memory transaction efficiency, while isolating the address space for each tenant. Data isolation is facilitated through hardware/software support.
— Results indicate that accelerators maintain their inherent performance from a dedicated, non-virtualized setting (Section 5.3). Our architecture achieves an aggregate throughput of up to 3.96x in isolated and up to 2.31x in highly congested conditions, employing four HT (Section 5.4). It also provides a high quality of service when tasks share FPGA resources, achieving performance levels of \(\sim\)0.95x of their native performance in an isolated environment (Section 5.5).
2 Foundations
Our framework addresses the challenges of a multi-tenant environment by employing a SW/HW co-design strategy. This approach couples a conventional CPU with an FPGA over a peripheral bus such as PCIe, as shown in Figure 1. This naturally divides any design into hardware components operating within the FPGA device, referred to as the Hardware Stack, and software components executing on the host machine as part of the operating system or support libraries, denoted as the Software Stack. The dynamic configuration of the FPGA device with new HT further divides the Hardware Stack into two distinct categories: (a) the static region, configured at FPGA device boot time, and (b) vFPGAs, configurable at runtime with new hardware processes. This classification provides clarity and simplifies FPGA datacenter deployments.
2.1 Static Region
The static region of the hardware stack provides the necessary architectural support to enable multi-tenancy on FPGAs and facilitate parallel acceleration of workloads. In our work, the static region includes the Hardware Shell and the Intermediate Hardware Layer. The Hardware Shell integrates the logic and controllers to enable interaction with the interfaces of an FPGA platform, such as memories, Network I/Os, or other external interfaces, according to device specifications. Additionally, it features an internal configuration module (e.g., ICAP) for partially reconfiguring vFPGAs with new HT. It also contains the necessary functionality for communication with the host machine. In a multi-tenant setting, the hardware shell typically incorporates all FPGA interfaces shared by HT during their execution.
Within the static region, our work introduces an Intermediate Hardware Layer to enable intra-FPGA virtualization. This layer offers mechanisms for isolating HT and sharing FPGA resources without sacrificing the inherent performance of accelerators (challenge 1). A compact NoC interconnects vFPGAs and external interfaces. The NoC provides a fair, flexible, and scalable architecture for sharing the interfaces exposed by the hardware shell among all active HT, offering simultaneous on-demand acceleration (challenge 3). Each vFPGA is linked through its corresponding node in the NoC, allowing our framework to oversee its execution and regulate unsupervised access to FPGA interfaces and resources (challenge 2). Unlike conventional FPGA environments, where HT establish direct connections with hardware components and acquire privileged-level execution, our work enforces isolation mechanisms through the NoC and its nodes, preventing direct and unauthorized communication with FPGA resources.
2.2 Reconfigurable Region
The remaining available space within the FPGA constitutes the reconfigurable area. Modern FPGAs allow the selective reconfiguration of distinct regions within the fabric at any time. In conventional FPGA systems, this region is typically dedicated to a single hardware task and is reprogrammed only when a new application is initiated on the host machine. In multi-tenant environments, these regions, referred to as vFPGAs, allow the parallel acceleration of workloads, enhancing FPGA resource utilization. vFPGAs serve as the fundamental mechanism for temporal and spatial multiplexing of FPGA resources. They function as isolated regions within the reconfigurable fabric, assigned to tenants for on-demand acceleration. Each hardware task operates autonomously within its designated vFPGA, preventing any functional interference from other HT. All vFPGAs are interconnected via their respective nodes to the NoC, enabling HT to leverage and share external interfaces.
2.3 Software Components
The software stack is aware of the heterogeneous environment, offering suitable abstractions for user applications on the host machine to interact with the underlying FPGA device. It assumes control over FPGA resource management, dynamically allocating and managing resources as needed, while also monitoring application accesses to the Hardware Stack. Although certain aspects could potentially be implemented in hardware, delegating these responsibilities to software enhances flexibility and frees valuable FPGA space for tenant utilization. The software stack includes the System Runtime Manager and the memory segmentation scheme, which distributes the FPGA address space and maintains all virtual/physical mappings, allowing tenants to operate using virtual addresses.
The runtime manager is built upon the PCIe driver and controls the distribution of FPGA resources across tenants. vFPGAs are multiplexed and configured using a slot-based strategy, preventing their reassignment to other tenants until the hosted hardware task completes. Meanwhile, the memory segmentation scheme handles the allocation of the FPGA address space (challenge 4). This scheme facilitates data isolation through flexible software/hardware support, simultaneously providing a virtual perspective of the FPGA address space for both software applications and HT. Applications interact with the FPGA device through software routines, ensuring a seamless and familiar deployment within our multi-tenant setting. The routines handle the allocation and configuration of vFPGA slots, the allocation and release of FPGA address regions, data transactions between the host machine and the FPGA device, and the initiation of HT. Overall, they provide a comprehensive software programming flow to harness the efficiency of FPGAs.
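As an illustration, the sketch below models this programming flow in host-side C++; every routine name and type is a hypothetical stand-in for the routines described above, stubbed here so the example is self-contained.

    #include <cstdint>
    #include <cstdio>
    #include <cstring>
    #include <vector>

    // Hypothetical host-side flow; all function names below are
    // illustrative stand-ins for the runtime-manager routines, not
    // a documented API.
    struct VfpgaHandle { int slot = 0; };
    struct FpgaBuffer  { std::vector<uint8_t> shadow; }; // stand-in for device memory

    VfpgaHandle vfpga_acquire(const char* bit) { std::printf("configure %s\n", bit); return {}; }
    FpgaBuffer  fpga_malloc(size_t n)          { return {std::vector<uint8_t>(n)}; }
    void fpga_memcpy_to(FpgaBuffer& d, const void* s, size_t n)   { std::memcpy(d.shadow.data(), s, n); }
    void fpga_memcpy_from(void* d, const FpgaBuffer& s, size_t n) { std::memcpy(d, s.shadow.data(), n); }
    void vfpga_run(VfpgaHandle, FpgaBuffer& in, FpgaBuffer& out)  { out.shadow = in.shadow; }
    void vfpga_release(VfpgaHandle) {}

    int main() {
        float in[4] = {1, 2, 3, 4}, out[4] = {};
        VfpgaHandle h   = vfpga_acquire("kmeans.bit");  // slot is ours until completion
        FpgaBuffer din  = fpga_malloc(sizeof in);       // FPGA address region
        FpgaBuffer dout = fpga_malloc(sizeof out);
        fpga_memcpy_to(din, in, sizeof in);             // host-to-FPGA transfer
        vfpga_run(h, din, dout);                        // HT sees only virtual addresses
        fpga_memcpy_from(out, dout, sizeof out);        // FPGA-to-host transfer
        vfpga_release(h);
        return 0;
    }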
3 Hardware Stack
Our framework considers FPGA platforms as specialized devices for accelerating software workloads. Each FPGA is equipped with a dedicated memory/network stack and connects to the host machine via a PCIe interface. Figure 2 shows the architecture of the Hardware Stack, dividing the device into three parts, each with a distinct role in our system: (a) the Hardware Shell includes the built-in logic to access all I/O and memory interfaces; (b) the Intermediate Hardware Layer facilitates intra-FPGA virtualization, allowing the sharing of FPGA resources across HT; and (c) the vFPGAs serve as hosts for the HT, delivering on-demand acceleration.
The Hardware Shell is essential for managing, controlling, and debugging the platform from the Software Stack. It enables access to I/O interfaces (e.g., DDR memories and the external network) and interaction with HT. The shell exposes a memory-mapped view of the Hardware Stack, effectively dividing the FPGA address space into privileged and user address regions. The user address region includes the external memories for storing the datasets and output results of HT. HT handle on-chip memories (e.g., BRAM or URAM) identically to single-tenant environments, as these resources are seamlessly integrated into vFPGAs. The privileged address space comprises registers for controlling and monitoring the HT, and memory modules for ensuring their normal operation. Finally, the Hardware Shell features a High-Bandwidth ICAP (HBICAP) Controller for on-demand partial reconfiguration of vFPGAs with the desired bitstreams.
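For intuition, the privileged/user split can be pictured as a memory map like the following; all base addresses and sizes here are illustrative assumptions of this sketch, not the shell's actual layout.

    #include <cstdint>

    // Illustrative memory-map split; every constant is an assumption
    // for the sketch, not the shell's real layout.
    namespace addr_map {
        // Privileged region: control/status registers and support modules,
        // reachable only from the Software Stack, never from HT.
        constexpr uint64_t kCtrlRegsBase  = 0x0000'0000;
        constexpr uint64_t kCtrlRegsBytes = 64ULL << 10;    // per-vFPGA CSRs
        // User region: external DDR holding HT datasets and results.
        constexpr uint64_t kUserDdrBase   = 0x1'0000'0000;
        constexpr uint64_t kUserDdrBytes  = 64ULL << 30;    // e.g., 4 x 16 GiB channels

        constexpr bool is_privileged(uint64_t a) {
            return a < kUserDdrBase;  // everything below user DDR is privileged
        }
    }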
Our intra-FPGA virtualization approach centers around the Intermediate Hardware Layer, which interconnects vFPGAs and external interfaces to efficiently share FPGA resources among tenants. It abstracts the underlying architecture, establishes isolation mechanisms, and maintains native accelerator performance, creating the illusion of exclusivity for tenants. The architecture includes a compact NoC to spatially multiplex all tenants, providing flexible, scalable, and fair sharing of external I/O interfaces. Memory Nodes handle the data transactions with memory controllers, serving all incoming requests from HT. Gatekeepers connect HT to the NoC, providing them access to external interfaces for data transactions. Together with the TMMUs, they form an isolation layer that provides data isolation and restricts unsupervised access of HT to FPGA interfaces and resources. The Hardware Task Wrapper simplifies monitoring by standardizing the porting process and connectivity of HT in our environment, abstracting and hiding architectural complexities.
The remaining resources are organized as vFPGAs, well-isolated regions for mapping and configuring HT for on-demand acceleration. vFPGAs offer a homogeneous view of the device, overcoming challenges associated with managing and partitioning the heterogeneous pool of resources (e.g., BRAMs, FFs, DSPs, and LUTs). This approach also favors smaller accelerators over large monolithic kernels, whose performance degrades when routing extends across die boundaries. vFPGAs are an efficient alternative, facilitating parallel acceleration of multiple workloads from different tenants in a single FPGA device and enhancing resource and memory utilization. They do not share reconfigurable resources or wire connections, enhancing functional isolation and reducing the risk of functional interference within our system. The Software Stack ensures the exclusive allocation of vFPGAs to tenants, guaranteeing that HT run to completion before a vFPGA is reconfigured. Finally, direct communication between vFPGAs or with external interfaces is not permitted; instead, all interactions occur through the Intermediate Hardware Layer and are closely monitored.
3.1 Gatekeeper
Gatekeepers act as NoC access points for vFPGAs. They handle all memory requests and data transactions involving external interfaces, establishing isolation mechanisms to prevent direct communication with FPGA resources. In conventional FPGAs, accelerators establish direct connections with hardware components and controllers, allowing unrestricted access to I/O and memory interfaces and granting them privileged-level execution. However, privileged execution poses a major challenge in multi-tenant environments. Gatekeepers add a layer of oversight and control, forcing all accesses to external interfaces to be routed through the node and effectively downgrading the execution level of a hardware task to user level. The monitoring process is simplified through the Hardware Task Wrapper, which serves as a sandboxing mechanism for standardizing connectivity with the underlying system.
The Gatekeeper manages and routes incoming and outgoing messages between the NoC and the vFPGA. Its router connects the hosted hardware task with the wider system, enabling seamless communication with the Intermediate Hardware Layer. The router's design is closely tied to the chosen network topology, and a detailed analysis falls outside the scope of this work. Together with the TMMU, the Gatekeeper forms an isolation layer that prevents unauthorized access to address spaces and external modules during hardware task execution. It safeguards against malicious memory requests or erroneous data, preventing their transmission toward the NoC or the hardware task. Both transmitted and received data are temporarily buffered and converted into raw or network data, depending on their direction within the node. The TMMU verifies whether a request targets an authorized address region or external module. If flagged as invalid, all buffers are configured to prohibit the data flow to either the vFPGA or the router, discarding all data in the process. As such, Gatekeepers eliminate any direct access to external interfaces and FPGA resources, establishing mechanisms to oversee all communication carried out within the Hardware Stack.
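The admission logic can be summarized by the following software model; the types, names, and the single-window check are simplifications of ours rather than the actual node implementation.

    #include <cstdint>

    // Minimal software model of the Gatekeeper admission path; all
    // types and names are illustrative, not the actual node interface.
    enum class Op { Read, Write };

    struct MemRequest { uint64_t virt_addr; uint32_t bytes; Op op; };

    // Stand-in for the TMMU check (sketched in Section 3.3): a single
    // allowed window replaces the real segment lookup.
    struct TmmuStub {
        uint64_t base, size;
        bool check(const MemRequest& r, uint64_t& phys) const {
            if (r.virt_addr < base || r.virt_addr + r.bytes > base + size)
                return false;                            // outside allocated region
            phys = 0x8000'0000 + (r.virt_addr - base);   // assumed physical base
            return true;
        }
    };

    // Returns true if the request may proceed toward the NoC; on failure
    // the Gatekeeper drains its buffers so neither the NoC nor the
    // hardware task ever observes the offending transaction.
    bool gatekeeper_admit(const TmmuStub& tmmu, const MemRequest& r, uint64_t& phys) {
        if (!tmmu.check(r, phys)) {
            // In hardware: buffers are flagged to discard in-flight data
            // in both directions (toward the router and toward the vFPGA).
            return false;
        }
        return true;
    }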
3.2 Hardware Task Wrapper
The Hardware Task Wrapper provides a collection of interfaces to enhance the portability and configuration of HT across vFPGAs, connecting them with the rest of the Hardware Stack. It is essential for enabling partial reconfiguration, which necessitates a standardized interface due to the locking of boundary signals on the FPGA fabric. Code portability between FPGAs, or even between revisions of the same model, is rare, forcing developers to constantly adapt their accelerators to new hardware shells or development tools. Our hardware task wrapper addresses this issue by raising the portability of FPGA accelerators to the language level. It allows any hardware task, given sufficient resources, to be compiled and configured on any available vFPGA, and even facilitates porting between different device models using our Hardware Stack. This renders HT both FPGA and vFPGA agnostic, improving portability and flexibility and offering a higher level of abstraction to tenants.
The Hardware Task Wrapper establishes a standard I/O interface between HT and Gatekeepers. This allows developers to focus on the computational aspect of their hardware task, leaving the management, communication and data transactions to the Intermediate Hardware Layer. The Wrapper abstracts the FPGA fabric, enabling HT to function independently of hardware specifics and data location. It provides streaming interfaces for I/O, an AXI-Lite interface for control and monitoring, and clock/reset signals. The streaming interfaces support a data-driven approach, reducing handshake signals and facilitating compatibility with pipeline and dataflow primitives for accelerating workloads. The wrapper allows integration of HT in both HDL and HLS languages, offering a familiar environment for developing and porting FPGA applications.
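For intuition, a task conforming to the wrapper's interfaces might look as follows in HLS C++; the 512-bit word width, port names, and the trivial computation are assumptions of this sketch, not the wrapper's specification.

    // Illustrative Vitis HLS task matching the wrapper's standard
    // interface; widths, names, and the computation are assumptions.
    #include "hls_stream.h"
    #include "ap_int.h"

    void hardware_task(hls::stream<ap_uint<512> >& in,   // data-driven input stream
                       hls::stream<ap_uint<512> >& out,  // data-driven output stream
                       ap_uint<32> config) {             // AXI-Lite parameter
    #pragma HLS INTERFACE axis      port=in
    #pragma HLS INTERFACE axis      port=out
    #pragma HLS INTERFACE s_axilite port=config
    #pragma HLS INTERFACE s_axilite port=return          // control/monitor registers
        // The task is oblivious to addresses, DMA, and data placement:
        // it simply consumes and produces stream words.
        ap_uint<512> word = in.read();
        out.write(word + config);
    }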
3.3 Task Memory Management Unit
In heterogeneous computing, external memories commonly store vast datasets that HT access during processing. Multi-tenancy raises significant concerns related to data and performance isolation. Unlike software applications, which operate using virtual addresses, HT access external FPGA memories through physical addresses, posing potential security risks. Additionally, conventional FPGA environments allow HT to access any address region or I/O interface without supervision. These vulnerabilities could grant tenants unauthorized access to privileged address spaces, disrupt external controllers, or hinder access to I/O interfaces for other tenants.
To address these concerns, we introduce the TMMU, a dedicated module integrated with each vFPGA. It virtualizes the external FPGA address space while overseeing and restricting the accesses of HT to external FPGA interfaces. As shown in Figure 3, the TMMU ensures that HT access only allocated address regions, while allowing them to manage their data within a contiguous address space, similar to software applications. Operating in the virtual address domain conceals the physical data location, enforcing memory protection and data isolation among tenants. Two modules achieve the virtualization of external memories: (a) the Task Addressable Memory, which verifies whether memory transactions fall within the allocated address regions, and (b) the Virtual/Physical Segment Translator, which converts virtual segments into physical ones and generates the necessary DMA requests for the memory nodes.
Incoming memory requests include the virtual address and its offset, the request size, and the memory operation. Initially, the Task Addressable Memory evaluates whether the transaction targets an address region associated with the executing program and falls within the corresponding virtual size. This module is built upon the principles of Content Addressable Memory (CAM), a specialized memory architecture for rapid search applications. Its contents are set by the Software Stack during hardware task initialization and are compared against each memory request. If a match is found, the module returns the associated virtual size and verifies whether the request remains within the allocated space, then forwards it for translation. Otherwise, it alerts the Gatekeeper to discard any incoming or outgoing data, preventing erroneous data from reaching the NoC or sensitive information from reaching a malicious tenant.
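A software model of this lookup is sketched below; the entry count and field names are assumptions, and the sequential loop stands in for the parallel compare performed by the CAM in hardware.

    #include <array>
    #include <cstdint>

    // Software model of the Task Addressable Memory: a small fully
    // associative table of allocated regions; in hardware the entries
    // are matched in parallel.
    struct TamEntry {
        bool     valid = false;
        uint64_t virt_base = 0;   // start of an allocated virtual region
        uint64_t virt_size = 0;   // bytes allocated to this region
    };

    struct TaskAddressableMemory {
        std::array<TamEntry, 8> entries; // populated at task initialization

        // Returns the matching entry index, or -1 to signal the Gatekeeper
        // to discard the transaction.
        int lookup(uint64_t virt_addr, uint64_t bytes) const {
            for (int i = 0; i < (int)entries.size(); ++i) { // parallel compare in HW
                const TamEntry& e = entries[i];
                if (e.valid && virt_addr >= e.virt_base &&
                    virt_addr + bytes <= e.virt_base + e.virt_size)
                    return i;  // request fits inside the allocated region
            }
            return -1;
        }
    };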
Figure 3 also demonstrates how the TMMU enables the seamless operation of HT within a contiguous address space, even when data are distributed across different segments. The Virtual/Physical Segment Translator creates this illusion and handles the conversion of virtual addresses into physical ones. Moreover, the TMMU enables HT to operate unaware of our memory allocation policy, discussed in Section 4.2. The translation process resembles the operation of page tables in operating systems: the virtual address points to an entry in the segment table, which holds the physical address of the associated segment within FPGA memory. Since data may span multiple physical segments, a virtual segment may point to several physical addresses. The module translates incoming memory requests from HT and generates the corresponding DMA transactions, which are routed toward memory nodes. When the requested data are stored across multiple segments, separate DMA transactions are generated, as memory nodes must configure the DMA engine for each targeted segment to retrieve all requested data.
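The following sketch models the translation step, assuming fixed-size segments and a flat segment table; the names and the 256 MiB segment size are illustrative.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Software model of the Virtual/Physical Segment Translator: a request
    // spanning several physical segments is split into one DMA descriptor
    // per segment. Segment size and table shape are assumptions.
    constexpr uint64_t kSegBytes = 256ULL << 20;

    struct DmaDescriptor { uint64_t phys_addr; uint64_t bytes; };

    // segment_table[i] holds the physical base of the i-th virtual segment.
    std::vector<DmaDescriptor> translate(uint64_t virt_addr, uint64_t bytes,
                                         const std::vector<uint64_t>& segment_table) {
        std::vector<DmaDescriptor> dmas;
        while (bytes > 0) {
            uint64_t seg_idx = virt_addr / kSegBytes;   // segment-table entry
            uint64_t seg_off = virt_addr % kSegBytes;   // offset inside the segment
            uint64_t chunk   = std::min(bytes, kSegBytes - seg_off);
            dmas.push_back({segment_table[seg_idx] + seg_off, chunk});
            virt_addr += chunk;                         // advance virtual cursor
            bytes     -= chunk;
        }
        return dmas;  // one DMA configuration per touched segment
    }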
3.4 Memory Node
Memory Nodes manage memory requests originating from HT and facilitate data routing between the NoC and the memory controllers. Each node is equipped with a DMA Engine that handles all memory operations with the external DDR channels. By configuring a DMA engine, a node orchestrates and supervises the entire transaction process with the corresponding memory channel. This design allows memory transactions to proceed independently on each node, enabling concurrent access to all available channels. Memory nodes decouple compute processing from memory operations, allowing HT to continue processing uninterrupted unless they are waiting for a memory request to be served by an external memory controller.
Incoming packets are classified either as Memory Request Messages or Data Messages. Memory requests are routed to the DMA Hardware Controller, while data messages are buffered, converted, and directed to the DMA engine. Outgoing data follows the reverse process. The DMA Hardware Controller is a lightweight yet highly efficient module within the Hardware Stack that enables DMA transactions independently of the host CPU. It allows HT within vFPGAs to indirectly initiate memory transactions with the FPGA memories, bypassing the need for software or programmer involvement from the host machine. The controller handles all memory operations within the Hardware Stack, while the TMMU maintains data coherency between tenants. Any transactions extending beyond a tenant's allocated FPGA address space are discarded, with the TMMU overseeing the generation of DMA transactions based on the allocated physical segments. This approach reduces programming complexity, with the TMMU abstracting all hardware details related to DMA transactions from the HT. The controller oversees the DMA engine, monitoring its status and interrupts, and configures transaction details such as the starting address and length. Memory nodes thus prevent direct access by HT to external DDR memories, preventing them from disrupting the normal operation of memory controllers.
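An abstract model of the controller's request loop is shown below; the DmaEngine interface is an assumption of this sketch, not the register map of an actual DMA IP.

    #include <cstdint>
    #include <queue>

    // Abstract model of a Memory Node serving translated requests.
    struct DmaDescriptor { uint64_t phys_addr; uint64_t bytes; bool write; };

    struct DmaEngine {
        void configure(const DmaDescriptor& d) { (void)d; /* program addr/len regs */ }
        void wait_done() { /* poll status register or take an interrupt */ }
    };

    // The DMA Hardware Controller drains request messages and drives the
    // engine without host-CPU involvement; one outstanding transfer per
    // memory channel in this simplified model.
    void serve(std::queue<DmaDescriptor>& requests, DmaEngine& engine) {
        while (!requests.empty()) {
            engine.configure(requests.front()); // start the transfer
            engine.wait_done();                 // completion before the next request
            requests.pop();
        }
    }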
5 Experimental Evaluation
5.1 Experimental Setup
Our experimental work evaluates the feasibility of incorporating multi-tenancy into reconfigurable platforms for improved performance and efficiency. We assess this by employing a collection of HT from the widely used Rodinia [8] and Rosetta [56] benchmark suites for heterogeneous computing. These benchmarks cover both compute- and memory-intensive applications, representing real-world scenarios. The Intermediate Hardware Layer consists of a configurable number of vFPGA slots and external DDR channels, allowing us to study the impact of FPGA sharing and memory/NoC congestion on hardware task performance. For this reason, we implement four widely used network topologies (Crossbar, Ring, DoubleRing, and Torus) and analyze the effect of memory node placement on aggregate throughput. The Dancehall configuration (Figure 6(a)) groups all memory nodes together, while the Interleaved configuration (Figure 6(b)) places memory nodes adjacent to Gatekeepers. Crossbar is omitted from this analysis due to its all-to-all connection architecture. Our Hardware Stack is implemented on an Alveo U250 Data Center accelerator card connected to a host computer via a PCIe x16 interface. It is developed using the Vitis HLS 2020.2 and Vivado 2020.2 environments and operates at 250 MHz. The Software Stack is built upon the Xilinx PCIe Driver, while the host machine features an Intel Core i7-8700 CPU @ 3.20 GHz and 32 GB of DDR memory.
5.2 Resource Overhead and Cost of Ownership
In a multi-tenant environment, resource utilization is a crucial factor, as system overhead reduces the available reconfigurable resources, impacting the number of tenants and the resources available per vFPGA. Table 2 reports the Full System Overhead, which includes the Hardware Shell and the Intermediate Hardware Layer with eight nodes interconnected by a NoC. Results indicate that Crossbar introduces the highest overhead due to its numerous links per node, underscoring the importance of adopting a less resource-intensive topology. In contrast, Ring has the lowest resource utilization, requiring only two links to connect with adjacent nodes. Torus and DoubleRing exhibit intermediate resource utilization, with a slight increase for Torus due to its complexity. Despite the observed overhead of our system, we contend that the resource utilization is acceptable, leaving roughly 75% or more of the reconfigurable resources available for tenants. Resource utilization on FPGA devices is directly linked to total on-chip power. In our study, the system demonstrates a total on-chip power of 25.436 W when employing the Crossbar configuration, whereas the Ring topology requires about one watt less. As with the resource overhead, DoubleRing and Torus fall within the power consumption range of the preceding two topologies.
We further analyze the resource overhead of our Hardware Stack, focusing on memory nodes and gatekeepers. These parameters are crucial for scalability and compatibility with FPGA device specifications. Memory nodes, including DMA Engines and memory controllers, are slightly more expensive than Gatekeepers with their associated TMMUs. The latter contribute a 2% to 2.7% overhead, which is acceptable when multiplexing numerous hardware accelerators in an FPGA. Table 3a outlines the estimated resource usage for several essential hardware components in our hardware stack. These components require only a fraction of the resources, allowing the integration of an increased number of hardware accelerators or memory controllers. However, a substantial portion is allocated to interconnecting the nodes through the preferred topology. While topologies like Ring, DoubleRing, and Torus have predictable resource usage, the Crossbar topology introduces higher overhead due to its all-to-all connections, so a larger portion of the FPGA device is used for interconnecting memory and HT. Therefore, alternative topologies are recommended, especially for FPGAs with a high number of DDR channels and vFPGAs. Future FPGAs are expected to offer a larger pool of reconfigurable resources, leveraging technologies like Xilinx SSI [35] to connect multiple distinct dies. This calls for a scalable architecture that can multiplex an increasing number of tenants without necessitating a complete system rebuild. Our Hardware Stack is designed to fulfill this requirement by enabling seamless integration of additional tenants, simply by adding extra Gatekeepers and TMMUs to the Intermediate Hardware Layer.
Finally, Table 3b outlines the resource usage for a single instance of each tested workload. Their execution has no impact on the functionality or the operating frequency of our intermediate hardware layer. The sole prerequisite is that the workload must adhere to the spatial constraints of the vFPGA regions. Additionally, we present the total on-chip power of the FPGA device when executing our benchmark workloads under the system configuration detailed in Table 2. Our findings confirm that the Ring configuration exhibits lower power consumption than the Crossbar topology, with the total power cost being notably influenced by the spatial demands of the kernel. However, the total cost of ownership (TCO) in terms of power does not exceed 33.5 W, regardless of the kernel or configuration employed.
5.3 Performance of Hardware Tasks
To assess the effectiveness of our framework for on-demand acceleration, we conducted a comparative analysis with Vitis, a widely used FPGA environment designed for single-tenant execution. Figure 7 shows the performance of HT within our framework, normalized against the non-virtualized configuration. Our results indicate that our framework does not introduce any virtualization overhead compared to native execution. The lack of overhead can be attributed to several factors. First, address regions are allocated by our memory segmentation scheme, which ensures contiguous storage of data in physical memory, partitioning data only when a physical segment cannot fully serve an allocation request. This approach facilitates efficient burst accesses, improving data transaction efficiency. Additionally, the DMA Engines and DMA Hardware Controllers further enhance memory transaction efficiency, given their status as hardware-accelerated modules.
Nevertheless, three kernels exhibit different behaviour. LUD and FD show increased performance within our framework. Their implementation allows the retrieval and storage of their entire dataset through a single DMA transaction, whether read or write. As a result, memory nodes need to configure the DMA engine only once to fulfill their memory operations. The transaction length also amortizes the overhead associated with configuring the DMA engines, delivering improved performance compared to the memory-mapped transactions utilized by Vitis. Conversely, Hotspot encounters a marginal 8% performance decline. This kernel operates on small data chunks in each iteration, leading to a substantial number of DMA transactions with the FPGA memory to retrieve and store the complete dataset. The cumulative overhead from configuring DMA engines slightly surpasses the memory-mapped procedure in Vitis.
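This behaviour is captured by a simple cost model (a back-of-the-envelope sketch with assumed constants, not measured values): the data-movement time of a task is roughly \(T_{mem} \approx N_{dma} \cdot t_{config} + B / BW\), where \(N_{dma}\) is the number of DMA transactions, \(t_{config}\) the fixed cost of configuring the DMA engine, \(B\) the total bytes transferred, and \(BW\) the effective memory bandwidth. For LUD and FD, \(N_{dma} = 1\), so the configuration term effectively vanishes; for Hotspot, small per-iteration chunks inflate \(N_{dma}\) until the \(N_{dma} \cdot t_{config}\) term outweighs the savings over Vitis' memory-mapped transfers.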
Subsequently, we conducted a detailed analysis of hardware task performance, focusing on the runtime distribution between internal computation and communication with the Intermediate Hardware Layer for data transfers. The results indicate a slight preference of our framework for HT with large memory burst accesses and few DMA transactions, such as kmeans, leukocyte, and nw. Nonetheless, all HT achieve near-native execution performance. Our system effectively handles both compute-intensive and memory-intensive tasks, making it suitable for a wide range of applications. Developers should be aware that in a multi-tenant setting, any potential performance degradation arises from memory and I/O congestion, rather than from architectural overhead incurred during task porting.
5.4 Aggregate Performance from FPGA Sharing
In this section, we assess the overall performance of our framework when multiple HT execute in parallel. Our analysis examines the fairness and scalability of our architecture in various FPGA sharing configurations. We investigate how network topologies influence the aggregate performance of the hardware stack and the impact of FPGA sharing on the overall task performance within our benchmark collection. The aggregate throughput serves as a metric for the overall performance of the FPGA device compared to a non-virtualized, non-shared environment. This analysis aims to provide valuable insights into the effectiveness of our framework under different sharing scenarios.
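Throughout this section, aggregate throughput can be read as the sum of per-task normalized throughputs, \(T_{agg} = \sum_{i=1}^{N} T_i^{shared} / T_i^{native}\); this is one plausible formalization consistent with the reported figures, under which \(N\) fully isolated HT yield an ideal value of \(N\)x (e.g., 3.96x with four HT).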
Initially, we evaluate the performance of our framework by distributing data from tenants across distinct DDR memories. This scenario leads to increased congestion within the NoC, as HT perform memory transactions simultaneously. Our results, presented in Figure 8, indicate that while several topologies can deliver optimal aggregate throughput with two memory controllers, this advantage diminishes when scaling to four memory controllers and four HT. Only the interleaved configurations of DoubleRing and Torus demonstrate near-optimal performance, with a geometric mean of 3.96x, which is almost equivalent to the Crossbar. These topologies offer significantly higher bandwidth than Ring, alleviating congestion within the network. Furthermore, the interleaved configuration positions memory nodes adjacent to HT, enabling faster and more direct access compared to the Dancehall configuration. This arrangement effectively mitigates pressure on the routers. The results demonstrate the effectiveness of our framework in providing a well-isolated execution environment, effectively minimizing congestion from shared interconnections.
Subsequently, we evaluate the aggregate throughput by grouping data from tenants within the same DDR memory, leading to increased congestion within memory nodes, as HT share the memory interfaces. The outcomes in Figure 9 show that our hardware stack achieves aggregate throughput close to 1.66x and 2.31x when running two and four instances of a hardware task in parallel, respectively. Compute-intensive HT scale efficiently with sufficient vFPGA slots. Conversely, memory-intensive tasks exhibit performance decline, as our framework assigns only a portion of the available bandwidth for memory transactions. Additionally, the chosen topology has no impact on aggregate throughput in this scenario, with the Ring topology emerging as a favorable choice for its low cost compared to other network configurations. However, when an extra DDR channel is available, Ring falls short of providing competitive performance, highlighting the importance of adopting a network topology with improved scalability and numerous links per node to support parallel memory accesses. Torus emerges as a favored choice, as it competes with Crossbar while offering a more cost-effective solution. By employing Torus, the addition of an extra DDR channel boosts the aggregate throughput of our system to 3.33x. The improvement is attributed to the increased performance of HT dominated by data transfers, as they utilize a larger portion of the available memory bandwidth for their transactions.
In Figure 10, we extend our analysis to explore the impact of memory congestion and topology on the aggregate throughput of HT, utilizing four vFPGAs. Results indicate that compute-intensive tasks (e.g., lavamd, lud, nw, and fd) scale effectively with sufficient vFPGA slots. The Ring demonstrates competitive aggregate throughput compared to Crossbar, despite providing significantly lower network bandwidth. This observation holds even when data are distributed across two or four distinct memory channels, where HT can perform memory transactions in parallel. On the other hand, congestion within memory interfaces/nodes significantly affects the aggregate throughput of memory-intensive HT. However, we notice that small burst transactions enhance performance even in highly congested cases, as transactions are effectively interleaved, as happens with knn, leukocyte(d), and pathfinder. Finally, the results highlight the importance of choosing a topology with high network bandwidth, as scaling to two or four memory nodes for parallel memory transactions does not yield any performance benefits with the Ring topology, unlike the Crossbar.
In brief, our results validate fundamental aspects of implementing a multi-tenant environment in reconfigurable devices. By distributing data across distinct DDR memories, our Hardware Stack attains peak aggregate throughput, providing a completely isolated execution environment to tenants. However, sharing of memory interfaces may impact the performance of HT, given that our system ensures equitable access to external interfaces. Compute-intensive tasks scale effectively given a sufficient supply of vFPGAs, while the rest of our benchmarks depend on factors like network topology or congestion when accessing I/O interfaces to achieve higher performance.
5.5 Quality of Service from Spatial Multiplexing
This section evaluates the quality of service of our intra-FPGA virtualization layer by spatially multiplexing different tenants within the same FPGA device. We conducted an experiment where HT share the same memory channel, allowing us to observe the highest level of interference among active tasks. The outcomes of our study are presented in Figure 11. Our analysis offers insights into the impact of concurrent execution through spatial multiplexing on the performance of HT from the perspective of tenants.
Applications with high computational intensity (e.g., lud, lavamd, nw, and fd) exhibit near-optimal and consistently predictable performance, with minimal deviations observed as outliers in the boxplot. This confirms the effectiveness of our framework for computationally oriented processes, as coexisting HT have little impact on their performance. This characteristic is noteworthy in our multi-tenant setup, where FPGAs excel at accelerating computationally demanding algorithms due to their parallel nature. In contrast, KNN is more susceptible to performance degradation when executed concurrently with another process. This can be attributed to its frequent memory accesses and small burst data transactions, with data transfers significantly contributing to the overall execution time. When KNN operates alone, it fully utilizes the available bandwidth, but co-location with another accelerator may cause a performance decline, especially if substantial bandwidth is required. Nonetheless, the median performance of KNN converges to 0.70x, indicating that a significant portion of its native performance is retained.
K-means and backprop demonstrate consistent and high performance, ranging from 0.78x to 0.98x, with median values close to their native performance. This can be attributed to their limited interaction with the Hardware Stack and their large data transfer per transaction (512 KB for backprop and 1 MB for k-means). The remaining HT exhibit similar behaviour, although their performance may be influenced by the presence of memory-intensive tasks. When paired with more computationally demanding algorithms, they approach their optimal performance. These findings highlight the effectiveness of our intra-FPGA virtualization layer in ensuring a high quality of service. Most processes achieve near-optimal median performance, even when they involve substantial interaction with memory. Our system prioritizes performance isolation to provide a high quality of service and the illusion of exclusivity to tenants. Even in congested scenarios, where memory and I/O interfaces are shared, most hardware accelerators experience only small declines compared to their native performance.
7 Related Work
FPGA Abstraction. Chen et al. [9] enable FPGA usage in the cloud through Linux-KVM in a modified OpenStack environment. hCODE [54] introduces a multi-channel shell for managing, creating, and sharing HT via independent PCIe channels. VirtualRC [18] implements a software middleware API as a virtualization layer, converting communication routines for virtual components into API calls for the physical platform. Tarafdar et al. [39] develop an FPGA hypervisor that provides access to all I/O interfaces and programs a partially reconfigurable region with desired bitstreams. Similarly, Catapult [7, 33] virtualizes FPGA resources as a common pool, enabling job scheduling on available accelerators. RACOS [40] offers a user-friendly interface for loading/unloading reconfigurable HT and transparent I/O operations. Finally, FSRF [22] abstracts FPGA I/O at a high level, enabling files to be mapped directly into FPGA virtual memory from the host. Our research integrates multiple elements from prior studies to enhance programming productivity. Specifically, we employ SR-IOV to enable FPGA virtualization, facilitated by the in-built QDMA engine. Additionally, our runtime manager simplifies access to vFPGAs, HT and I/O interfaces, thereby eliminating the need for tenants to possess in-depth knowledge of the underlying platform and hardware.
FPGA Sharing. In [6, 9], FPGA resources are shared through OpenStack using partially reconfigurable regions in both temporal and spatial domains. [43] utilizes an accelerator scheduler to match user requests with a suitable resource pool. [19] implements a hypervisor to manage bitstreams for configuring PRRs and monitoring user access to accelerators. [12] uses a hypervisor on the software stack to communicate with PRRs via a common interface in the static region, handling configuration and allocation of regions to users. In [41], hardware accelerators are shared among multiple customers in a paravirtualized environment. Vital [51] and Hetero-Vital [52] maximize per-FPGA area utilization by segmenting designs into smaller bitstreams and mapping them onto fine-tuned slots within an FPGA cluster, supported by an augmented compiler. AmorphOS [17] implements a "low latency" mode to enable the use of vFPGAs called Morphlets. These Morphlets are managed by a user-mode library, which handles I/O interfaces and facilitates application access. Coyote [21] integrates OS abstractions within the FPGA device, making it part of the host operating system. Each hardware task is paired with a custom memory management unit and translation lookaside buffers to unify FPGA and host memory. Optimus [25], acting as a hypervisor, utilizes time-multiplexing to schedule virtual machines on pre-configured accelerators and employs page table slicing for memory and I/O isolation. Nimblock [26] examines scheduling on shared FPGAs, aiming to enhance response times and reduce deadline violations. Feniks [53] incorporates an operating system within the FPGA and includes communication stacks and modules for off-chip memory, the host CPU, servers, and other cloud resources. VenOS [30] employs a NoC architecture for sharing external memories across HT, utilizing static segments to distribute and isolate the FPGA address space among tenants. This work builds upon the principles of VenOS by introducing a virtual view of the FPGA address space and external interfaces for the HT. This approach strengthens tenant isolation by confining hardware accelerators to user-level execution and preventing any unauthorized access to FPGA resources, similar to software applications under an operating system on the host machine. Additionally, memory segments allow HT to operate on virtual addresses without compromising their native performance due to high data fragmentation. Our memory scheme enables a more flexible management of the address space, aligned with the requirements of tenants, in contrast to static memory and I/O partitioning. The flexible management of both memory bandwidth and address space upholds the inherent performance of HT, even in the presence of interference from other accelerators, achieving up to 0.95x compared to a dedicated, non-virtualized environment. Overall, this work advances prior research by introducing an intra-FPGA virtualization layer that effectively provides architectural support for multi-tenancy. Table 1 presents a comparative analysis, demonstrating that our work provides the most comprehensive feature set for sharing FPGAs among multiple tenants.
Accelerator Libraries. Leading cloud providers, including Amazon [2] and Microsoft [29], now offer pre-compiled hardware accelerators, enabling software applications to harness the efficiency of FPGA devices through simple routines. This approach simplifies programming complexity and enables a Software as a Service (SaaS) model that decouples application development from FPGA design optimization. Similarly, InAccel [14] facilitates large-scale data acceleration across an FPGA cluster using familiar software programming models. While our work is orthogonal to this approach, implementing a SaaS model on top of our framework is a natural fit. In this scenario, pre-compiled kernels can be executed in parallel within our Hardware Stack, effectively accelerating workloads from multiple tenants simultaneously. Adopting a SaaS model can significantly enhance the functionality of our framework, as the HT have predefined operations set by the provider.
Overlays. Overlays offer a higher level of abstraction that allows configurations to be architecture-agnostic, ensuring code portability and minimal compilation overhead across different FPGA platforms. Following this approach, the authors in [11] propose a virtual reconfigurable architecture that hides the complexity of fine-grained reconfigurable resources. Similarly, Koch et al. [20] leverage overlays through custom instruction set extensions to utilize FPGA platforms in a more efficient and flexible manner. In [44], the authors extend ZUMA to provide bitstream compatibility between different devices, allowing the integration of the ReconOS programming model to facilitate the extension of software applications to reconfigurable hardware. Finally, recent works [27, 28] leverage overlays to provide multi-tenancy within the reconfigurable fabric and enable communication between software applications and the FPGA device through VirtIO. Nevertheless, overlay architectures often sacrifice the performance of hardware accelerators and introduce significant resource overheads. They also reduce the ability to reconfigure devices with new HT, which is a key advantage of FPGAs over other processing units. For this reason, cloud environments encourage the use of native hardware accelerators, providing maximum performance and efficiency in accelerating workloads.
FPGA OSes. BORPH [5, 37] extends the Linux OS to manage HT like software processes, treating their compilation and execution in a similar manner; inter-process communication is facilitated through UNIX pipes. Similarly, ReconOS [24] and Hthread [32] extend the multi-threaded programming model to FPGAs, providing support for inter-process communication and synchronization. FUSE [15] provides native OS support for integrating HT into FPGA devices transparently. Leap [1, 13] introduces OS-managed latency-insensitive channels to enable communication between different hardware modules, along with a partitioning algorithm to share the on-board memory. Finally, Wassi et al. [42] optimize the resource usage of FPGA devices through a real-time operating system and a multi-shape task manager that selects the proper version of a hardware task. While these studies explore the potential of native OS support for FPGAs, they do not address the challenges associated with implementing multi-tenancy on reconfigurable devices. Their main focus lies in treating HT like software processes, allowing for inter-process communication and access to the OS and its resources.