1 Introduction
As a brain-inspired computational model, spiking neural networks (SNNs) have emerged as a strong candidate for future AI platforms. SNNs are considered the third generation of artificial neural networks [35] and more closely resemble biological neurons in the brain than conventional second-generation ANNs built on commonly used activation functions such as the sigmoid or rectified linear unit (ReLU). We refer to the second generation of neural networks as non-spiking ANNs, or simply ANNs, in this article. In ANNs, signals are continuous-valued and model the averaged input and output firing rates of different neurons. A key distinction between SNNs and ANNs is that the former explicitly model all-or-none firing spikes across both space and time, as seen in biological neural circuits. Beyond (firing) rate-based coding, SNNs can go further than conventional ANNs by exploring a diverse family of temporal codes and are intrinsically better positioned for processing complex spatiotemporal data [6, 35, 36]. Furthermore, biologically inspired [5, 8, 59] and backpropagation-based SNN training methods [27, 32, 47, 50, 53] have emerged and demonstrated competitive performance on various image and speech tasks.
The spiking nature of operation has also set off the development of event-driven neuromorphic processors in both academia [2, 14, 15, 17, 24, 29, 37, 39, 41, 43, 44, 52] and industry, including IBM's TrueNorth [2] and Intel's Loihi [17] neuromorphic chips. While there is a body of spiking neural network hardware design work using emerging devices [54], this work is primarily focused on digital SNN accelerators.
While feedforward neural networks are widely adopted, we emphasize that recurrent spiking neural networks (R-SNNs) more realistically match the wiring structure of biological brains, e.g., that of the six-layer minicolumns of the cerebral cortex. Hence, R-SNNs form a powerful bio-inspired computing substrate that is both dynamical and recurrent. More specifically, R-SNNs can extract complex spatiotemporal features via the dynamics created by recurrence across different levels and implement powerful temporally based distributed memory, motivated by the essential role of working memory in cognitive processes for reasoning and decision-making [18].
Figure 1(a) shows a recurrent layer unrolled through time. Unlike in feedforward neural networks, the output of the recurrent layer becomes part of its input at the next time point. This property allows recurrent layers to process and store received input information through time at different spatial locations, making them well suited for sequential learning tasks [51, 58]. Clearly, R-SNNs are significantly more complicated than their feedforward counterparts. Fortunately, R-SNN architectures and training methods with state-of-the-art performance on various image, speech, and natural language processing tasks have emerged [4, 56].
While there exists a large body of DNN hardware accelerator and dataflow work based on non-spiking ANN models, e.g., [11, 13, 19, 22, 42], much less prior work has been devoted to SNN hardware accelerator architectures. TrueNorth [2] and Loihi [17], the two best-known industrial neuromorphic chips, are based on a many-core architecture with an asynchronous mesh supporting sparse core-to-core communication. Each core emulates 256 spiking neurons (TrueNorth) or 1,024 spiking neural compartments (Loihi) in a sequential manner; hence, one key drawback of these two designs is the lack of parallelism within each core. Although SNN dataflow is an important research problem, it has not been studied extensively. For instance, some works accelerate SNNs by simply adopting a dataflow commonly used in ANN accelerators in a temporally sequential manner [7, 52], which fails to consider the spatiotemporal characteristics of SNN operation and the tradeoffs involved. The recent SpinalFlow architecture is tailored to spiking neural networks using a novel compressed, time-stamped, and sorted spike input/output representation, exploiting the high degree of network firing sparsity [38]. However, SpinalFlow only handles temporally coded SNNs in which each neuron is allowed to fire at most once, limiting the achievable accuracy of decision making [16, 28]. Furthermore, this restricted form of temporal coding makes SpinalFlow inapplicable to the large class of rate-coded and other temporally coded SNNs in which spiking neurons fire more than once. Importantly, all of these SNN dataflow architectures target only feedforward SNNs [7, 38, 52].
This work is motivated by the lack of optimized accelerator architectures for general recurrent spiking neural networks (R-SNNs). We propose the first architecture for systolic-array acceleration of recurrent spiking neural networks, dubbed SaARSP, to efficiently support spike-based spatiotemporal learning tasks.
The main contributions of this work are:
•
Unlike prior work that targets only feedforward SNNs and/or limits how temporal information is coded, the proposed SaARSP architecture is applicable to the most general R-SNN models.
•
We demonstrate a novel decoupling scheme to separate the processing of feedforward and recurrent synaptic connections, allowing for parallel computation over multiple time points and improved data reuse.
•
We characterize the temporal granularity of the proposed decoupling by defining a batch size called the time window size, which specifies the number of simultaneously processed time points, and show that time window size optimization (TWSO) significantly impacts the overall accelerator performance.
•
Stationary dataflow and time window size are jointly optimized to trade off weight data reuse against partial-sum movement, the two bottlenecks in the latency and energy dissipation of the accelerator.
•
We configure the systolic array according to the layer-dependent connectivity structure to improve latency and energy efficiency.
We evaluate the proposed SaARSP architecture and dataflows by developing an R-SNN architecture simulator and using a comprehensive set of recurrent spiking neural network benchmarks. Compared to a conventional baseline that does not explore the proposed decoupling, the SaARSP architecture improves energy-delay product (EDP) by 4,000X on average for different benchmarks.
2 Background
Conventional deep learning has demonstrated superb results in many machine learning tasks. Nevertheless, SNNs have emerged as a promising, biologically plausible alternative. With event-driven operation, SNN models running on dedicated neuromorphic hardware can improve energy efficiency by orders of magnitude for a variety of tasks [2, 17]. The recurrence in recurrent SNNs (R-SNNs) enables the formation of temporally based distributed memory, making them well suited for processing highly complex spatiotemporal data. In this section, we provide a brief background on spiking neural networks, including R-SNNs.
2.1 Spiking Neurons
To mimic biological neurons in the nervous system, a spiking neuron model, such as the prevalent leaky integrate-and-fire (LIF) model [21], is adopted, as shown in Figure 1(b). Common to virtually all spiking neuron models, the operation of one spiking neuron comprises three main steps at each time point \(t\): (1) integration of pre-synaptic spike inputs, (2) update of the post-synaptic membrane potential, and (3) conditional generation of a post-synaptic spike output (action potential). As shown in Figure 1(b), during Step (1), if a particular pre-synaptic neuron fires, the induced pre-synaptic current is integrated by the post-synaptic neuron. From a modeling perspective, the corresponding synaptic weight between the two neurons, or more generally a quantity determined by the weight, is added (accumulated) to the post-synaptic membrane potential. After integrating all pre-synaptic currents, in Step (2), the post-synaptic neuron updates its membrane potential by adding to it the sum of the integrated synaptic currents; temporal decay of the membrane potential is also included if the neural model is leaky. In the last step, the spiking neuron compares its updated membrane potential with a pre-determined firing threshold and generates an output spike (action potential) if the membrane potential exceeds the threshold, as shown in Figure 1(b). The same process repeats for all time points involved in a given spike-based task.
2.2 Spiking Neural Networks
Spiking neurons are wired to form a network. The aforementioned temporal processing of individual spiking neurons is brought into a network setting in which neurons communicate and compute by receiving and generating stereotyped all-or-none spikes both spatially and temporally. This article considers the most general (deep) multi-layer recurrent spiking neural network (R-SNN) architecture, which comprises multiple feedforward or recurrent layers with inter-layer feedforward connections. The proposed accelerator architecture accelerates SNN inference on a layer-by-layer basis.
2.2.1 Feedforward Spiking Layers.
Feedforward SNNs are special cases of the more general R-SNNs. The feedforward synaptic weights between a layer and its preceding layer can be represented by a weight matrix \({\bf W}\). At a given time point, the binary pre-synaptic/post-synaptic spikes of all neurons in a layer form vectors of ones and zeros. For SNN acceleration on digital accelerators, it is common practice to adopt the LIF model [21] with a zero-th order synaptic response model discretized over time, leading to three steps of layer processing at each time point \(t_k\):
Step 1: Feedforward synaptic input integration:
Step 2: Membrane potential update:
Step 3: Conditional spike output generation:
where the \((l-1)\)th layer and \(l\)th layer are the pre- and post-synaptic layers, and \(j\) and \(i\) are the neuron indices in the two layers, respectively, as shown in Figure 2(a). \({{\bf f}_{i}^{l}[{t}_{k}]}\), \({{\bf v}_{i}^{l}[{t}_{k}]}\), and \({{\bf s}_{i}^{(l)}[{t}_{k}]}\) denote the integrated pre-synaptic spike inputs, the membrane potential, and the spike output of the \(i\)th neuron in layer \(l\) at time \({t}_{k}\), respectively. \({{\bf W}_{ji}^F}\) is the feedforward synaptic weight between neurons \(i\) and \(j\) of the two layers, and \(M^{l}\) is the number of neurons in layer \(l\). \({\bf V}_{th}\) and \({\bf V}_{leak}\) are the firing threshold and the leak parameter, respectively. The processing of each spiking neuron follows the three steps described in the previous subsection. Step 1 of the layer processing is a matrix-vector multiplication, while all other computations are scalar operations; therefore, Step 1 is computationally expensive and dominates the cost of hardware acceleration.
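Because the input spikes are binary, the matrix-vector product in Step 1 reduces to accumulating the weight-matrix columns selected by the firing pre-synaptic neurons, which is why accumulate-only (multiplier-free) processing elements suffice. The minimal sketch below illustrates this; the function and variable names are assumptions for illustration.

```python
import numpy as np

def feedforward_integration(W_F, s_prev):
    """Step 1 for a feedforward layer at one time point (illustrative sketch).

    W_F    : (M_l, M_{l-1}) feedforward weight matrix
    s_prev : binary spike vector of the preceding layer at time t_k
    Returns the integrated feedforward input for all neurons in layer l.
    """
    # Multiplication-free form: sum only the columns whose pre-synaptic neuron fired.
    # Mathematically equivalent to the matrix-vector product W_F @ s_prev.
    return W_F[:, s_prev == 1].sum(axis=1)
```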
2.2.2 Recurrent Spiking Layers.
By exploiting recurrence in the network connectivity, R-SNNs exhibit rich dynamics and can implement temporally based local memory, as depicted in Figure 2(b). Processing a recurrent layer follows the same three-step procedure as feedforward layers, with one difference: Step 1 of a feedforward layer only integrates the feedforward synaptic inputs from the preceding layer, whereas a recurrent layer must consider two types of synaptic connections: (1) feedforward inputs from the preceding layer and (2) lateral recurrent inputs within the same layer. The latter is computed in an additional step:
Step 1*: Recurrent input integration for recurrent layers:
where \({{\bf s}_{p}^{(l)}[{t}_{k-1}]}\) is the spike output of neuron \(p\) of layer \(l\) at time \({t}_{k-1}\), \({{\bf r}_{i}^{l}[{t}_{k}]}\) is the integrated recurrent synaptic input of neuron \(i\), and \({{\bf W}_{pi}^{R,l}}\) is the recurrent synaptic weight between neurons \(p\) and \(i\) in the recurrent layer \(l\). In (5), \({{\bf r}_{i}^{l}[{t}_{k}]}\) is added to the \({{\bf f}_{i}^{l}[{t}_{k}]}\) computed in (1) to form the total integrated input of neuron \(i\). While recurrence adds significantly to the computing capability of an R-SNN, the computation of \({{\bf r}_{i}^{l}[{t}_{k}]}\) introduces additional tightly coupled data dependencies in both space and time, i.e., across different neurons (space) and different time points (time), which is tackled by the proposed architecture as discussed in Section 3.
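The sketch below processes one time point of a recurrent layer in the conventional, fully sequential fashion; the dependence on the layer's own spikes at \(t_{k-1}\) is what couples successive time points. Variable names, the subtractive leak, and the reset-to-zero are assumptions for illustration.

```python
import numpy as np

def recurrent_layer_step(W_F, W_R, s_prev_layer, s_own_prev, v, v_th, v_leak):
    """One time point t_k of recurrent layer l, processed conventionally (sketch).

    s_prev_layer : binary spikes of layer l-1 at t_k
    s_own_prev   : binary spikes of layer l at t_{k-1} (the temporal dependency)
    """
    f = W_F[:, s_prev_layer == 1].sum(axis=1)   # Step 1 : feedforward integration
    r = W_R[:, s_own_prev == 1].sum(axis=1)     # Step 1*: recurrent integration
    v = v - v_leak + f + r                      # Step 2 : membrane potential update
    s_out = (v >= v_th).astype(int)             # Step 3 : conditional spike generation
    v = v * (1 - s_out)                         # assumed reset-to-zero on firing
    return s_out, v
```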
2.3 Computational Potential of R-SNNs
Because R-SNNs produce complex network dynamics via recurrent connections, they have attracted significant recent research interest as a promising biologically inspired paradigm for time-series data processing, speech recognition, and predictive learning. In particular, the recurrent connections in an R-SNN form temporally based local memory, opening up opportunities for supporting a broad range of spatiotemporal learning tasks.
Recent advances in R-SNN network architectures and end-to-end supervised training methods have led to high-performance R-SNNs that are not attainable using unsupervised, biologically plausible learning mechanisms such as spike-timing dependent plasticity (STDP). For example, deep R-SNNs have been trained using recent SNN backpropagation techniques [4, 56, 57], achieving state-of-the-art accuracy on commonly used spike-based neuromorphic image and speech recognition datasets such as MNIST [31], Fashion-MNIST [55], N-TIDIGITS [3], the TI46 speech corpus [33], and Sequential MNIST and TIMIT [20]. Bellec et al. [4] presented a powerful R-SNN architecture, called long short-term memory spiking neural networks (LSNNs), comprising multiple distinct leaky integrate-and-fire (LIF) recurrent spiking neural populations. Promising supervised learning, reinforcement learning, and learning-to-learn (L2L) capabilities have been demonstrated using LSNNs.
2.4 R-SNN Accelerators
While SNNs have gathered significant research interest as promising bio-inspired models of computation, only a few works have addressed SNN hardware accelerators [15, 24, 38, 39, 45], and in particular array-based accelerators [1, 6, 23, 49, 52], because the spatiotemporal nature of spikes makes it difficult to design an efficient architecture. Importantly, these existing works have primarily focused on feedforward SNNs; only very few designs are capable of executing R-SNNs [2, 17, 37]. For example, [1, 23, 49, 52] introduced systolic-array-based accelerators for spiking CNNs. However, these works target only feedforward networks and do not propose an efficient method to handle recurrence, which produces tightly coupled data dependencies in both time and space. These strict spatiotemporal dependencies are a key difficulty in developing an optimized hardware architecture for R-SNNs. Furthermore, RNN accelerator designs and techniques are incompatible with R-SNNs due to unique properties of spiking models, such as the disparity in data representation, which leads to different trade-offs in data sharing and storage compared to non-spiking models.
A few types of neuromorphic hardware are capable of executing R-SNNs [2, 17, 37]. TrueNorth [2] and Loihi [17] are the two best-known industrial large-scale neuromorphic chips, both based on a many-core architecture. Both chips are fully programmable and capable of executing R-SNNs: TrueNorth [2] relies on intra-core crossbar memory and long-range connections through an inter-core network, while Loihi [17] adopts a neuron-to-neuron mesh routing model [9]. The work in [37] presented a multicore neuromorphic processor chip that employs hybrid analog/digital circuits. With a novel architecture that combines distributed and heterogeneous memory structures with a flexible routing scheme, the chip of [37] can support a wide range of networks, including recurrent networks.
However, existing architectures are limited to processing R-SNNs in a time-sequential manner, which requires alternating access to two different types of weight matrices, i.e., the feedforward and recurrent weight matrices, at every time point. The major shortcoming of the above architectures originates from the assumption that both the feedforward and recurrent inputs must be accumulated to generate the final activations at a given time point, which serve as the recurrent inputs to the next time point, before the next time point can be processed. Moreover, large-scale multicore neuromorphic chips such as IBM's TrueNorth with 256M synapses [2] and Intel's Loihi with 130M synapses [17] assume that all weights of the network are fully stored on-chip. Such a very large core memory can mask the inefficiencies above, i.e., the lack of parallelism and weight reuse, making weight reuse and data movement less critical than in many other practical cases.
In contrast, the main idea of this article enables parallel computation over multiple time points, unlike existing architectures, and maximizes the benefits of weight reuse. Our main target is to minimize the data-movement energy cost of memory-intensive SNN accelerators, especially in the practical case where the accelerator cannot hold the entire weight matrices on chip. Without the proposed scheme for minimizing data movement, processing an R-SNN requires alternating access to the two types of weight matrices at every time point. Accordingly, we define a baseline architecture (without decoupling) that retains the essential ideas of existing SNN accelerators and extends them to recurrent SNNs. Owing to its inherent advantages in exploiting spatiotemporal parallelism, as discussed in Section 3.2, our accelerator is based on a systolic-array architecture.
3 SaARSP: Proposed Architecture
We present the proposed architecture for systolic-array acceleration of recurrent spiking neural networks, dubbed SaARSP, which accelerates a given R-SNN in a layer-by-layer manner. SaARSP addresses the data dependencies introduced by temporal processing in R-SNNs via a decoupled feedforward/recurrent synaptic integration scheme and a novel time window size optimization that enable time-domain parallelism, thereby reducing latency and energy. It supports both the output stationary (OS) and weight stationary (WS) dataflows for maximized data reuse.
3.1 Decoupled Feedforward/Recurrent Synaptic Integration
As discussed in Section 2.2.2, recurrence in R-SNNs introduces tightly coupled data dependencies in both space and time, which may prevent direct parallelization in the time domain and hence limit the overall performance. We address this challenge by parallelizing spike input integration through decoupling the integration of feedforward and recurrent synaptic inputs. The key idea is to restructure the spike integration process, as shown in Figure 3(b). One key observation is that, while the complete processing of a recurrent layer involves temporal data dependency, the feedforward synaptic input integration, i.e., Step 1 corresponding to (1), has no temporal data dependency and can be parallelized over multiple time points. The subsequent steps of recurrent input integration (Step 1*), membrane potential update (Step 2), and spike output generation (Step 3) are then performed sequentially, time step by time step. In conventional approaches, the spike integration step at a given time point \(t_{k}\) requires two different weight matrices, and this is repeated for every time point, as depicted in Figure 3(a). In contrast, decoupling the feedforward and recurrent stages enables each of the two weight matrices to be reused over consecutive time points, as shown in Figure 3(b). This decoupling scheme can be expressed as the two macro-steps below:
Step A: Feedforward spike input integration for \(t_{k}\) to \(t_{k+TW-1}\) over a time window of \(TW\) points.
Step B: Process Step 1*, 2, and 3 for \(t_{k}\) to \(t_{k+TW-1}\) sequentially.
In the above, \(TW\) is the time window size, which specifies the temporal granularity of the decoupling: the feedforward synaptic integration step is processed first, over \(TW\) time points, followed by the remaining steps in a sequential manner. Processing the feedforward input integration over multiple time points as in (6) is possible because the output spikes of the preceding layer over the same time window, which are the inputs to the present layer, have already been computed at this point in the layer-by-layer processing sequence. We further discuss optimization of the time window size in Section 3.3, and sketch the decoupled schedule below.
Weight data reuse opportunity. Importantly, this decoupling scheme opens up two weight-matrix data reuse opportunities. First, it is easy to see that the feedforward weight matrix \({\bf W}^{F,l}\) can be reused across all \(TW\) time points in Step A. We group the remaining steps (Step 1*, 2, and 3) into Step B and process it sequentially, because the layer's spike outputs at the present time point cannot be determined without knowing the spike outputs at the preceding time point, which feed back to the same recurrent layer via the recurrent connections according to (2), (3), (4), and (5). Nevertheless, even though Step B is performed sequentially, decoupling it from Step A allows the recurrent weight matrix \({\bf W}^{R,l}\) to be reused across \(TW\) time points in Step B. The decoupling scheme thus offers a unifying solution to enhance weight data reuse for both feedforward and recurrent integration, and is applicable to both feedforward and recurrent SNNs.
3.2 Proposed SaARSP Architecture
Without the proposed time-domain parallelism, accelerating recurrent layers in the conventional approach effectively amounts to a 1-D array performing serial processing time point by time point, as shown in Figure 3(a). As discussed above, very few SNN hardware accelerators exist, and most prior works have focused on feedforward SNNs for which the decoupling scheme has not been explored. We therefore retain the essential ideas of existing SNN accelerators while extending them to recurrent SNNs without time-domain parallelism, giving rise to the baseline architecture adopted throughout this article for comparison.
In contrast, Figure 4 shows the overall SaARSP architecture, comprising controllers, caches, and a reconfigurable systolic array. Memory hierarchy design is critical to the overall performance of neural network acceleration due to its memory-intensive nature. As a standard practice, we adopt three levels of memory hierarchy, as shown in Figure 4: (1) DRAM, (2) a global buffer, and (3) a double-buffered L1 cache [30, 46]. We employ programmable data links with simple control logic among the PEs of the 2-D systolic array such that it can be reconfigured into a 1-D array. Each processing element (PE) consists of a scratchpad and an accumulation (AC) unit that accumulates the pre-synaptic weights when the corresponding input spikes are present. The SaARSP architecture supports two different stationary dataflows on the systolic array.
Systolic Arrays. A major benefit of systolic arrays is parallel computation in a simultaneous row-wise and column-wise manner with local communication [52]. Furthermore, systolic arrays are inherently suitable for exploiting spatiotemporal parallelism, i.e., across different neurons (space) and across multiple time points (time). In particular, they are well suited to our main idea of parallel computation across the time domain. In addition, the row-wise and column-wise unidirectional links can be separately adopted and optimized for the distinct data types in SNNs.
Stationary Dataflows. For non-spiking ANNs, many previous works have leveraged stationary dataflows to reduce expensive data movement [12, 13, 46]. By keeping one type of data (input, weight, or output) in each PE, the input stationary (IS), weight stationary (WS), and output stationary (OS) dataflows reduce the movement of the corresponding data type. However, a stationary dataflow may have a different impact for spiking models, given their unique properties. We explore the OS and WS dataflows to mitigate the movement of the large volumes of multi-bit Psum and weight data, respectively. We do not consider the input stationary (IS) dataflow commonly used in conventional DNN accelerators because binary input spikes are of low volume, and reusing binary input data offers limited benefit.
3.2.1 Output Stationary Dataflow.
The two processing steps are executed on SaARSP under the OS dataflow as follows. In Step A, the systolic array is configured into a 2-D shape to exploit parallelism across both space and time, as illustrated in Figure 4(a). Input spikes at different time points within the time window are fetched to the corresponding columns of the array from the top. Spike input integration for different time points is processed column-wise, with input spikes propagating vertically from top to bottom. The feedforward weight matrix is fetched from the left and propagates horizontally from left to right, enabling weight data reuse across time points.
Since Step B is performed sequentially, the 2-D systolic array is reconfigured into a 1-D array to fully utilize the compute resources and maximize spatial parallelism at each individual time point, as shown in Figure 4(b).
3.2.2 Weight Stationary Dataflow.
In the WS dataflow, weight data, as opposed to Psums, reside stationary in the scratchpads to maximize weight data reuse. While the weights are held in the PEs, input spikes and Psums propagate vertically through the array. Unlike OS, there is no horizontal data propagation in WS except when the array retrieves new weight data for computation. WS further maximizes weight data reuse but incurs increased cost for storing and moving Psums.
3.3 Time-Window Size Optimization (TWSO)
Processing \(TW\) time points within the chosen time window, as in (6), with the decoupling scheme allows the exploitation of temporal parallelism. However, there exists a fundamental trade-off between weight reuse and Psum storage: the key advantage of decoupling, i.e., weight data reuse across time points, may be completely offset by the need to store incomplete partial results across multiple time points [48]. Decoupling blindly can exacerbate this trade-off and even degrade performance, as shown later in Figures 7 and 8.
To address this issue, we introduce a novel time window size optimization (TWSO) technique that optimizes the window size \(TW\) to maximize the latency and energy benefits. To the best of our knowledge, this is the first work to explore temporal granularity via TWSO, which is more powerful and flexible than applying the decoupling scheme alone [48].
TWSO Addresses the Fundamental Tradeoff. As discussed above, the number of time steps that can be executed simultaneously in one array iteration is limited by the array width \(H\). To accommodate higher degrees of time-domain parallelism, we define \(K\) as the time-iteration factor such that \(TW = K \cdot H\), i.e., the parallel integration of the feedforward pre-synaptic inputs of a recurrent layer consumes \(K\) array processing iterations, as shown in Figure 5. For a given \(K\), the array reuses a single weight matrix, either \({{\bf W}^F}\) or \({{\bf W}^R}\), for \(TW\) time points.
On the other hand, there may exist optimal choices for the value of \(TW\) (\(K\)). According to Figure 5 and Equation (6), the decoupling scheme batches the feedforward and recurrent input integration steps, enabling reuse of both the feedforward and recurrent weight matrices \({{\bf W}^F}\) and \({{\bf W}^R}\) over multiple time points and avoiding expensive alternating access to them across Step A and Step B. However, parallel processing in the time domain can degrade performance due to the increased amount of partial sums (Psums): upon completion of Step A over many time points, a large amount of incomplete, multi-bit Psum data awaits processing in Step B. More Psums can degrade performance owing to the increased latency and energy dissipation of Psum data movement across the memory hierarchy. TWSO addresses this fundamental tradeoff by exploring the granularity via varying time window sizes and finding the optimal operating point. TWSO is generally applicable to both non-spiking and spiking RNNs.
TWSO Offers Application Flexibility. Typically, a decoupling scheme has a key limitation: it provides no benefit beyond the first layer due to temporal dependency. This is because, when the RNN is unrolled over all time points, both the feedforward and recurrent inputs must be accumulated to generate a recurrent layer's final activations, which are part of the inputs to the next layer, before the next layer can be processed. However, TWSO subdivides all time points into multiple time windows and performs the decoupled accumulation of feedforward/recurrent inputs at a granularity specified by the time window size, allowing the processing of different recurrent layers to overlap across different time windows. For example, upon completing time window \(i\) of recurrent layer \(k\), the activities of layer \(k\) in time window \(i+1\) and those of layer \(k+1\) in time window \(i\) can be processed concurrently. Therefore, with TWSO, decoupling can be applied to multiple recurrent layers concurrently.
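The purely illustrative scheduling sketch below (all names assumed) shows how subdividing time into windows lets layers proceed as a wavefront over (layer, window) pairs: once layer \(k\) finishes window \(i\), layer \(k\) can start window \(i+1\) while layer \(k+1\) starts window \(i\).

```python
def wavefront_schedule(num_layers, num_windows):
    """Group (layer, window) tasks into steps that may run concurrently (sketch).

    Layer k may process window i only after layer k-1 has finished window i
    and layer k has finished window i-1, yielding anti-diagonal wavefronts.
    """
    steps = []
    for step in range(num_layers + num_windows - 1):
        concurrent = [(k, step - k) for k in range(num_layers)
                      if 0 <= step - k < num_windows]
        steps.append(concurrent)
    return steps

# Example: 3 recurrent layers, 4 time windows.
# Step 2 runs (layer 0, window 2), (layer 1, window 1), (layer 2, window 0) concurrently.
schedule = wavefront_schedule(3, 4)
```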
TWSO Is Specifically Beneficial for Spiking Models. Since the partial sums of each layer are multi-bit while its outputs are only binary, the benefit of increased weight data reuse resulting from decoupling may be more easily offset by the increased storage requirement for multi-bit Psums. Optimizing the granularity of decoupling thus becomes even more critical for R-SNNs, as addressed by TWSO.
In addition, TWSO alleviates the memory bandwidth bottleneck by reusing weights across the chosen time window rather than repeatedly fetching them, so the time window size is not constrained by memory bandwidth as in conventional approaches.
As we demonstrate in our experimental studies in Section 5, TWSO can significantly improve the overall performance.
6 Conclusion
This work is motivated by the lack of an efficient architecture and dataflow for accelerating the complex spatiotemporal dynamics arising in R-SNNs. To the best of our knowledge, the proposed architecture for systolic-array acceleration of recurrent spiking neural networks, dubbed SaARSP, presents the first systematic study of array-based hardware accelerator architectures for recurrent spiking neural networks.
One major challenge in accelerating R-SNNs stems from the tightly coupled data dependencies in both time and space resulting from the recurrent connections. This challenge prevents direct exploitation of time-domain parallelism and may severely degrade the overall performance due to poor data reuse patterns.
The proposed SaARSP architecture is built upon a decoupling scheme and a novel TWSO technique to enable the parallel acceleration of computation across multiple time points. This is achieved by decoupling the feedforward and recurrent synaptic input integration, the two dominant costs in processing recurrent network structures. We further boost accelerator performance by optimizing the temporal granularity of the proposed decoupling and the stationary dataflow in a layer-dependent manner. The SaARSP architecture can accelerate both feedforward and recurrent layers and hence supports a broad class of spiking neural network topologies.
Experimentally, the proposed SaARSP architecture and optimization scheme reduce the latency and energy dissipation of the array accelerator by up to 102X and 161X, respectively, and improve the energy-delay product (EDP) by 4,000X on average over a conventional baseline for a comprehensive set of benchmark R-SNNs.