CN115936064B - Neural network acceleration array based on weight circulation data stream - Google Patents
- Publication number: CN115936064B
- Application number: CN202211141844.9A
- Authority: CN (China)
- Legal status: Active
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to a neural network acceleration array based on a weight-circulation data stream, which fully reuses the weight values read from memory and the input feature map data, greatly reducing accesses to external memory; it belongs to the technical field of neural network hardware acceleration. In the field of artificial intelligence chips, convolution accounts for more than ninety percent of the computation of a convolutional neural network model. To reduce repeated fetching and movement of the input data in a spatial-domain computing structure and to maximize data reuse, the invention proposes a weight-circulation data stream. By designing a PE array based on this data stream, the convolution operation is optimized and the power consumption and latency of the hardware acceleration structure are effectively reduced, improving the overall performance of the system.
Description
Technical Field
The invention relates to the technical field of neural network hardware acceleration, and in particular to a design method for a neural network acceleration array based on a weight-circulation data stream.
Background
With the rapid development of Internet-of-Things technology, wearable intelligent products use parts of the human body as interfaces to the Internet. Being miniaturized, portable and intelligent, they deliver a truly integrated human-machine product experience, provide consumers with convenient real-time information acquisition and data services, and carry considerable technical content and market appeal.
Designing a high-performance wearable computing device, however, is not easy and faces many challenges. The field lies at the intersection of research areas such as computer science and engineering, and draws on technologies such as microelectronics and wireless communication. Advances in microelectronics have made small, low-power artificial intelligence chips suitable for wearable computing devices possible.
In the field of artificial intelligence chips, deep learning, a branch of machine learning, is widely applied to image classification, speech recognition, object detection and other tasks, and has achieved remarkable results. Convolutional neural networks, recurrent neural networks and deep belief networks are the main focus of deep learning research, with convolutional neural networks being the most advanced.
A typical convolutional neural network consists of convolution layers, activation layers, pooling layers, fully connected layers, and input/output feature maps. The convolution layer performs feature extraction, the pooling layer compresses (downsamples) the feature map, and the fully connected layer performs classification. Convolution layers are computation-intensive, while fully connected layers are data-intensive.
Convolutional neural networks face three bottlenecks in data processing. First, the data are intensive and the volume to be processed is extremely large. Second, processing and storing the data consumes large amounts of computing resources and time. Third, there is a speed mismatch: the data are processed more slowly than they are generated. Dedicated chips suited to artificial intelligence architectures are therefore needed, and implementing neural network accelerators for high-speed data transmission and high-speed computation is of great significance in many respects.
At present, thanks to improvements in hardware performance, neural network training and inference are accelerated mainly on CPUs, GPUs, FPGAs and ASICs. Some twenty years ago the CPU was the mainstay for implementing neural network algorithms, and optimization focused mainly on the software side. The rising computational cost of CNNs makes hardware acceleration of their inference necessary. On the GPU side, GPU clusters can accelerate very large networks with one billion parameters in parallel; mainstream GPU-cluster training of neural networks typically uses distributed SGD algorithms, and many studies further exploit this parallelism to improve communication between clusters. FPGAs have many attractive features and have become a good platform for CNN hardware acceleration: they generally provide higher energy efficiency than CPUs and GPUs and higher performance than CPUs. Compared with GPUs, however, FPGA throughput is limited (on the order of tens of giga operations per second) and memory access is constrained; FPGAs also do not natively support floating-point computation, but their power consumption is lower. An ASIC is a special-purpose processor designed for a specific application; it offers small area, low power consumption, high computation speed and high reliability, at the cost of low flexibility, long development cycles and high cost.
Neural network hardware accelerators follow two typical architectures: the time-domain computing architecture (tree structure) and the spatial-domain computing architecture (PE array structure). The tree structure controls the arithmetic units and storage resources centrally, based on an instruction stream: each arithmetic logic unit fetches operands from the centralized storage system and writes the result back. It consists of a multiply-add tree, a buffer that distributes the input values, and a prefetch buffer. In the PE array structure, each arithmetic unit has a local memory and the whole architecture is controlled by a data flow: all PE units form a processing chain, and data is passed directly between PEs. It consists of a global buffer, FIFOs and a PE array. Each PE contains one or more multipliers and adders, enabling highly parallel computation.
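To illustrate the time-domain (tree) structure described above, the following Python sketch (illustrative only; the function name and data are assumptions, not part of the patent) shows a multiply-add tree reducing the products of input values and weights fetched from a central buffer:

```python
# Hedged sketch of a multiply-add tree: inputs and weights come from a central
# buffer, are multiplied in parallel, and the products are reduced pairwise.
def multiply_add_tree(activations, weights):
    products = [a * w for a, w in zip(activations, weights)]  # multiplier stage
    while len(products) > 1:                                  # adder-tree stages
        if len(products) % 2:
            products.append(0.0)                              # pad an odd level
        products = [products[i] + products[i + 1]
                    for i in range(0, len(products), 2)]
    return products[0]

print(multiply_add_tree([1, 2, 3, 4], [5, 6, 7, 8]))  # 1*5 + 2*6 + 3*7 + 4*8 = 70
```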
Neural network hardware accelerators use four typical dataflow patterns: the no-local-reuse dataflow, the input-stationary dataflow, the output-stationary dataflow and the weight-stationary dataflow. In the no-local-reuse dataflow, to maximize storage capacity and minimize off-chip memory bandwidth, the PEs are given no local storage; all of the area is instead allocated to the global buffer to increase its capacity, the input feature map is reused, the convolution kernel weights are streamed through once, and the partial sums are accumulated across the PE array. In the input-stationary dataflow, the computing core reads the input feature map into a local input register, fully reuses this input data to update all related partial sums in the output buffer, and the updated partial sums are written back to the output buffer. In the output-stationary dataflow, the computing core reads each channel of the input feature map into a local input register; the partial sums held in the core's output registers are fully reused to complete the accumulation along the channel direction of the three-dimensional convolution, and the final output feature map is written to the output buffer after pooling. In the weight-stationary dataflow, the computing core reads a block of the input feature map into a local input register and updates the block's output partial sums with this input data; the blocked convolution kernel weights stored in the weight buffer are fully reused to update the partial sums stored in the output buffer.
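As a rough illustration of where reuse happens in the weight-stationary case (a sketch under assumed names, not taken from the patent text), the loop nest below reads each kernel weight once and reuses it across all output positions before the next weight is fetched:

```python
# Hedged sketch: weight-stationary reuse for one 2-D convolution.
# Each kernel weight is read exactly once and reused for every output
# position before the next weight is fetched.
def weight_stationary_conv(ifmap, kernel):
    N, K = len(ifmap), len(kernel)
    M = N - K + 1
    psum = [[0.0] * M for _ in range(M)]        # output partial sums
    weight_fetches = 0
    for i in range(K):
        for j in range(K):
            w = kernel[i][j]                    # weight stays "stationary"
            weight_fetches += 1
            for r in range(M):                  # reuse w over all outputs
                for c in range(M):
                    psum[r][c] += w * ifmap[r + i][c + j]
    return psum, weight_fetches                 # K*K weight fetches in total

out, fetches = weight_stationary_conv([[1.0] * 7] * 7, [[1.0] * 3] * 3)
print(fetches, out[0][0])                       # 9 fetches; each output is 9.0
```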
Fig. 1 shows a simple convolution layer, in which 10 is a 7×7 input feature map, 11 is a 3×3 convolution kernel, and 12 is a 5×5 output feature map. The convolution kernel window slides over 10 row by row in a raster ("Z") order to perform the convolution, producing the result 12. Convolution accounts for more than ninety percent of the computation of the whole convolutional neural network model, so designing a structured PE array to optimize the convolution operation can effectively reduce the area and power consumption of the hardware acceleration structure and improve the overall performance of the system.
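The sliding-window operation of Fig. 1 can be expressed compactly as follows (an illustrative Python/NumPy sketch; variable names are assumptions, not part of the patent):

```python
# Minimal sketch of the direct convolution of Fig. 1: a 7x7 input feature map
# (10) convolved with a 3x3 kernel (11) gives a 5x5 output feature map (12).
import numpy as np

def direct_conv2d(ifmap: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    N = ifmap.shape[0]                  # input feature map size (7)
    K = kernel.shape[0]                 # convolution kernel size (3)
    M = N - K + 1                       # output feature map size (5)
    ofmap = np.zeros((M, M))
    for r in range(M):                  # slide the window row by row (raster order)
        for c in range(M):
            ofmap[r, c] = np.sum(ifmap[r:r + K, c:c + K] * kernel)
    return ofmap

ifmap = np.arange(49, dtype=float).reshape(7, 7)
kernel = np.ones((3, 3))
print(direct_conv2d(ifmap, kernel).shape)       # (5, 5)
```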
Disclosure of Invention
In view of the above background, the invention provides an acceleration array design method for neural network convolution layer operations. The method can be used as the computation part of an AI acceleration processor in the design of FPGA or ASIC neural network hardware accelerators.
In a spatial-domain computing structure, each arithmetic unit is controlled through the data flow, so the key problem to solve is the dataflow itself. To minimize repeated fetching and movement of the input data, maximize data reuse, and ideally read the input feature map only once, a weight-ring dataflow (Weight Ring dataflow, WR) is proposed.
Based on the proposed WR dataflow, the invention constructs a new neural network PE array architecture. Assuming the convolution kernel size of a given convolution layer is K², the corresponding PE array size is also K². The PE units are connected laterally for data movement and interconnected longitudinally for weight circulation.
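The connectivity of such an array can be pictured with a small index-mapping sketch (illustrative only; the coordinate scheme is an assumption, not part of the claims): data moves to the right-hand neighbour, while weights travel down a column and wrap from the bottom row back to the top:

```python
# Hedged sketch of the WR array topology for K = 3: K*K PE coordinates, with
# lateral data links and a longitudinal (vertical) weight ring per column.
K = 3
pes = [(row, col) for row in range(K) for col in range(K)]   # K*K PE units

def right_neighbour(row, col):
    # Lateral data movement; the rightmost PE has no right neighbour.
    return (row, col + 1) if col + 1 < K else None

def weight_ring_next(row, col):
    # Longitudinal weight circulation; the bottom row wraps back to the top.
    return ((row + 1) % K, col)

print(len(pes), right_neighbour(0, 2), weight_ring_next(2, 1))  # 9 None (0, 1)
```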
The advantage of the invention is that the weight values read from memory and the input feature map data are fully reused, so that accesses to external memory are greatly reduced and the overall power consumption and latency are lowered.
Drawings
FIG. 1 is a schematic diagram of a convolution layer;
FIG. 2 is a schematic diagram of a weight cycle data flow;
FIG. 3 is a schematic diagram of a neural network acceleration array based on a weight-cycled data stream;
FIG. 4 is a flowchart illustrating the operation of the PE array under the weight-circulation data stream.
Detailed Description
The weight-circulation data flow of the present invention and the corresponding array hardware implementation are described in detail below with reference to the accompanying drawings:
The weight-ring dataflow is illustrated in fig. 2: 20 is an input feature map of size N², where N is assumed to be 7; the convolution kernel 201 of the first cycle has size K², where K is assumed to be 3, and covers image rows 1 to K. In the convolution kernel 202 of the next cycle, the covered rows range from 2 to K+1, and so on for the kernels of cycles 203 and 204;
As can be seen from the operation of the convolution layer in Fig. 1, only one row of data changes each time the window 11 moves down a row; that is, only one of the original K rows of data is discarded. If all the data were reloaded after every row shift, the amount of data access would increase greatly. Therefore, the rows that are still needed keep their positions unchanged, and only the discarded row is overwritten with the newly loaded row.
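A minimal sketch of this row-replacement policy (illustrative Python; the buffer layout and names are assumptions) keeps the K buffered rows in fixed slots and overwrites only the slot holding the discarded row:

```python
# Hedged sketch: K image rows live in fixed buffer slots; when the window moves
# down one row, only the slot holding the discarded (oldest) row is rewritten.
def update_row_buffer(row_buffer, new_row, step):
    # step = number of times the window has already moved down (0, 1, 2, ...)
    K = len(row_buffer)
    slot = step % K                     # slot of the row no longer needed
    row_buffer[slot] = new_row          # the other K-1 rows keep their positions
    return row_buffer

buf = [[0] * 7, [1] * 7, [2] * 7]          # input rows 0..2 in slots 0..2
buf = update_row_buffer(buf, [3] * 7, 0)   # window moves to rows 1..3
print([row[0] for row in buf])             # [3, 1, 2]: only slot 0 was rewritten
```

One consequence of keeping rows in fixed slots is that the row occupying a given slot corresponds to a different kernel row after each update, which is precisely why the weights must circulate vertically in the array described next.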
The neural network acceleration array based on the weight-circulation data stream is shown in fig. 3. The array 30, which contains K² PE units 301, emulates the row-by-row sliding of the convolution kernel window over the image. The weight values of the convolution kernel are first preset into the PE array 30, and the K image rows are shifted into the array simultaneously to perform the convolution; on its own, this corresponds to the conventional weight-stationary dataflow. The WR dataflow proposed by the invention differs slightly: the weights of each column are connected to one another, providing a path along which the weights can flow, i.e. the vertical circulation lines in 30.
For the PE array workflow shown in FIG. 4, the PE array operates as follows (a behavioural sketch in Python is given after this list):
1) First, image rows 1 to K are input simultaneously and shifted to the right by the stride as a whole until the input feature map data fills 30 for the first time, and the convolution operations are performed in sequence;
2) After the input feature map rows 1 to K have been traversed, one image row is updated; the remaining K-1 rows of data that are not updated continue to be input in a circulating manner;
3) Each time an image row of 31 is updated, the weights of 30 are cyclically shifted down by one whole row and stored in the single weight register of the 301 unit in each row, and the K image rows of 30 are again shifted right as a whole to be convolved;
4) When the last image row has been updated, 30 completes the convolution of the last K rows of the input feature map and the operation ends.
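The steps above can be checked against an ordinary convolution with a short behavioural sketch (illustrative Python/NumPy; it models each PE row as a dot product and is not a cycle-accurate description of the hardware):

```python
# Hedged behavioural sketch of the weight-ring (WR) dataflow, steps 1)-4):
# image rows keep fixed slots, and kernel rows cycle down once per row update.
import numpy as np

def wr_dataflow_conv(ifmap: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    N, K = ifmap.shape[0], kernel.shape[0]
    M = N - K + 1
    ofmap = np.zeros((M, M))
    row_slots = [ifmap[r].copy() for r in range(K)]   # step 1): load rows 1..K
    weights = kernel.copy()                           # preset weights
    for t in range(M):                                # one output row per pass
        for c in range(M):                            # shift right step by step
            for s in range(K):                        # each PE row multiplies-adds
                ofmap[t, c] += np.dot(weights[s], row_slots[s][c:c + K])
        if t + K < N:
            row_slots[t % K] = ifmap[t + K].copy()    # step 2): update one row
            weights = np.roll(weights, 1, axis=0)     # step 3): cycle weights down
    return ofmap                                      # step 4): last K rows done

x = np.random.rand(7, 7)
w = np.random.rand(3, 3)
ref = np.array([[np.sum(x[i:i + 3, j:j + 3] * w) for j in range(5)]
                for i in range(5)])
assert np.allclose(wr_dataflow_conv(x, w), ref)       # matches direct convolution
```

The check passes because shifting the weights down by one row each time a new image row overwrites the oldest slot keeps every kernel row aligned with the image row it must multiply.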
A single 301 unit mainly consists of a multiplier, an adder and a weight register. The preset weights are first written into the weight registers, where they are stored and reused; when the next image row is loaded, each weight is cycled down into the weight register of the 301 unit in the next row.
Before the input feature map data fills 30 for the first time, the input activation is output directly to the forward activation signal without passing through the multiplier, and the data is passed on to the 301 unit on the right.
Once the input feature map data has filled 30 and the convolution operation has started, the input activation and the weight are fed to the multiplier to generate a partial sum. This partial sum is added to the partial-sum signal arriving from the left to produce a forward partial sum, which is passed to the 301 unit on the right. When the partial sums of each row reach the rightmost side of 30, the final convolution result is obtained by summing them through the K-1 edge adders.
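A single 301 unit can be modelled behaviourally as follows (an illustrative sketch; the class and method names are assumptions, not part of the patent):

```python
# Hedged sketch of one PE unit 301: a weight register, a multiplier and an adder.
class PEUnit:
    def __init__(self, weight=0.0):
        self.weight = weight                 # preset weight, reused until cycled

    def load_weight_from_above(self, weight):
        self.weight = weight                 # one step of the vertical weight ring

    def fill_phase(self, act_in):
        # Before the array is full: forward the activation unchanged to the
        # PE on the right, without using the multiplier.
        return act_in

    def compute_phase(self, act_in, psum_in):
        # After the array is full: multiply, add the partial sum arriving from
        # the left, and forward both activation and partial sum to the right.
        psum_out = psum_in + act_in * self.weight
        return act_in, psum_out

pe = PEUnit(weight=0.5)
print(pe.compute_phase(act_in=4.0, psum_in=1.0))     # (4.0, 3.0)
```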
In addition to the structure described above, the adder section may also be implemented as an adder tree.
The above description is only of the preferred embodiment of the present invention, and is not intended to limit the present invention in any way. Any person skilled in the art can make possible variations and modifications to the technical solution of the present invention, or modifications to equivalent embodiments, using the methods and technical solutions proposed above, without departing from the scope of the technical solution of the present invention. Therefore, any simple modification, equivalent variation and modification of the above embodiments according to the technical substance of the present invention still fall within the scope of the technical solution of the present invention.
Claims (4)
1. A neural network acceleration array based on a weight circulation data stream, wherein the PE array has the same size as the convolution window, and the PE units are connected laterally for data movement and interconnected longitudinally in a ring for cyclic weight movement;
the weight values of the convolution kernel are preset into the PE array; before the input feature map data fills the PE array for the first time, the input feature map is output directly to the PE unit on the right without passing through the multiplier, and the convolution operation is performed once the input feature map data has filled the PE array;
the convolution window has a size of K multiplied by K; image rows 1 to K are input simultaneously and shifted right by the stride as a whole until the input feature map data fills the PE array for the first time, and the convolution operations are performed in sequence; after the input feature map rows 1 to K have been traversed, one image row is updated, and the remaining K-1 rows of data that are not updated continue to be input in a circulating manner; each time an image row is updated, the weights of the PE array are cyclically shifted down by one whole row and stored in the weight register of the single PE unit in each row, and the K image rows of the PE array are again shifted right as a whole to be convolved; when the last image row has been updated, the PE array completes the convolution of the last K rows of the input feature map and the operation ends;
during the right shift of the input feature map, the weights keep their positions in the array unchanged, and the array performs one convolution each time the data shifts by one position;
when the input feature map data has filled the PE array and the convolution operation has started, the input feature map and the weights are fed to the multiplier to generate a partial sum; the partial sum is accumulated with the partial-sum signal arriving from the left to produce a forward partial sum, which is passed to the PE unit on the right; when the partial sums of each row reach the rightmost edge of the PE array, the final convolution result is obtained by summing them through the K-1 edge adders.
2. The neural network acceleration array of claim 1, wherein, apart from the K image rows that are input simultaneously at the beginning, the image rows do not flow once per beat as in a systolic array; instead, an image row is updated once per cycle after the PE array has traversed and convolved K rows.
3. The neural network acceleration array of claim 1, wherein the PE units are comprised of at least one data register, a multiplier, and an adder.
4. The neural network acceleration array of claim 1, wherein the PE units have control signals to control data flow in addition to data transmission signals.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211141844.9A CN115936064B (en) | 2022-09-20 | 2022-09-20 | Neural network acceleration array based on weight circulation data stream |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211141844.9A CN115936064B (en) | 2022-09-20 | 2022-09-20 | Neural network acceleration array based on weight circulation data stream |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115936064A CN115936064A (en) | 2023-04-07 |
CN115936064B (en) | 2024-09-20 |
Family
ID=86654617
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211141844.9A Active CN115936064B (en) | 2022-09-20 | 2022-09-20 | Neural network acceleration array based on weight circulation data stream |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115936064B (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107578098A (en) * | 2017-09-01 | 2018-01-12 | 中国科学院计算技术研究所 | Neural network processor based on systolic arrays |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10438117B1 (en) * | 2015-05-21 | 2019-10-08 | Google Llc | Computing convolutions using a neural network processor |
CN107918794A (en) * | 2017-11-15 | 2018-04-17 | 中国科学院计算技术研究所 | Neural network processor based on computing array |
CN110674927A (en) * | 2019-09-09 | 2020-01-10 | 之江实验室 | Data recombination method for pulse array structure |
CN113869507B (en) * | 2021-12-02 | 2022-04-15 | 之江实验室 | Neural network accelerator convolution calculation device and method based on pulse array |
- 2022-09-20: Application CN202211141844.9A (CN) filed; granted as CN115936064B (en), status Active
Also Published As
Publication number | Publication date |
---|---|
CN115936064A (en) | 2023-04-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |