CN115936064B - Neural network acceleration array based on weight circulation data stream - Google Patents
- Publication number: CN115936064B
- Application number: CN202211141844.9A
- Authority: CN (China)
- Legal status: Active
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to a neural network acceleration array based on a weight-circulation data stream, which fully reuses the weight values read from memory and the input feature map data, greatly reducing accesses to external memory; it belongs to the technical field of neural network hardware acceleration. In the field of artificial intelligence chips, convolution accounts for more than ninety percent of the computation of a convolutional neural network model. To reduce repeated fetching and movement of the input data in a spatial-domain computing structure and to maximize data reuse, the invention proposes a weight-circulation data stream. By designing a PE array based on this data stream, the convolution operation is optimized and the power consumption and latency of the hardware acceleration structure are effectively reduced, improving the overall performance of the system.
Description
Technical Field
The invention relates to the technical field of neural network hardware acceleration, and in particular to a design method for a neural network acceleration array based on a weight-circulation data stream.
Background
With the rapid development of Internet-of-Things technology, wearable intelligent products use parts of the human body as interfaces to the Internet. Being miniaturized, portable and intelligent, they deliver a truly integrated human-machine product experience, provide consumers with convenient real-time information acquisition and data services, and carry considerable technical content and market appeal.
Designing a high-performance wearable computing device, however, is not easy and faces many challenges. The field lies at the intersection of research areas such as computer science and engineering, and draws on technologies such as microelectronics and wireless communication. Advances in microelectronics have made small, low-power artificial intelligence chips suitable for wearable computing devices possible.
In the field of artificial intelligence chips, deep learning, a branch of machine learning, is widely applied to image classification, speech recognition, object detection and other tasks, and has achieved remarkable results. Convolutional neural networks, recurrent neural networks and deep belief networks are the main focus of deep learning research, with convolutional neural networks being the most advanced.
A typical convolutional neural network consists of convolution layers, activation layers, pooling layers, fully connected layers, and input/output feature maps. The convolution layer performs feature extraction, the pooling layer compresses (downsamples) the feature map, and the fully connected layer performs classification. Convolution layers are computation-intensive, while fully connected layers are data-intensive.
Convolutional neural networks face three bottlenecks in data processing. First, the data are intensive and the volume to be processed is extremely large. Second, processing and storing the data consumes large amounts of computing resources and time. Third, there is a speed mismatch: the data are processed more slowly than they are generated. Dedicated chips suited to artificial intelligence architectures are therefore needed, and implementing neural network accelerators for high-speed data transmission and high-speed computation is of great significance in many respects.
At present, thanks to improvements in hardware performance, neural network training and inference are accelerated mainly on CPUs, GPUs, FPGAs and ASICs. Some twenty years ago the CPU was the mainstay for implementing neural network algorithms, and optimization focused mainly on the software side. The rising computational cost of CNNs makes hardware acceleration of their inference necessary. On the GPU side, GPU clusters can accelerate very large networks with one billion parameters in parallel; mainstream GPU-cluster training of neural networks typically uses distributed SGD algorithms, and many studies further exploit this parallelism to improve communication between clusters. FPGAs have many attractive features and have become a good platform for CNN hardware acceleration: they generally provide higher energy efficiency than CPUs and GPUs and higher performance than CPUs. Compared with GPUs, however, FPGA throughput is limited (on the order of tens of giga operations per second) and memory access is constrained; FPGAs also do not natively support floating-point computation, but their power consumption is lower. An ASIC is a special-purpose processor designed for a specific application; it offers small area, low power consumption, high computation speed and high reliability, at the cost of low flexibility, long development cycles and high cost.
Neural network hardware accelerators follow two typical architectures: the time-domain computing architecture (tree structure) and the spatial-domain computing architecture (PE array structure). The tree structure controls the arithmetic units and storage resources centrally, based on an instruction stream: each arithmetic logic unit fetches operands from the centralized storage system and writes the result back. It consists of a multiply-add tree, a buffer that distributes the input values, and a prefetch buffer. In the PE array structure, each arithmetic unit has a local memory and the whole architecture is controlled by a data flow: all PE units form a processing chain, and data is passed directly between PEs. It consists of a global buffer, FIFOs and a PE array. Each PE contains one or more multipliers and adders, enabling highly parallel computation.
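To illustrate the time-domain (tree) structure described above, the following Python sketch (illustrative only; the function name and data are assumptions, not part of the patent) shows a multiply-add tree reducing the products of input values and weights fetched from a central buffer:

```python
# Hedged sketch of a multiply-add tree: inputs and weights come from a central
# buffer, are multiplied in parallel, and the products are reduced pairwise.
def multiply_add_tree(activations, weights):
    products = [a * w for a, w in zip(activations, weights)]  # multiplier stage
    while len(products) > 1:                                  # adder-tree stages
        if len(products) % 2:
            products.append(0.0)                              # pad an odd level
        products = [products[i] + products[i + 1]
                    for i in range(0, len(products), 2)]
    return products[0]

print(multiply_add_tree([1, 2, 3, 4], [5, 6, 7, 8]))  # 1*5 + 2*6 + 3*7 + 4*8 = 70
```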
Neural network hardware accelerators use four typical dataflow patterns: the no-local-reuse dataflow, the input-stationary dataflow, the output-stationary dataflow and the weight-stationary dataflow. In the no-local-reuse dataflow, to maximize storage capacity and minimize off-chip memory bandwidth, the PEs are given no local storage; all of the area is instead allocated to the global buffer to increase its capacity, the input feature map is reused, the convolution kernel weights are streamed through once, and the partial sums are accumulated across the PE array. In the input-stationary dataflow, the computing core reads the input feature map into a local input register, fully reuses this input data to update all related partial sums in the output buffer, and the updated partial sums are written back to the output buffer. In the output-stationary dataflow, the computing core reads each channel of the input feature map into a local input register; the partial sums held in the core's output registers are fully reused to complete the accumulation along the channel direction of the three-dimensional convolution, and the final output feature map is written to the output buffer after pooling. In the weight-stationary dataflow, the computing core reads a block of the input feature map into a local input register and updates the block's output partial sums with this input data; the blocked convolution kernel weights stored in the weight buffer are fully reused to update the partial sums stored in the output buffer.
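As a rough illustration of where reuse happens in the weight-stationary case (a sketch under assumed names, not taken from the patent text), the loop nest below reads each kernel weight once and reuses it across all output positions before the next weight is fetched:

```python
# Hedged sketch: weight-stationary reuse for one 2-D convolution.
# Each kernel weight is read exactly once and reused for every output
# position before the next weight is fetched.
def weight_stationary_conv(ifmap, kernel):
    N, K = len(ifmap), len(kernel)
    M = N - K + 1
    psum = [[0.0] * M for _ in range(M)]        # output partial sums
    weight_fetches = 0
    for i in range(K):
        for j in range(K):
            w = kernel[i][j]                    # weight stays "stationary"
            weight_fetches += 1
            for r in range(M):                  # reuse w over all outputs
                for c in range(M):
                    psum[r][c] += w * ifmap[r + i][c + j]
    return psum, weight_fetches                 # K*K weight fetches in total

out, fetches = weight_stationary_conv([[1.0] * 7] * 7, [[1.0] * 3] * 3)
print(fetches, out[0][0])                       # 9 fetches; each output is 9.0
```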
Fig. 1 shows a simple convolution layer, in which 10 is a 7×7 input feature map, 11 is a 3×3 convolution kernel, and 12 is a 5×5 output feature map. The convolution kernel window slides over 10 row by row in a raster ("Z") order to perform the convolution, producing the result 12. Convolution accounts for more than ninety percent of the computation of the whole convolutional neural network model, so designing a structured PE array to optimize the convolution operation can effectively reduce the area and power consumption of the hardware acceleration structure and improve the overall performance of the system.
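The sliding-window operation of Fig. 1 can be expressed compactly as follows (an illustrative Python/NumPy sketch; variable names are assumptions, not part of the patent):

```python
# Minimal sketch of the direct convolution of Fig. 1: a 7x7 input feature map
# (10) convolved with a 3x3 kernel (11) gives a 5x5 output feature map (12).
import numpy as np

def direct_conv2d(ifmap: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    N = ifmap.shape[0]                  # input feature map size (7)
    K = kernel.shape[0]                 # convolution kernel size (3)
    M = N - K + 1                       # output feature map size (5)
    ofmap = np.zeros((M, M))
    for r in range(M):                  # slide the window row by row (raster order)
        for c in range(M):
            ofmap[r, c] = np.sum(ifmap[r:r + K, c:c + K] * kernel)
    return ofmap

ifmap = np.arange(49, dtype=float).reshape(7, 7)
kernel = np.ones((3, 3))
print(direct_conv2d(ifmap, kernel).shape)       # (5, 5)
```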
Disclosure of Invention
In view of the above background, the invention provides an acceleration array design method for neural network convolution layer operations. The method can be used as the computation part of an AI acceleration processor in the design of FPGA or ASIC neural network hardware accelerators.
In a spatial-domain computing structure, each arithmetic unit is controlled through the data flow, so the key problem to solve is the dataflow itself. To minimize repeated fetching and movement of the input data, maximize data reuse, and ideally read the input feature map only once, a weight-ring dataflow (Weight Ring dataflow, WR) is proposed.
Based on the proposed WR dataflow, the invention constructs a new neural network PE array architecture. Assuming the convolution kernel size of a given convolution layer is K², the corresponding PE array size is also K². The PE units are connected laterally for data movement and interconnected longitudinally for weight circulation.
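The connectivity of such an array can be pictured with a small index-mapping sketch (illustrative only; the coordinate scheme is an assumption, not part of the claims): data moves to the right-hand neighbour, while weights travel down a column and wrap from the bottom row back to the top:

```python
# Hedged sketch of the WR array topology for K = 3: K*K PE coordinates, with
# lateral data links and a longitudinal (vertical) weight ring per column.
K = 3
pes = [(row, col) for row in range(K) for col in range(K)]   # K*K PE units

def right_neighbour(row, col):
    # Lateral data movement; the rightmost PE has no right neighbour.
    return (row, col + 1) if col + 1 < K else None

def weight_ring_next(row, col):
    # Longitudinal weight circulation; the bottom row wraps back to the top.
    return ((row + 1) % K, col)

print(len(pes), right_neighbour(0, 2), weight_ring_next(2, 1))  # 9 None (0, 1)
```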
The advantage of the invention is that the weight values read from memory and the input feature map data are fully reused, so that accesses to external memory are greatly reduced and the overall power consumption and latency are lowered.
Drawings
FIG. 1 is a schematic diagram of a convolution layer;
FIG. 2 is a schematic diagram of a weight cycle data flow;
FIG. 3 is a schematic diagram of a neural network acceleration array based on a weight-cycled data stream;
FIG. 4 is a flowchart illustrating the operation of the PE array under the weight-circulation data stream.
Detailed Description
The weight-circulation data flow of the present invention and the corresponding array hardware implementation are described in detail below with reference to the accompanying drawings:
The weight-ring dataflow is illustrated in fig. 2: 20 is an input feature map of size N², where N is assumed to be 7; the convolution kernel 201 of the first cycle has size K², where K is assumed to be 3, and covers image rows 1 to K. In the convolution kernel 202 of the next cycle, the covered rows range from 2 to K+1, and so on for the kernels of cycles 203 and 204;
As can be seen from the operation of the convolution layer in Fig. 1, only one row of data changes each time the window 11 moves down a row; that is, only one of the original K rows of data is discarded. If all the data were reloaded after every row shift, the amount of data access would increase greatly. Therefore, the rows that are still needed keep their positions unchanged, and only the discarded row is overwritten with the newly loaded row.
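A minimal sketch of this row-replacement policy (illustrative Python; the buffer layout and names are assumptions) keeps the K buffered rows in fixed slots and overwrites only the slot holding the discarded row:

```python
# Hedged sketch: K image rows live in fixed buffer slots; when the window moves
# down one row, only the slot holding the discarded (oldest) row is rewritten.
def update_row_buffer(row_buffer, new_row, step):
    # step = number of times the window has already moved down (0, 1, 2, ...)
    K = len(row_buffer)
    slot = step % K                     # slot of the row no longer needed
    row_buffer[slot] = new_row          # the other K-1 rows keep their positions
    return row_buffer

buf = [[0] * 7, [1] * 7, [2] * 7]          # input rows 0..2 in slots 0..2
buf = update_row_buffer(buf, [3] * 7, 0)   # window moves to rows 1..3
print([row[0] for row in buf])             # [3, 1, 2]: only slot 0 was rewritten
```

One consequence of keeping rows in fixed slots is that the row occupying a given slot corresponds to a different kernel row after each update, which is precisely why the weights must circulate vertically in the array described next.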
The neural network acceleration array based on the weight-circulation data stream is shown in fig. 3. The array 30, which contains K² PE units 301, emulates the row-by-row sliding of the convolution kernel window over the image. The weight values of the convolution kernel are first preset into the PE array 30, and the K image rows are shifted into the array simultaneously to perform the convolution; on its own, this corresponds to the conventional weight-stationary dataflow. The WR dataflow proposed by the invention differs slightly: the weights of each column are connected to one another, providing a path along which the weights can flow, i.e. the vertical circulation lines in 30.
For the PE array workflow shown in FIG. 4, the PE array operates as follows (a behavioural sketch in Python is given after this list):
1) First, image rows 1 to K are input simultaneously and shifted to the right by the stride as a whole until the input feature map data fills 30 for the first time, and the convolution operations are performed in sequence;
2) After the input feature map rows 1 to K have been traversed, one image row is updated; the remaining K-1 rows of data that are not updated continue to be input in a circulating manner;
3) Each time an image row of 31 is updated, the weights of 30 are cyclically shifted down by one whole row and stored in the single weight register of the 301 unit in each row, and the K image rows of 30 are again shifted right as a whole to be convolved;
4) When the last image row has been updated, 30 completes the convolution of the last K rows of the input feature map and the operation ends.
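The steps above can be checked against an ordinary convolution with a short behavioural sketch (illustrative Python/NumPy; it models each PE row as a dot product and is not a cycle-accurate description of the hardware):

```python
# Hedged behavioural sketch of the weight-ring (WR) dataflow, steps 1)-4):
# image rows keep fixed slots, and kernel rows cycle down once per row update.
import numpy as np

def wr_dataflow_conv(ifmap: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    N, K = ifmap.shape[0], kernel.shape[0]
    M = N - K + 1
    ofmap = np.zeros((M, M))
    row_slots = [ifmap[r].copy() for r in range(K)]   # step 1): load rows 1..K
    weights = kernel.copy()                           # preset weights
    for t in range(M):                                # one output row per pass
        for c in range(M):                            # shift right step by step
            for s in range(K):                        # each PE row multiplies-adds
                ofmap[t, c] += np.dot(weights[s], row_slots[s][c:c + K])
        if t + K < N:
            row_slots[t % K] = ifmap[t + K].copy()    # step 2): update one row
            weights = np.roll(weights, 1, axis=0)     # step 3): cycle weights down
    return ofmap                                      # step 4): last K rows done

x = np.random.rand(7, 7)
w = np.random.rand(3, 3)
ref = np.array([[np.sum(x[i:i + 3, j:j + 3] * w) for j in range(5)]
                for i in range(5)])
assert np.allclose(wr_dataflow_conv(x, w), ref)       # matches direct convolution
```

The check passes because shifting the weights down by one row each time a new image row overwrites the oldest slot keeps every kernel row aligned with the image row it must multiply.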
A single 301 unit mainly consists of a multiplier, an adder and a weight register. The preset weights are first written into the weight registers, where they are stored and reused; when the next image row is loaded, each weight is cycled down into the weight register of the 301 unit in the next row.
Before the input feature map data fills 30 for the first time, the input activation is output directly to the forward activation signal without passing through the multiplier, and the data is passed on to the 301 unit on the right.
Once the input feature map data has filled 30 and the convolution operation has started, the input activation and the weight are fed to the multiplier to generate a partial sum. This partial sum is added to the partial-sum signal arriving from the left to produce a forward partial sum, which is passed to the 301 unit on the right. When the partial sums of each row reach the rightmost side of 30, the final convolution result is obtained by summing them through the K-1 edge adders.
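A single 301 unit can be modelled behaviourally as follows (an illustrative sketch; the class and method names are assumptions, not part of the patent):

```python
# Hedged sketch of one PE unit 301: a weight register, a multiplier and an adder.
class PEUnit:
    def __init__(self, weight=0.0):
        self.weight = weight                 # preset weight, reused until cycled

    def load_weight_from_above(self, weight):
        self.weight = weight                 # one step of the vertical weight ring

    def fill_phase(self, act_in):
        # Before the array is full: forward the activation unchanged to the
        # PE on the right, without using the multiplier.
        return act_in

    def compute_phase(self, act_in, psum_in):
        # After the array is full: multiply, add the partial sum arriving from
        # the left, and forward both activation and partial sum to the right.
        psum_out = psum_in + act_in * self.weight
        return act_in, psum_out

pe = PEUnit(weight=0.5)
print(pe.compute_phase(act_in=4.0, psum_in=1.0))     # (4.0, 3.0)
```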
In addition to the structure described above, the adder section may also be implemented as an adder tree.
The above description is only of the preferred embodiment of the present invention, and is not intended to limit the present invention in any way. Any person skilled in the art can make possible variations and modifications to the technical solution of the present invention, or modifications to equivalent embodiments, using the methods and technical solutions proposed above, without departing from the scope of the technical solution of the present invention. Therefore, any simple modification, equivalent variation and modification of the above embodiments according to the technical substance of the present invention still fall within the scope of the technical solution of the present invention.
Claims (4)
1. A neural network acceleration array based on a weight circulation data stream, wherein the PE array has the same size as the convolution window, and the PE units are connected laterally for data movement and interconnected longitudinally in a ring for cyclic weight movement;
the weight values of the convolution kernel are preset into the PE array; before the input feature map data fills the PE array for the first time, the input feature map is output directly to the PE unit on the right without passing through the multiplier, and the convolution operation is performed once the input feature map data has filled the PE array;
the convolution window has a size of K multiplied by K; image rows 1 to K are input simultaneously and shifted right by the stride as a whole until the input feature map data fills the PE array for the first time, and the convolution operations are performed in sequence; after the input feature map rows 1 to K have been traversed, one image row is updated, and the remaining K-1 rows of data that are not updated continue to be input in a circulating manner; each time an image row is updated, the weights of the PE array are cyclically shifted down by one whole row and stored in the weight register of the single PE unit in each row, and the K image rows of the PE array are again shifted right as a whole to be convolved; when the last image row has been updated, the PE array completes the convolution of the last K rows of the input feature map and the operation ends;
during the right shift of the input feature map, the weights keep their positions in the array unchanged, and the array performs one convolution each time the data shifts by one position;
when the input feature map data has filled the PE array and the convolution operation has started, the input feature map and the weights are fed to the multiplier to generate a partial sum; the partial sum is accumulated with the partial-sum signal arriving from the left to produce a forward partial sum, which is passed to the PE unit on the right; when the partial sums of each row reach the rightmost edge of the PE array, the final convolution result is obtained by summing them through the K-1 edge adders.
2. The neural network acceleration array of claim 1, wherein, apart from the K image rows that are input simultaneously at the beginning, the image rows do not flow once per beat as in a systolic array; instead, an image row is updated once per cycle after the PE array has traversed and convolved K rows.
3. The neural network acceleration array of claim 1, wherein the PE units are comprised of at least one data register, a multiplier, and an adder.
4. The neural network acceleration array of claim 1, wherein the PE units have control signals to control data flow in addition to data transmission signals.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211141844.9A CN115936064B (en) | 2022-09-20 | 2022-09-20 | Neural network acceleration array based on weight circulation data stream |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211141844.9A CN115936064B (en) | 2022-09-20 | 2022-09-20 | Neural network acceleration array based on weight circulation data stream |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115936064A CN115936064A (en) | 2023-04-07 |
CN115936064B (en) | 2024-09-20 |
Family
ID=86654617
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211141844.9A Active CN115936064B (en) | 2022-09-20 | 2022-09-20 | Neural network acceleration array based on weight circulation data stream |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115936064B (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107578098A (en) * | 2017-09-01 | 2018-01-12 | 中国科学院计算技术研究所 | Neural network processor based on systolic arrays |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10438117B1 (en) * | 2015-05-21 | 2019-10-08 | Google Llc | Computing convolutions using a neural network processor |
CN107918794A (en) * | 2017-11-15 | 2018-04-17 | 中国科学院计算技术研究所 | Neural network processor based on computing array |
CN110674927A (en) * | 2019-09-09 | 2020-01-10 | 之江实验室 | Data recombination method for pulse array structure |
CN113869507B (en) * | 2021-12-02 | 2022-04-15 | 之江实验室 | Neural network accelerator convolution calculation device and method based on pulse array |
- 2022-09-20: Application CN202211141844.9A (CN) filed; granted as CN115936064B (en), status Active
Also Published As
Publication number | Publication date |
---|---|
CN115936064A (en) | 2023-04-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |