CN115936064B - Neural network acceleration array based on weight circulation data stream

Neural network acceleration array based on weight circulation data stream

Info

Publication number
CN115936064B
Authority
CN
China
Prior art keywords
array
data
input
neural network
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211141844.9A
Other languages
Chinese (zh)
Other versions
CN115936064A (en)
Inventor
程筱舒
王忆文
娄鸿飞
李平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202211141844.9A priority Critical patent/CN115936064B/en
Publication of CN115936064A publication Critical patent/CN115936064A/en
Application granted granted Critical
Publication of CN115936064B publication Critical patent/CN115936064B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Complex Calculations (AREA)

Abstract

The invention relates to a neural network acceleration array based on a weight circulation (weight ring) dataflow, which fully reuses the weight values read from memory and the input feature map data and thereby greatly reduces accesses to the external memory; it belongs to the technical field of neural network hardware acceleration. In the field of artificial intelligence chips, convolution accounts for more than ninety percent of the computation of an entire convolutional neural network model. To reduce the repeated fetching and movement of input data in a spatial computing architecture and to maximize data reuse, the invention proposes a weight circulation dataflow. By designing a PE array around this dataflow, the convolution operation is optimized, and the power consumption and latency of the hardware acceleration structure are effectively reduced, improving the overall performance of the system.

Description

Neural network acceleration array based on weight circulation data stream
Technical Field
The invention relates to the technical field of neural network hardware acceleration, and in particular to a design method for a neural network acceleration array based on a weight circulation dataflow.
Background
With the rapid development of Internet-of-Things technology, wearable smart products use the human body itself as an interface to the Internet. Their small size, portability, and intelligence deliver a genuinely integrated human-machine product experience, provide consumers with convenient real-time information acquisition and data services, and carry considerable technical content and market appeal.
Designing a high-performance wearable computing device is nevertheless not easy and faces many challenges. The field lies at the intersection of research areas such as computer science and engineering and draws on technologies such as microelectronics and wireless communication. Advances in microelectronics have made possible small, low-power artificial intelligence chips suitable for wearable computing devices.
In the field of artificial intelligence chips, deep learning, a branch of machine learning, has been widely applied to image classification, speech recognition, object detection, and similar tasks, with remarkable results. Convolutional neural networks, recurrent neural networks, and deep belief networks are the main focus of deep learning research, with convolutional neural networks being the most advanced.
A typical convolutional neural network comprises convolution layers, activation layers, pooling layers, fully connected layers, and input and output feature maps. The convolution layers perform feature extraction, the pooling layers compress pixels, and the fully connected layers perform classification. The convolution layers are computation-intensive, while the fully connected layers are data-intensive.
Convolutional neural networks face three bottlenecks in data processing. First, the workload is data-intensive: the amount of data to be processed is extremely large. Second, processing and storing this data consumes large amounts of computing resources and time. Third, there is a speed mismatch: data is processed more slowly than it is generated. Dedicated chips tailored to artificial intelligence architectures are therefore needed, and neural network accelerators for high-speed data transfer and high-speed computation are of great significance in many applications.
At present, thanks to improvements in hardware performance, neural network training and inference are mainly accelerated on CPUs, GPUs, FPGAs, and ASICs. For roughly the last twenty years CPUs have been the mainstay for implementing neural network algorithms, with optimization focused mainly on software. The growing computational cost of CNNs now makes hardware acceleration of the inference process necessary. On the GPU side, GPU clusters can accelerate very large networks with a billion parameters in parallel; mainstream GPU-cluster training typically uses distributed SGD algorithms, and many studies further exploit this parallelism to improve communication between clusters. FPGAs have many attractive features and have therefore become a good platform for CNN hardware acceleration: they generally offer higher energy efficiency than CPUs and GPUs, and higher performance than CPUs. Compared with GPUs, however, FPGA throughput is only on the order of tens of giga-operations per second and memory access is limited; moreover, FPGAs do not natively support floating-point computation, although their power consumption is lower. An ASIC is a special-purpose processor designed for a specific application; it offers small area, low power consumption, high computing speed, and high reliability, at the cost of low flexibility, long development cycles, and high cost.
Neural network hardware accelerators use two typical architectures: the time-domain computing architecture (tree structure) and the spatial computing architecture (PE-array structure). The tree structure centrally controls the arithmetic units and storage resources through an instruction stream; each arithmetic logic unit fetches operands from a centralized storage system and writes results back. It consists of a multiply-add tree, a buffer for distributing the input values, and a prefetch buffer. In the PE-array structure, each arithmetic unit has its own local memory, and the architecture is controlled by dataflow: the PE units form a processing chain and data is passed directly between PEs. It consists of a global buffer, FIFOs, and a PE array. Each PE contains one or more multipliers and adders, enabling highly parallel computation.
Neural network hardware accelerators use four typical dataflow patterns: the no-local-reuse dataflow, the input-stationary dataflow, the output-stationary dataflow, and the weight-stationary dataflow. In the no-local-reuse dataflow, to maximize storage capacity and minimize off-chip memory bandwidth, no local storage is allocated to the PEs; all area is instead devoted to the global buffer to increase its capacity, the input feature map is reused, the convolution kernel weights are read in a single pass, and partial sums are accumulated through the PE array. In the input-stationary dataflow, the computing core reads the input feature map into local input registers, fully reuses the input data to update all related partial sums in the output buffer, and writes the updated partial sums back to the output buffer. In the output-stationary dataflow, the computing core reads each channel of the input feature map into local input registers; the partial sums held in the core's output registers are fully reused to complete the accumulation along the channel dimension of the three-dimensional convolution, and the final output feature map is written to the output buffer after pooling. In the weight-stationary dataflow, the computing core reads a block of the input feature map into local input registers and uses it to update the partial sums of that block; the blocked convolution kernel weights held in the weight cache are fully reused to update the partial sums held in the output cache.
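For concreteness, the following sketch (not part of the original disclosure; sizes and names are illustrative) shows the weight-stationary pattern as a loop ordering: each weight is fetched once and held fixed while every output it contributes to is updated. The other patterns differ mainly in which operand is pinned in the inner loops.

```python
import numpy as np

def conv2d_weight_stationary(ifmap, kernel):
    """Direct convolution written with a weight-stationary loop order."""
    N = ifmap.shape[0]
    K = kernel.shape[0]
    M = N - K + 1
    out = np.zeros((M, M))
    for ki in range(K):                 # this weight row stays resident
        for kj in range(K):             # this weight column stays resident
            w = kernel[ki, kj]          # each weight is read from memory once
            # Stream the input feature map past the fixed weight and
            # accumulate the partial sums held in the output buffer.
            out += w * ifmap[ki:ki + M, kj:kj + M]
    return out

ifmap = np.random.default_rng(0).standard_normal((7, 7))
kernel = np.random.default_rng(1).standard_normal((3, 3))
ref = np.array([[np.sum(ifmap[r:r + 3, c:c + 3] * kernel) for c in range(5)]
                for r in range(5)])
assert np.allclose(conv2d_weight_stationary(ifmap, kernel), ref)
```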
Fig. 1 is a diagram of a simple convolution layer, in which 10 is a 7×7 input feature map, 11 is a 3×3 convolution kernel, and 12 is a 5×5 output feature map. The convolution kernel window slides over 10 row by row in a raster ("Z") order to perform the convolution, producing the result 12. Convolution accounts for more than ninety percent of the computation of the whole convolutional neural network model, so designing a structured PE array to optimize the convolution operation can effectively reduce the area and power consumption of the hardware acceleration structure and improve the overall performance of the system.
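As an illustration of Fig. 1, a minimal sliding-window model (illustrative code, not taken from the patent) reproduces the relationship between a 7×7 input, a 3×3 kernel, and a 5×5 output:

```python
import numpy as np

def conv2d_raster(ifmap, kernel, stride=1):
    """Slide the kernel window over the input row by row in raster ("Z") order."""
    N, K = ifmap.shape[0], kernel.shape[0]
    M = (N - K) // stride + 1
    out = np.zeros((M, M))
    for r in range(M):
        for c in range(M):
            window = ifmap[r * stride:r * stride + K, c * stride:c * stride + K]
            out[r, c] = np.sum(window * kernel)
    return out

# Dimensions of Fig. 1: a 7x7 input feature map (10) convolved with a
# 3x3 kernel (11) yields a 5x5 output feature map (12).
ifmap = np.arange(49, dtype=float).reshape(7, 7)
kernel = np.ones((3, 3))
assert conv2d_raster(ifmap, kernel).shape == (5, 5)
```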
Disclosure of Invention
In view of the above background, the invention provides an acceleration array design method for neural network convolution layer operations. The method can be applied, as the computational part of an AI acceleration processor, to the design of FPGA or ASIC neural network hardware accelerators.
In a spatial (space-domain) computing architecture, each arithmetic unit is controlled by the dataflow, so the key problem is the choice of dataflow. To minimize repeated fetching and movement of the input data, maximize data reuse, and ideally read the input feature map only once, a weight-cycling dataflow (weight ring dataflow, WR) is proposed.
Based on the proposed WR dataflow, the invention constructs a new neural network PE array architecture. Assuming the convolution kernel of a given convolution layer has size K², the corresponding PE array also contains K² units. The PE units are interconnected horizontally for data movement and vertically for weight cycling.
The advantage of the invention is that the weight values read from memory and the input feature map data are fully reused, which greatly reduces accesses to the external memory and lowers overall power consumption and latency.
Drawings
FIG. 1 is a schematic diagram of a convolution layer;
FIG. 2 is a schematic diagram of a weight cycle data flow;
FIG. 3 is a schematic diagram of a neural network acceleration array based on a weight-cycled data stream;
FIG. 4 is a flowchart illustrating the operation of the PE array under the weight-cycling dataflow.
Detailed Description
The weight-cycling dataflow of the invention and the corresponding array hardware implementation are described in detail below with reference to the accompanying drawings:
The weight-cycling dataflow is illustrated in Fig. 2. Reference numeral 20 is an input feature map of size N², where N is assumed to be 7; the convolution kernel 201 of the first cycle has size K², where K is assumed to be 3, and its sliding window covers rows 1 to K. The convolution kernel 202 of the next cycle covers rows 2 to K+1, and so on for the kernels of cycles 203 and 204;
As the operation of the convolution layer in Fig. 1 shows, only one row of data changes each time the window 11 moves down by one row: of the original K rows, only one row is discarded. If all the data were reloaded after every row change, the amount of data access would increase greatly. Therefore, the rows that are still needed keep their positions unchanged, and only the discarded row is overwritten with the newly loaded row.
The neural network acceleration array based on the weight-cycling dataflow is shown in Fig. 3. The array of K² PE units 301 emulates the row-by-row sliding of the convolution kernel window over the image. The weight values of the convolution kernel are first preset into the PE array 30; when the K image rows are shifted into the array together to perform the convolution, this by itself is the conventional weight-stationary dataflow. The WR dataflow proposed by the invention differs slightly: the weights of each column are connected to provide a path along which the weights can flow, namely the vertical circulation lines in 30.
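The effect of the vertical weight ring can be checked with a small numerical sketch (assumed sizes K = 3, N = 7; illustrative code, not part of the patent): overwriting only the stale image row in place and rotating the weight rows down by one position reproduces the ordinary convolution window that has moved down by one row.

```python
import numpy as np

K, N = 3, 7
rng = np.random.default_rng(0)
ifmap = rng.standard_normal((N, N))
w = rng.standard_normal((K, K))

buf = ifmap[0:K].copy()      # array buffer holding image rows 0..K-1
w_ring = w.copy()            # weights preset into the PE array

# The window moves down one image row: overwrite only the stale row in place ...
buf[0] = ifmap[K]            # new image row K replaces discarded image row 0
# ... and cycle the weight rows down by one position (the vertical weight ring).
w_ring = np.roll(w_ring, 1, axis=0)

# The rotated pairing reproduces the ordinary window that now covers rows 1..K.
for c in range(N - K + 1):
    assert np.isclose(np.sum(buf[:, c:c + K] * w_ring),
                      np.sum(ifmap[1:1 + K, c:c + K] * w))
```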
For the PE array workflow shown in Fig. 4, the PE array operates as follows (a behavioural sketch of these steps is given after the list):
1) First, image rows 1 to K are input simultaneously and shifted right by the stride until the input feature map data fills the array 30 for the first time; the convolutions are then performed in sequence;
2) After image rows 1 to K have been traversed, one image row is updated; the remaining K-1 rows that are not updated continue to circulate as inputs;
3) Each time an image row 31 is updated, the weights in 30 are cyclically shifted down by one whole row and stored in the weight registers of the individual PE units 301 of each row, and the K image rows in 30 are again shifted right as a whole to be convolved;
4) When all image rows have been updated, 30 completes the convolution of the last K rows of the input feature map and the operation ends.
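A behavioural model of the four steps above (illustrative Python, assuming a stride of 1; not cycle accurate and not part of the original disclosure) can be checked against a plain sliding-window convolution:

```python
import numpy as np

def conv2d_reference(ifmap, kernel):
    """Plain sliding-window convolution used to check the model below."""
    N, K = ifmap.shape[0], kernel.shape[0]
    M = N - K + 1
    return np.array([[np.sum(ifmap[r:r + K, c:c + K] * kernel)
                      for c in range(M)] for r in range(M)])

def conv2d_weight_ring(ifmap, kernel):
    """Behavioural model of the WR PE array following steps 1) to 4)."""
    N, K = ifmap.shape[0], kernel.shape[0]
    M = N - K + 1
    out = np.zeros((M, M))

    buf = ifmap[0:K].copy()        # step 1: image rows 1..K loaded together
    w_ring = kernel.copy()         # weights preset into the PE array
    stale = 0                      # physical row holding the oldest image row

    for r in range(M):
        # image data shifts right through the array; one output per position
        for c in range(M):
            out[r, c] = np.sum(buf[:, c:c + K] * w_ring)
        if r + K < N:
            # step 2: overwrite only the stale image row with the new row ...
            buf[stale] = ifmap[r + K]
            stale = (stale + 1) % K
            # step 3: ... and cycle the weight rows down by one position
            w_ring = np.roll(w_ring, 1, axis=0)
    return out                     # step 4: last K rows done, operation ends

ifmap = np.random.default_rng(1).standard_normal((7, 7))
kernel = np.random.default_rng(2).standard_normal((3, 3))
assert np.allclose(conv2d_weight_ring(ifmap, kernel),
                   conv2d_reference(ifmap, kernel))
```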
A single PE unit 301 consists mainly of a multiplier, an adder, and a weight register. The preset weights are first loaded into the weight registers and are stored and reused there; only when the next image row is loaded are they cycled down into the weight registers of the PE units 301 of the row below.
Before the input feature map data fills 30 for the first time, the input activation is passed directly to the forward activation signal without going through the multiplier, and the data is simply transferred to the PE unit 301 on the right.
Once the input feature map data has filled 30 and the convolution operation begins, the input activation and the weight are fed to the multiplier to generate a partial sum. This partial sum is added to the partial-sum signal arriving from the left to produce a forward partial sum, which is passed to the PE unit 301 on the right. When the partial sums of each row reach the rightmost edge of 30, the final convolution result is obtained by combining them with the K-1 edge adders.
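The PE behaviour just described can be sketched as follows (illustrative code; signal and field names such as act_in and psum_in are assumptions, not taken from the patent): each PE forwards the activation to its right neighbour, bypasses the multiplier during the fill phase, and otherwise adds its product to the partial sum arriving from the left; the K row results are then combined by K-1 edge additions.

```python
from dataclasses import dataclass

@dataclass
class PE:
    """Behavioural sketch of a single PE unit 301 (names are illustrative)."""
    weight: float = 0.0                      # weight register, preloaded and reused

    def step(self, act_in, psum_in, array_full):
        """One beat: forward the activation right and accumulate a partial sum."""
        act_out = act_in                     # activation always passes to the right
        if not array_full:                   # fill phase: bypass the multiplier
            return act_out, psum_in
        return act_out, psum_in + act_in * self.weight

def row_partial_sum(pe_row, acts):
    """Chain K PEs of one row; in hardware the activations march right one PE
    per beat, but folding the chain gives the same row partial sum."""
    psum = 0.0
    for pe, act in zip(pe_row, acts):
        _, psum = pe.step(act, psum, array_full=True)
    return psum

# One output pixel: K row partial sums combined by the K-1 edge adders.
K = 3
kernel = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
window = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
pes = [[PE(weight=kernel[i][j]) for j in range(K)] for i in range(K)]
row_sums = [row_partial_sum(pes[i], window[i]) for i in range(K)]
result = sum(row_sums)                       # the K-1 edge additions
assert result == sum(kernel[i][j] * window[i][j]
                     for i in range(K) for j in range(K))
```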
Besides the structure described above, the adder section may also be implemented as an adder tree.
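As a sketch of this alternative (illustrative code), a balanced adder tree uses the same K-1 additions as the chain of edge adders but reduces the number of sequential addition levels from K-1 to roughly log2 K:

```python
def adder_chain(values):
    """K-1 edge adders in series: K-1 sequential addition steps."""
    total = values[0]
    for v in values[1:]:
        total += v
    return total

def adder_tree(values):
    """Balanced addition tree: still K-1 adders, but about log2(K) levels deep."""
    level = list(values)
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:               # an odd element passes straight down
            nxt.append(level[-1])
        level = nxt
    return level[0]

assert adder_chain([4.0, 5.0, 16.0]) == adder_tree([4.0, 5.0, 16.0]) == 25.0
```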
The above description is only a preferred embodiment of the present invention and is not intended to limit it in any way. Any person skilled in the art may, using the methods and technical content disclosed above, make possible changes and modifications to the technical solution of the invention, or modify it into an equivalent embodiment, without departing from the scope of the technical solution of the present invention. Therefore, any simple modification, equivalent change, or refinement made to the above embodiments in accordance with the technical substance of the present invention still falls within the scope of the technical solution of the present invention.

Claims (4)

1. A neural network acceleration array based on a weight circulation data stream, wherein the PE array has the size of the convolution window, and the PE units are connected horizontally for data movement and interconnected in vertical rings for cyclic weight movement;
the weight values of the convolution kernel are preset into the PE array; before the input feature map data fills the PE array for the first time, the input feature map is output directly to the PE unit on the right side without passing through a multiplier; once the input feature map data has filled the PE array, the convolution operation is performed;
the convolution window has a size of K×K; image rows 1 to K are input simultaneously and shifted right by the stride until the input feature map data fills the PE array for the first time, and the convolution operations are performed in sequence; after image rows 1 to K have been traversed, one image row is updated, and the remaining K-1 rows that are not updated continue to circulate as inputs; each time the image rows are updated, the weights of the PE array are cyclically shifted down by one whole row and stored in the weight registers of the individual PE units of each row, and the K image rows of the PE array are again shifted right as a whole to be convolved; when all image rows have been updated, the PE array completes the convolution of the last K rows of the input feature map and the operation ends;
during the right shift of the input feature map, the weights keep their positions in the array unchanged, and the array performs one convolution each time the data shifts by one position;
when the input feature map data has filled the PE array and the convolution operation begins, the input feature map and the weight are fed to a multiplier to generate a partial sum; the partial sum is accumulated with the partial-sum signal input from the left side to generate a forward partial sum, which is input to the PE unit on the right side; when the partial sums of each row reach the rightmost edge of the PE array, the final convolution result is obtained by combining them with K-1 edge adders.
2. The neural network acceleration array of claim 1, wherein, apart from the initial K image rows that are input simultaneously, the image rows do not stream in once per beat as in a systolic array; instead, an image row is updated only once per cycle, after the PE array has traversed and convolved the current K rows.
3. The neural network acceleration array of claim 1, wherein the PE units are comprised of at least one data register, a multiplier, and an adder.
4. The neural network acceleration array of claim 1, wherein the PE units have control signals to control data flow in addition to data transmission signals.
CN202211141844.9A 2022-09-20 2022-09-20 Neural network acceleration array based on weight circulation data stream Active CN115936064B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211141844.9A CN115936064B (en) 2022-09-20 2022-09-20 Neural network acceleration array based on weight circulation data stream

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211141844.9A CN115936064B (en) 2022-09-20 2022-09-20 Neural network acceleration array based on weight circulation data stream

Publications (2)

Publication Number Publication Date
CN115936064A CN115936064A (en) 2023-04-07
CN115936064B true CN115936064B (en) 2024-09-20

Family

ID=86654617

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211141844.9A Active CN115936064B (en) 2022-09-20 2022-09-20 Neural network acceleration array based on weight circulation data stream

Country Status (1)

Country Link
CN (1) CN115936064B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107578098A (en) * 2017-09-01 2018-01-12 中国科学院计算技术研究所 Neural network processor based on systolic arrays

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10438117B1 (en) * 2015-05-21 2019-10-08 Google Llc Computing convolutions using a neural network processor
CN107918794A (en) * 2017-11-15 2018-04-17 中国科学院计算技术研究所 Neural network processor based on computing array
CN110674927A (en) * 2019-09-09 2020-01-10 之江实验室 Data recombination method for pulse array structure
CN113869507B (en) * 2021-12-02 2022-04-15 之江实验室 Neural network accelerator convolution calculation device and method based on pulse array

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107578098A (en) * 2017-09-01 2018-01-12 中国科学院计算技术研究所 Neural network processor based on systolic arrays

Also Published As

Publication number Publication date
CN115936064A (en) 2023-04-07

Similar Documents

Publication Publication Date Title
Nguyen et al. Layer-specific optimization for mixed data flow with mixed precision in FPGA design for CNN-based object detectors
CN109409511B (en) Convolution operation data flow scheduling method for dynamic reconfigurable array
CN111967468B (en) Implementation method of lightweight target detection neural network based on FPGA
Guo et al. Software-hardware codesign for efficient neural network acceleration
CN107704916B (en) Hardware accelerator and method for realizing RNN neural network based on FPGA
US20180157969A1 (en) Apparatus and Method for Achieving Accelerator of Sparse Convolutional Neural Network
CN110263925B (en) Hardware acceleration implementation device for convolutional neural network forward prediction based on FPGA
CN109409510B (en) Neuron circuit, chip, system and method thereof, and storage medium
Zainab et al. Fpga based implementations of rnn and cnn: A brief analysis
Kästner et al. Hardware/software codesign for convolutional neural networks exploiting dynamic partial reconfiguration on PYNQ
CN111105023B (en) Data stream reconstruction method and reconfigurable data stream processor
CN113033794B (en) Light weight neural network hardware accelerator based on deep separable convolution
CN110674927A (en) Data recombination method for pulse array structure
CN112950656A (en) Block convolution method for pre-reading data according to channel based on FPGA platform
Lym et al. FlexSA: Flexible systolic array architecture for efficient pruned DNN model training
Que et al. Recurrent neural networks with column-wise matrix–vector multiplication on FPGAs
CN115423081A (en) Neural network accelerator based on CNN _ LSTM algorithm of FPGA
CN116822600A (en) Neural network search chip based on RISC-V architecture
Aung et al. Deepfire2: A convolutional spiking neural network accelerator on fpgas
CN111506344A (en) Deep learning hardware system based on systolic array architecture
CN115936064B (en) Neural network acceleration array based on weight circulation data stream
Kim et al. EPU: An Energy-Efficient Explainable AI Accelerator With Sparsity-Free Computation and Heat Map Compression/Pruning
Zhang et al. Machine Learning Hardware Design for Efficiency, Flexibility, and Scalability [Feature]
Mahajan et al. Review of Artificial Intelligence Applications and Architectures
CN113869494A (en) Neural network convolution FPGA embedded hardware accelerator based on high-level synthesis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant