WO2023173639A1

WO2023173639A1 - Method executed by accelerator, and electronic device

Info

Publication number: WO2023173639A1
Application number: PCT/CN2022/107061
Authority: WO
Inventors: 杨经纬; 葛建明; 李甲; 桑永奇; 谢钢锋; 姚飞; 仇小钢
Original assignee: 海飞科(南京)信息技术有限公司
Priority date: 2022-03-14
Filing date: 2022-07-21
Publication date: 2023-09-21
Also published as: CN114579929A; CN114579929B

Abstract

A method executed by an accelerator, and an electronic device. The method comprises: receiving a first tensor multiplication instruction for a first thread of an accelerator (802); a first thread set broadcasting a second factor set in a second tensor to a second thread set on the basis of a logical memory address for the second tensor; and a first thread in the second thread set performing a dot product operation on a first factor set and the second factor set on the basis of a first factor register representation, so as to generate a first dot product set in a first row of a third tensor. A matrix is decomposed and threads are allocated according to rows, such that a plurality of threads can process a plurality of rows of a matrix tensor in parallel, thereby improving the processing efficiency of matrix multiplication. In addition, a programmer knows the row-column structure of a matrix tensor and thread condition in an accelerator during programming, therefore the programmer can flexibly use the threads to process matrix multiplication in parallel, thereby improving the programming flexibility.

Description

Accelerator execution methods and electronic equipment

Technical field

Embodiments of the present disclosure relate generally to the field of electronics, and more specifically to a method performed by an accelerator and an accelerator.

Background technique

Parallel high-performance multi-threaded multi-core processing systems such as graphics processing units (GPUs) process data much faster than in the past. These processing systems can break down complex calculations into smaller tasks and process them in parallel across multiple cores to increase processing efficiency and reduce processing time.

In some situations, multi-core processors such as GPUs are particularly advantageous for processing tensors with large amounts of data in the same or similar form. Tensor data usually represents one-dimensional or multi-dimensional array data in the computer field. For example, image data is a conventional two-dimensional tensor data, which can be represented by a two-dimensional array. For another example, a color image is a three-dimensional array data. In addition to a two-dimensional pixel array including width and height, the color image also includes red, green, and blue (RGB) channel dimensions. Processing tensors such as two-dimensional arrays can include matrix multiplication, for example. Conventional matrix multiplication based on internal accelerators such as GPUs is usually not available to programmers, so programmers often do not understand the process of hardware performing matrix multiplication, and therefore cannot optimize the calculation of matrix multiplication for hardware, which leads to program execution Efficiency and efficiency of tensor processing are generally lower.

Contents of the invention

Embodiments of the present disclosure provide a method and electronic apparatus for execution by an accelerator.

In a first aspect, a method performed by an accelerator is provided. The method includes receiving a first tensor multiplication instruction for a first thread set of the accelerator, the first tensor multiplication instruction including a first thread indication for the first thread set, a first factor register representation for the first tensor, a memory logical address for the second tensor, and a first product register representation for the third tensor; the first set of threads broadcasts a second set of factors in the second tensor to the second set based on the memory logical address for the second tensor. Thread set, the second thread set is different from the first thread set; the first thread in the second thread set performs a dot product of the first factor set and the second factor set in the first row of the first tensor based on the first factor register representation operations to generate a first set of point products in the first row of the third tensor; and accumulating, by the first thread, the first set of point products into a first set of product registers corresponding to the first product register representation. By decomposing the matrix and assigning threads by rows, multiple threads can process multiple rows of the matrix tensor in parallel, thereby speeding up the processing of matrix multiplications. In addition, because programmers know the row-column structure of matrix tensors and the thread status in the accelerator when programming, they can flexibly use threads to process matrix multiplications in parallel, thereby improving programming flexibility.

In a possible implementation, the first factor set includes at least part of the factor data in the first row of the first tensor. The second factor set includes factor data for at least a portion of the second tensor. The first point product set includes at least a portion of the product data in the first row of the third tensor.

In a possible implementation, each thread includes a first set of registers and a second set of registers, wherein the first set of registers is used to store at least a part of the data in a row of the first factor matrix, and the second set of registers is used to store A row of data in the product matrix. Data in a column of the second factor matrix may be stored in on-chip memory, L1 cache, or off-chip memory. In this way, during the execution of matrix multiplication, the execution unit of the first thread can only read the data in one row of the first factor matrix from the first set of registers once, and perform the subsequent dot product operation on each column of the second factor matrix. reused. Furthermore, data in a column of the second factor matrix may be broadcast in parallel to execution units in multiple (eg, the same number or half the number of rows of the first factor matrix) threads and reused. In this way, the transmission of data between different storage devices can be reduced, thereby reducing the time delay caused by data transmission during matrix multiplication calculations.

In one possible implementation, the method further includes, in response to receiving the second set of factors, a second thread in the second set of threads converting the third factor in the second row of the first tensor based on the first factor register representation. performing a dot product operation on the set and the second set of factors to generate a second dot product set in the second row of the third tensor; and accumulating, by the second thread, the second dot product set corresponding to the first product register representation in the second set of product registers.

In a possible implementation, the first tensor multiplication instruction further includes a first merge calculation mode indication. Generating the first set of point products in the first row of the third tensor includes: combining, by the first thread, the first set of factors and the second factor in the first row based on the first merge calculation mode indication and the first factor register representation. The dot product operation is performed on the sets to produce the first set of dot products in the first row of the third tensor.

In a possible implementation, the method further includes: based on the first combined calculation mode indication and the first factor register representation, using a third thread in the first thread set to combine the first factor set and the fourth part of the second tensor. perform a dot product operation on the set of factors to generate a third set of dot products in the first row of the third tensor, the fourth set of factors being different from the set of second factors, and the third set of dot products being different from the set of first dot products; and The third point product is accumulated by the third thread into a third set of product registers corresponding to the first product register representation.

In a possible implementation, the first tensor multiplication instruction further includes a second merge calculation mode indication. Generating the first point product set in the first row of the third tensor includes: based on the second merge calculation mode indication and the first factor register representation, the first thread combines the first factor set in the first row with the second tensor The second set of factors is dot producted to produce the first set of dot products in the first row of the third tensor.

In a possible implementation, the method further includes: based on the second combined calculation mode indication and the first factor register representation, the fourth thread in the second thread set combines the fifth factor set and the sixth factor set of the second tensor. The factor sets are dot producted to produce the fourth dot product set in the first row of the third tensor, the fifth factor set is different from the first factor set, the sixth factor set is different from the second factor set, the fourth point The product set is different from the first dot product set; and the fourth dot product set is accumulated by the fourth thread to the first set of product registers corresponding to the first product register representation.

In one possible implementation, the first tensor multiplication instruction also includes a transpose instruction. Generating the first dot product set in the first row of the third tensor includes, based on the transpose indication and the first factor register representation, combining the first set of factors in the first row and the second set of factors in the second tensor by the first thread. The set of factors are dot producted to produce the first set of dot products in the first row of the third tensor.

In a possible implementation, based on the transposition indication and the first factor register representation, the first thread performs a dot product operation on the first factor set in the first row and the second factor set in the second tensor to generate the The first dot product set in the first row of three tensors consists of: loading the factors of multiple rows in the second tensor into cache based on the transpose directive and the memory logical address; selecting from the factors of multiple rows by column factors to form a second factor set; and based on the first factor register representation, the first thread performs a dot product operation on the first factor set and the second factor set in the first row to generate the first row of the third tensor The first point accumulation of .

In a possible implementation, multiple unselected factors in multiple rows are retained in the cache until the multiple unselected factors are selected for calculation of matrix multiplication.

In a possible implementation, the first thread set provides the second factor set corresponding to the memory logical address in parallel to the computing units in all threads in the second thread set in the form of broadcast, but not to all threads. register.

In a possible implementation, the memory logical address includes segment reference data and offset data. The segment reference data represents the starting address in the second tensor, and the offset data represents each dimension in the multiple dimensions of the second tensor. offset on.

In a possible implementation, the first product register represents one or more product registers, the number of one or more product registers is related to the combined calculation mode and the number of columns of the second tensor, and the product registers of different threads Constituting a result tensor, each thread's product register contains part or all of each row of the result tensor; and the result tensor has the same number of rows as the first tensor, and the result tensor has the same number of columns as the second tensor The number of columns is the same.

In a possible implementation, the number of product registers in the threads in the second thread set is variable, and the number of product registers depends on the execution conditions of the first tensor multiplication instruction, and the execution conditions determine the value of the multiplication instruction in the second tensor. Column access; and if the first column in the second tensor is not accessed, the first column in the second tensor does not participate in the matrix multiplication calculation.

In a possible implementation, the first tensor multiplication instruction is issued multiple times, where the first tensor multiplication instruction is issued for the first time in the form of a storage instruction to obtain column data or rows in the second tensor. data; and in response to obtaining the column data or row data in the second tensor, and the data of the first tensor has been stored in the first factor register, the first tensor multiplication instruction is quadratic or in the form of a mathematical calculation instruction. Emit multiple times to perform the calculation of the results of each column within the row of the third tensor.

In a possible implementation, before performing the second or multiple transmissions, the corresponding token status of the first factor register is checked; if the token status indicates that the data of the first tensor has been stored in the first factor register , it is issued in the form of mathematical calculation instructions, otherwise the emission queue is blocked until the data of the first tensor has been stored in the first factor register.

In one possible implementation, based on the first product register representation, it is determined whether the product register usage range for the third tensor exceeds the range of the register file within a single thread; and if the product register usage range for the third tensor is determined If the scope exceeds the scope of the register file within a single thread, calculation operations or memory access operations that exceed the scope of the register file are ignored and an error is reported.

According to a second aspect of the present disclosure, an electronic device is provided. The electronic device includes: a stream processor; a page table device coupled to the stream processor; a memory; a processing engine unit coupled to the stream processor, the memory and the page table device configured to perform the method according to the first aspect.

According to a third aspect of the present disclosure, an electronic device is provided. The electronic device includes: a receiving unit configured to receive a first tensor multiplication instruction for a first thread set of the accelerator, the first tensor multiplication instruction including a first thread indication for the first thread set, for the first tensor a first factor register representation of, a memory logical address for the second tensor, and a first product register representation for the third tensor; a broadcast unit configured to be configured by the first set of threads based on the memory logic address for the second tensor The address broadcasts a second set of factors in the second tensor to a second set of threads, the second set of threads being different from the first set of threads; the generation unit is configured to represent, by the first thread in the second thread set, the first factor register based on the first set of threads. perform a dot product operation on the first factor set and the second factor set in the first row of the first tensor to generate the first dot product set in the first row of the third tensor; and a storage unit configured to be A thread accumulates the first point product into a first set of product registers corresponding to the first product register representation. By decomposing the matrix and allocating threads by rows, multiple threads can process multiple rows of the matrix tensor in parallel, thereby speeding up the processing efficiency of matrix multiplication. In addition, because programmers know the row-column structure of matrix tensors and the thread status in the accelerator when programming, they can flexibly use threads to process matrix multiplications in parallel, thereby improving programming flexibility.

In a possible implementation, each thread includes a first set of registers and a second set of registers, wherein the first set of registers is used to store at least a part of the data in a row of the first factor matrix, and the second set of registers is used to store A row of data in the product matrix. The data in a column of the second factor matrix may come from on-chip memory, L1 cache, or off-chip memory. In this way, during the execution of matrix multiplication, the execution unit of the first thread can read the data in one row of the first factor matrix from the first set of registers only once, and reuse it during subsequent dot product operations. Furthermore, data in a column of the second factor matrix may be broadcast in parallel to execution units in multiple (eg, the same number or half the number of rows of the first factor matrix) threads and reused. In this way, data transmission between different storage devices can be reduced, thereby reducing the time caused by data transmission during matrix multiplication calculations.

In one possible implementation, the generation unit is further configured to, in response to receiving the second set of factors, a second thread in the second set of threads convert the first element in the second row of the first tensor based on the first factor register representation. The three-factor set and the second factor set are dot producted to produce the second dot product set in the second row of the third tensor. Storage unit 908 is further configured to accumulate, by the second thread, the second dot product set into a second set of product registers corresponding to the first product register representation.

In a possible implementation, the first tensor multiplication instruction further includes a first merge calculation mode indication. The generation unit is further configured to: based on the first combined calculation mode indication and the first factor register representation, the first thread performs a dot product operation on the first factor set and the second factor set in the first row to generate a third The first point in the first row of the quantity accumulates.

In a possible implementation, the generation unit is further configured to use a third thread in the first thread set to combine the first factor set and the second tensor of the second tensor based on the first combined calculation mode indication and the first factor register representation. The four-factor set performs a dot product operation to produce a third dot product set in the first row of the third tensor, the fourth factor set is different from the second factor set, and the third dot product set is different from the first dot product set. The storage unit is further configured to accumulate, by the third thread, a third point product into a third set of product registers corresponding to the first product register representation.

In a possible implementation, the first tensor multiplication instruction also includes a second merge calculation mode indication. The generation unit is further configured to perform a dot product operation on the first factor set in the first row and the second factor set in the second tensor based on the second merge calculation mode indication and the first factor register representation by the first thread to generate The first point product in the first row of the third tensor.

In a possible implementation, the generation unit is further configured to use the fourth thread in the second thread set to combine the fifth factor set and the second tensor of the second tensor based on the second combined calculation mode indication and the first factor register representation. The dot product operation is performed on the set of six factors to produce the fourth dot product set in the first row of the third tensor, the fifth factor set is different from the first factor set, the sixth factor set is different from the second factor set, and the fourth The point product set is different from the first point product set. The storage unit is further configured to accumulate, by the fourth thread, a fourth point product set to the first set of product registers corresponding to the first product register representation.

In one possible implementation, the first tensor multiplication instruction also includes a transpose instruction. The generation unit is further configured to: based on the transpose indication and the first factor register representation, perform a dot product operation on the first factor set in the first row and the second factor set in the second tensor by the first thread to generate a third The first point product in the first row of the tensor.

In a possible implementation, the generation unit is further configured to: load the factors of the plurality of rows in the second tensor into the cache based on the transposition indication and the memory logical address; select from the factors of the plurality of rows by column factors to form a second factor set; and based on the first factor register representation, the first thread performs a dot product operation on the first factor set and the second factor set in the first row to generate the first row of the third tensor The first point accumulation of . In a possible implementation, multiple unselected factors in multiple rows are retained in the first-level cache until the multiple unselected factors are selected for calculation of matrix multiplication.

In a possible implementation manner, the first thread set provides the second factor set corresponding to the memory logical address to all threads in the second thread set in parallel in the form of broadcast.

In a possible implementation, the memory logical address includes segment reference data and offset data. The segment reference data represents the starting address of the second tensor, and the offset data represents each dimension of the second tensor in multiple dimensions. offset on.

In a possible implementation, the accelerator further includes a checking unit configured to check the corresponding token status of the first factor register before performing the second or multiple transmissions; if the token status represents the first If the amount of data has been stored in the first factor register, it is issued in the form of a mathematical calculation instruction, otherwise the emission queue is blocked until the first amount of data has been stored in the first factor register.

In a possible implementation, the accelerator further includes an out-of-bounds checking unit. The out-of-bounds checking unit is configured to determine, based on the first product register representation, whether the product register usage range for the third tensor exceeds the range of a register file within a single thread; and if it is determined that the product register usage range for the third tensor exceeds the range of a single The range of the register file within the thread, then calculation operations or memory access operations that exceed the range of the register file are ignored and an error is reported.

According to the method and electronic device of the embodiments of the present disclosure, programmers can consider thread task allocation from a matrix perspective, so that one or more threads can be used to calculate the dot product of one row of the first factor matrix and the second factor matrix, and The corresponding results are accumulated into the product register within the same thread, thereby increasing programming flexibility for matrix multiplication and improving the execution efficiency of matrix multiplication.

Description of the drawings

The above and other objects, features and advantages of the present disclosure will become more apparent by describing the exemplary embodiments of the present disclosure in more detail with reference to the accompanying drawings, wherein, in the exemplary embodiments of the present disclosure, the same reference numerals generally represent Same parts.

1 illustrates a schematic diagram of an example environment in which various embodiments of the present disclosure can be implemented;

Figure 2 shows a schematic block diagram of a chip according to an embodiment of the present disclosure;

Figure 3 shows a schematic block diagram of a three-dimensional tensor according to one embodiment of the present disclosure;

Figure 4 shows a schematic diagram of page allocation of image data according to one embodiment of the present disclosure;

Figure 5 shows a schematic diagram of matrix multiplication according to one embodiment of the present disclosure;

Figure 6 shows a schematic diagram of a portion of matrix multiplication according to one embodiment of the present disclosure;

7 shows a schematic diagram of a portion of matrix multiplication according to another embodiment of the present disclosure;

8 shows a schematic flow diagram of a method performed by an accelerator according to one embodiment of the present disclosure; and

Figure 9 shows a schematic block diagram of an electronic device according to one embodiment of the present disclosure.

Detailed ways

Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

As used herein, the term "include" and its variations mean an open inclusion, ie, "including but not limited to." Unless otherwise stated, the term "or" means "and/or". The term "based on" means "based at least in part on." The terms "one example embodiment" and "an embodiment" mean "at least one example embodiment." The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," etc. may refer to different or the same object. Other explicit and implicit definitions may be included below.

As mentioned earlier, regular matrix multiplication based on internal hardware accelerators such as GPUs is usually not available to programmers. Therefore, programmers usually do not understand the process of hardware performing matrix multiplication, and therefore cannot calculate matrix multiplication for hardware. Optimizations are performed, which results in program execution and tensor processing being generally less efficient.

In some embodiments of the present disclosure, programmers can consider thread task allocation from the perspective of the row-column structure of the matrix, so that one or more threads can be used to calculate the dot product of a row of the first factor matrix and the second factor matrix, and The corresponding result finds the product register within the same thread, thereby increasing programming flexibility for matrix multiplication and improving the execution efficiency of matrix multiplication.

Figure 1 shows a schematic diagram of an example environment 100 in which various embodiments of the present disclosure can be implemented. Example environment 100 may be, for example, an electronic device with computing capabilities, such as a computer. In one embodiment, example environment 100 includes, for example, central processing unit (CPU) 20, system memory 10, northbridge/memory bridge 30, accelerator 40, device memory 50, and southbridge/input-output (IO) bridge 60. System memory 10 may be, for example, volatile memory such as dynamic random access memory (DRAM). The northbridge/memory bridge 30, for example, integrates a memory controller, a PCIe controller, etc., and is responsible for data exchange between the CPU 20 and the high-speed interface and bridging the CPU 20 and the southbridge/IO bridge 60. Southbridge/IO bridge 60 is used for low-speed interfaces of computers, such as Serial Advanced Technology Interface (SATA) controllers, etc. The accelerator 40 may include, for example, a device or chip such as a graphics processing unit (GPU) and/or an artificial intelligence (AI) accelerator for accelerating processing of graphics, video, and other data. In one embodiment, accelerator 40 may be a GPU. In another embodiment, accelerator 40 may be an AI chip. Device memory 50 may be, for example, volatile memory external to accelerator 40 such as DRAM. In this disclosure, device memory 50 is also referred to as off-chip memory, i.e., memory located outside the chip of accelerator 40. Relatively speaking, the accelerator 40 also has volatile memory inside the chip, such as a level one (L1) cache (cache) and an optional level two (L2) cache. This will be described in detail below in connection with some embodiments of the present disclosure. Although an example environment 100 in which various embodiments of the disclosure can be implemented is shown in FIG. 1 , the disclosure is not limited thereto. Some embodiments of the present disclosure may also be used in some application environments such as ARM architecture and RISC-V architecture with accelerators such as GPUs.

Figure 2 shows a schematic block diagram of an accelerator 200 according to one embodiment of the present disclosure. The accelerator 200 may be, for example, a specific implementation of the chip of the accelerator 40 in FIG. 1 . The accelerator 200 is, for example, an accelerator chip such as a GPU. In one embodiment, accelerator 200 includes stream processor (SP) 210, page table device 220, processing engine (PE) unit 230, direct memory access (DMA) controller 240, L1 cache (cache) 260, and L2 cache. Cache 250.

The accelerator 200 is controlled by a host device such as the CPU 20 and receives instructions from the CPU 20. The SP 210 analyzes instructions from the CPU 20 and assigns the analyzed operations to the PE unit 230, the page table device 220, and the DMA controller 240 for processing. The page table device 220 is used to manage the on-chip virtual storage of the accelerator 200 . In the present disclosure, L2 cache 250 and off-chip memory such as device memory 50 in FIG. 1 constitute a virtual storage system. Page table device 220 is jointly maintained by SP 210, PE unit 230 and DMA controller 240.

The PE unit 230 includes a plurality of processing engines (PE) PE_1, PE_2...PE_N, where N represents an integer greater than 1. Each PE in PE unit 230 may be a single instruction multi-thread (SIMT) device. In PE, each thread can have its own register file, and all threads of each PE also share a unified register file. Multiple PEs can perform the same or different processing work in parallel, and can perform address translation and access to target data in the memory as described below in parallel, thereby reducing processing time. It can be understood that the target elements processed by multiple PEs are not the same, and the segments, pages, cache lines where the target elements are located, and the attributes, sizes, dimension ordering, etc. of the elements may be different, as described in detail below.

In one embodiment, the logical address of the target element can be expressed as seg:RF:imm, where seg represents the segment base address register, RF represents the offset register, and imm represents the offset immediate value. From a tensor perspective, the logical address can include the reference data and offset data of the target element in each dimension of the first tensor. The offset data represents the offset of the target element on each of the multiple dimensions of the first segment, and the segment reference data is the address of the segment starting point.

In one embodiment, the first segment includes at least one page, and the accelerator 200 may convert the logical address into a linear address based at least on dimensions of each dimension of the target element page. The linear address includes the one-dimensional page identifier of the target element page and the one-dimensional offset value of the target element within the target element page. Specifically, the accelerator 200 can obtain the page number offset of the target element in each dimension according to the page size of the page in each dimension in the first segment, thereby obtaining the one-dimensional identification of the page where the target element is located. For example, the target element is located at the top level of the tensor in Figure 3, and the page identifier of the target element can be determined to be P[1] through the above method.

In addition, the accelerator can also obtain the relative offset of the target element in each dimension within the page, and based on this, determine the one-dimensional linear offset of the target element relative to the starting position of the page. The one-dimensional identifier of the page and the one-dimensional linear offset within the page together constitute the linear address of the target element.

The accelerator 200 converts the linear address into a physical address based on the page table entry for the target element page, which page table entry includes the page physical address of each page in at least one page. Specifically, in one embodiment, after obtaining the page identifier of the target element, the accelerator 200 can search the corresponding item in the page table device 220 according to the page identifier to obtain the physical address of the page. This physical address plus the one-dimensional linear offset of the target element on the target element page is the physical address of the target element. The physical address may represent the storage address of the target element on the off-chip device memory 50 or on-chip memory, such as the L2 cache 250 . Alternatively, the page table entry of the target element page can also store the physical address relative to other pages, and the target is obtained based on the offset of the target element page relative to other pages, the physical addresses of other pages, and the one-dimensional linear offset The physical address of the element.

In addition to the physical address, page table entries can also include other attributes, such as status, which indicates whether the page has been loaded, that is, whether it is available. This disclosure does not limit this. Although two-level translation of addresses is shown here, the disclosure is not limited thereto. Alternatively, more stages of conversion are possible. For example, the page offset, cache line offset, and element offset are calculated hierarchically, and are sequentially added to the physical address to obtain the final physical address of the target element.

In one embodiment, the accelerator 200 moves the first page of the plurality of pages from the off-chip memory into the on-chip memory, and establishes a first page table entry corresponding to the first page. The first page table entry stores the first page. Physical address in memory. If the first page of the plurality of pages is moved from the memory to the off-chip memory, the accelerator 200 may delete the first page table entry corresponding to the first page.

The accelerator converts the logical address of the target element in the first segment S1 to a physical address in on-chip virtual memory. On-chip virtual memory may include on-chip L2 cache 250 and off-chip device memory 50 . The logical address includes the segment reference data and offset data of the first segment in the tensor. The segment reference data and offset data respectively represent the base address and offset of the target element in each of the multiple dimensions of the first segment.

Each thread can perform thread-level data exchange between its own register file and memory subsystem. Each thread has its own arithmetic logic execution unit and uses its own storage address, which uses a typical load-store architecture. Each execution unit includes a floating-point/fixed-point unit that supports multiple data types and an arithmetic logic unit.

Most instructions perform arithmetic and logical operations, such as addition, subtraction, multiplication, and division of floating-point and fixed-point numbers, or logical AND, OR, NOT, etc. The operands come from registers. Memory read and write instructions can provide data exchange between registers and on-chip/off-chip memory. Generally, all execution units in a PE can execute the same instructions synchronously. By using the predicate register, part of the execution unit can be masked to implement the function of the branch instruction.

In one embodiment, the accelerator 200 of FIG. 2 may, for example, perform the following operations: 1) assemble page table entry content and initial state; 2) transfer data on an off-chip memory such as the device memory 50 in FIG. 1 to On-chip memory, such as L2 cache 250; 3) Start and execute the program; 4) Define each segment and describe the tensor and stored attributes; 5) When the program execution is completed, write the execution result data off-chip memory.

It can be understood that in the disclosed embodiments, the data processed by the accelerator 200 is mainly aimed at multi-dimensional tensors. For example, in one embodiment, the tensor may be a four-dimensional tensor with four dimensions D1, D2, D3, and D4, and the tensor may have different dimensions in each dimension. In other embodiments, the tensor may be a one-dimensional, two-dimensional, three-dimensional or more dimensional tensor, which is not limited by this disclosure.

In addition, in the embodiment of the present disclosure, the tensor can internally support such as uint8, int8, bfloat16, float16, uint16, int16, float32, int32, uint32 and other custom element types, and the present disclosure does not limit this. For tensor addressing, the basic unit is element. For example, if the element type is int8, the element is in bytes. For another example, if the element type is int16, the basic unit of addressing is double bytes, and so on.

In some cases, the amount of data contained in the tensor may be large, and the capacity of the L2 cache 250 is limited, so the entire tensor cannot be loaded into the on-chip L2 cache 250 . In some embodiments of the present disclosure, to facilitate parallel processing of the tensor, the tensor may be divided into at least one segment. In the case where a tensor consists of only one segment, the tensor is a segment. In the case of a tensor containing multiple segments, the segments are part of the tensor. The CPU 20 can specify which PE to process each part of the segment through instructions.

Figure 3 shows a schematic block diagram of a three-dimensional tensor 300 according to one embodiment of the present disclosure. The three-dimensional tensor 300 has three dimensions D1, D2, and D3, and includes a first segment S1, a second segment S2, and a third segment S3. CPU 20 may specify that the tensor elements of segment S1 are processed by PE_1, PE_2, PE_3, PE_4, PE_5, PE_6, PE_7, and PE_8. In addition, CPU20 also specifies that the tensor elements of the second segment S2 are processed by PE_1-PE_4. In embodiments of the present disclosure, each segment may have different sizes, so programmers can flexibly configure the segments based on design needs. In fact, page division can be implemented in any one or more dimensions, and the number of pages divided in each dimension is independent of each other.

In one embodiment, tensor data may be stored in on-chip high-speed memory, such as L2 cache 250. However, due to the small capacity of the on-chip high-speed memory, when the tensor size is large, programmers can divide the tensor into multiple segments, and each segment describes a part of the tensor. The core program (kernel) can be started multiple times. Each time, the DMA controller 240 transfers a segment of the tensor from off-chip storage to on-chip storage in advance and makes it available for kernel operations. After starting the kernel multiple times, all segments contained in the tensor are processed and the entire running process ends. When the on-chip high-speed memory is enough to accommodate all the tensors that the kernel needs to access, a tensor only needs one segment description, and the kernel only needs to be started once.

Further, in some embodiments of the present disclosure, within a segment, at least one page can also be set to further subdivide the tensor. For example, in the first segment S1, there are 4 pages P[1], P[2], P[3] and P[4]. The second segment S2 has only one page. In embodiments of the present disclosure, each segment may have a different number of pages, so programmers can flexibly configure the size of pages within a segment based on design needs. For example, the page is configured to fit into L2 cache 250 in its entirety.

As mentioned above, when addressing a tensor, the smallest unit of addressing is the element. A page can usually contain multiple elements. The page on which the target element is located is referred to herein as the "target element page". In some embodiments of the present disclosure, a page may include multiple cache lines. While the target element page may be located in the L2 cache 250, if the PE reads the target element via the L1 cache 260, the L2 cache 250 needs to concatenate the physical addresses of a small portion of the L2 cache 250 including the target element. The data is transferred to L1 cache 260 in its entirety. This small part of data is also called cache line data, and this caching mechanism is based on the principle of spatial proximity. It only takes a few clock cycles for the PE to read data from the L1 cache 260, while it may take dozens or even hundreds of clock cycles for the L1 cache 260 to read data from the L2 cache 250. Therefore, it is desirable to reduce the number of times L1 cache 260 reads data from L2 cache 250. Although a "cache line" is used here to describe the smallest unit of data transferred from the L2 cache 250 to the L1 cache 260, in this disclosure, this part of the data may not necessarily be arranged in rows or columns. A "cache line" The data inside is distributed in multiple dimensions, and the size of the data distributed in each dimension is not limited to 1. PE performs parallel processing on the data within a segment. The allocation of PE is expanded in the logical address space of the data and is independent of the physical storage structure of the segment, as described below.

In Figure 3, the first set of cache lines in the first page P[1] is designated to be processed by PE_1, and the second set of cache lines is designated to be processed by PE_2. Although the tensor is shown to be processed by multiple PEs in sequence here, it can be understood that the processing of tensor data is independent of the order of the PEs, and the present disclosure is not limited to this. For example, PE_2 in Figure 3 represents part of the tensor data that can be processed by PE_M, where M represents any integer not greater than N.

FIG. 4 shows a schematic diagram of page allocation of image data 400 according to one embodiment of the present disclosure. Image data is typically a two-dimensional tensor. In one embodiment, the image data 400 is, for example, 8*8 pixels. In other words, image data 400 has 8 pixels in the first dimension D1 and also has 8 pixels in the second dimension D2. Therefore, image data 400 has pixels P00, P01...P77. In the embodiment of Figure 4, the image data 400 has only one segment, but is divided into 4 pages P[1], P[2], P[3] and P[4] in two dimensions. The four pages can be divided according to the second dimension D2 to be allocated to PE_1 and PE_2 for processing, or they can be divided according to the first dimension D1 to be allocated to PE_1 and PE_2 for processing. In addition, it can also be divided diagonally. This disclosure does not limit this.

Figure 5 shows a schematic diagram of matrix multiplication 500 according to one embodiment of the present disclosure. Tensors can typically contain one or more dimensions. Two-dimensional tensors can be thought of as matrices. In some situations, it may be necessary to perform matrix multiplication of two two-dimensional matrices to obtain the product matrix. In the present disclosure, for matrix multiplication C=A×B, matrix C represents a product matrix, matrix A represents a first factor matrix, and matrix B represents a second factor matrix. In Figure 5, the first factor matrix A 502 is multiplied by the second factor matrix B 504 to obtain the product matrix C 506. In the present disclosure, a "dot product operation" may include a multiplication operation of corresponding matrix elements and optionally a product addition operation. Specifically, the first factor matrix 502 may be an m×k matrix, and the second factor matrix 504 may be a k×n matrix, where m, k and n all represent positive integers. According to the rules of matrix multiplication, the product matrix is therefore an m×n matrix. It can be seen that the first factor matrix 502 includes m rows and k columns, the second factor matrix 504 includes k rows and n columns, and the product matrix therefore includes m rows and n columns.

When performing matrix multiplication, you can perform a dot product operation on the first row A[1][1]...A[1][k] and B[1][1]...B[k][1] to get C[ 1][1]. Specifically, C[1][1] can be expressed by the following formula (1):

C[1][1]＝A[1][1]×B[1][1]+A[1][2]×B[2][1]…+A[1][k]×B [k][1] (1)

Similarly, a dot product operation can be performed to obtain C[m][1] and C[m][n]. C[m][1] and C[m][n] can be expressed by the following formulas (2) and (3):

C[m][1]＝A[m][1]×B[1][1]+A[m][2]×B[2][1]…+A[m][k]×B [k][1] (2)

C[m][n]＝A[m][1]×B[1][n]+A[m][2]×B[2][n]…+A[m][k]×B [k][n] (3)

It can be seen that matrix C includes m×n matrix elements, and each matrix element is formed by adding k product results. In the present disclosure, for the above-mentioned product matrix C=A×B, the product result represents the result of multiplying one matrix element of matrix A and one matrix element of matrix B, and the dot product result represents multiple matrix elements of matrix A. The result obtained by multiplying the corresponding multiple matrix elements in matrix B and adding the multiple product results.

Figure 6 shows a schematic diagram of matrix multiplication 600 according to one embodiment of the present disclosure. In one embodiment, product matrix C 602 may include m rows and n columns, with each row corresponding to a thread. Each thread includes n registers for storing n dot product results for each row. When PE is executed, m threads can be executed in parallel to improve execution efficiency. During the specific execution process, all registers corresponding to matrix C can be initialized to 0 first. Taking C[1][1] as an example, as shown in the above formula (1), the calculation of C[1][1] includes k multiplication calculations and k-1 addition operations (actually equivalent to k times Accumulation, because the matrix elements are initialized to 0, the first product element is equivalent to accumulation with 0). Then calculate it sequentially. For example, the first thread first calculates the first product result A[1][1]×B[1][1] of the matrix element C[1][1], and the second thread first calculates it in parallel. The first product of matrix element C[2][1] results in A[2][1]×B[1][1], and so on. That is, all m threads first calculate the first product result of the first matrix element of a row of the corresponding matrix C. It can be understood that at this time, the complete result of the first column of the product matrix C 602 has not been obtained, nor has the calculation of other columns except the first column of each row of the product matrix C 602 been carried out.

The first thread then computes the first product result A[1][1]×B[1][2] of the second column element C[1][2] of the matrix, and the second thread computes the matrix element C in parallel The first product result of [2][2] is A[2][1]×B[1][2], and so on. That is, m threads calculate the first product result of the second matrix element of a row of the corresponding matrix C. At this time, the complete results of the first and second columns of the product matrix C 602 have not been obtained, and the calculation of other columns except the first and second columns of each row of the product matrix C 602 has not been carried out. After M threads perform parallel calculations to the nth round, the first product result of all column matrix elements in each row of the product matrix C602 is obtained. The first thread then computes the second product result A[1][2]×B[2][1] of matrix element C[1][1] and multiplies it with the first product A[1][1 ]×B[1][1] is added, and the second thread first calculates the second product result A[2][2]×B[2][1] of the matrix element C[2][1] in parallel. And add it to the first product A[2][1]×B[1][1], and so on. After M threads perform parallel calculations to the nth round, all columns of matrix C 602 are calculated. . That is, all m threads first calculate the result of adding the second product and the first product of each element of a row of the corresponding matrix C.

By analogy, the k-th product result of each matrix element is calculated and added to the sum of the previous k-1 product results to obtain the final matrix C 604. In other words, during the calculation process, matrix C 604 actually includes k rounds of calculations. Each round calculates a part of each matrix element of matrix C, and accumulates the calculation results with the calculation results of previous rounds in the corresponding register. As shown in Figure 6, each matrix element of matrix C 602 has the same color pattern, which indicates that each matrix element is calculated for the same number of rounds of multiplication and accumulation. Each matrix element of matrix C 604 is the final result obtained after k rounds of accumulation, so the color of each matrix element is darker than the color of matrix C 602.

Although in the embodiment of FIG. 6 , only one product result is calculated at a time and accumulated with the previous results in the register, this is only illustrative and does not limit the scope of the present disclosure. In other embodiments, multiple product results may be calculated in each round and accumulated. For example, the k dimension can be divided into s segments, and the accumulation of the product results within the s segments is calculated each time. For example, in the case of s=k/2, for C[1][1], the first round of calculation can calculate A[1][1]×B[1][1]+A[1][2] ×B[2][1]. After executing s rounds, the complete value of C[1][1] can be obtained. In this way, these computing resources can be used more flexibly based on the allocation of computing resources of the PE unit, thereby giving programmers greater programming flexibility.

Figure 7 shows a schematic diagram of matrix multiplication 700 according to another embodiment of the present disclosure. Different from Figure 6, in Figure 7, multiple threads can first calculate in parallel the accumulation of all product results of each matrix element of matrix C, and then calculate the matrix elements of the next column of matrix C column by column. As shown in Figure 7, the matrix elements in the first column of matrix C 702 have a darker color than the nth column, which indicates that the matrix elements in the first column have been calculated for the same number of rounds of multiplication and accumulation, while the matrix elements in the last column The matrix elements have not been calculated at this time, for example, they still have the initial value 0. Each matrix element of matrix C 704 is the final result obtained after k rounds of accumulation. The color of the matrix element in the first column of matrix C 704 is the same as the color of the matrix element in the first column of matrix C 702. This shows that The first column of matrix C 702 is calculated first, and then the next round of calculation is performed. Similar to the embodiment of Figure 6, the k dimension can also be divided into s segments, and the accumulation of the product results in the s segments is calculated each time.

Although in the embodiments of FIG. 6 and FIG. 7 , each row of the product matrix C is obtained by using one thread to perform matrix calculation, this is only for illustration and does not limit the scope of the present disclosure. When the number of threads is significantly greater than the number of matrix rows, for example, when the number of threads is 2 times, 3 times, or more than the number of rows of the product matrix, 2, 3, or more threads can be used for each row of the product matrix C. More threads are used to calculate the product matrix C, as described below.

Since a row of the product matrix C can be obtained by one or more threads performing matrix calculations, the programmer can calculate the rows of the product matrix C based on the first factor matrix A, the second factor matrix B and the resulting matrix multiplication. Flexible allocation of threads based on the number and number of columns. Specifically, in some embodiments, in the tensor multiplication instruction, each thread can be assigned corresponding information about the first factor matrix A, the second factor matrix B, and the product matrix C to perform part of the matrix multiplication task. To flexibly and efficiently utilize the computing resources in the PE unit. The general concept of matrix multiplication is described above in Figures 5 to 7. Some embodiments of matrix multiplication will be described in detail below in conjunction with Figure 8.

Figure 8 shows a schematic flow diagram of a method 800 performed by an accelerator according to one embodiment of the present disclosure. Method 800 is used to perform matrix multiplication as shown above in conjunction with Figures 5-7. At 802, receive a first tensor multiplication instruction for a first thread set of the accelerator, the first tensor multiplication instruction including a first thread indication for the first thread set, a first factor for a first tensor a register representation, a memory logical address for the second tensor, and a first product register representation for the third tensor. In one embodiment, the electronic device may have two thread sets, wherein the first thread set is used to broadcast the data of matrix B to the computing units of the threads in the second thread set. For example, the first thread set provides the second factor set corresponding to the memory logical address in parallel to all threads or part of the threads in the second thread set in the form of broadcast. In other words, the first set of threads is configured to broadcast data for matrix B, and the first set of threads is configured to perform A × B in response to receiving data for matrix A. Each thread in the second thread set includes a first set of registers for storing at least a portion of the data in a row of the first factor matrix, and a second set of registers for storing at least a portion of the data in a row of the first factor matrix. One row of data.

An illustrative example of a first tensor multiply instruction is for example @p1,mm.R0,ur4:rf290:0x00,R256, where @p1 represents the guard predicate operand associated with the first thread. @p1 can for example be a boolean predicate variable for the first thread. If the predicate value is false, the data load operation for this instruction is not performed. If the predicate value is true, then ur4:rf290:0x00 is used to access the on-chip memory normally, such as L1 cache 260, L2 cache 250 or dynamic random access memory such as DDR (Double Data Rate) memory controlled by DMA 240 ( dynamic random access memory (DRAM), and the first thread set broadcasts the obtained data content to all threads in the second thread set. In other words, execution conditions for each thread can be provided. For threads that do not meet the execution conditions, their memory access is regarded as exceeding the tensor address range and is ignored, or the tensor multiplication operation to be performed by the corresponding thread of the second thread set is abandoned. R0 represents the starting register in the second set of registers used to store each product element of a row in the product matrix C. For example, the registers R0-R255 are used to store each product element of a row in the product matrix C. ur4:rf290:0x00 represents the logical address of the second factor matrix, such as a specific example of the logical address seg:RF:imm of the target element mentioned above. R256 represents the starting register in the first group of registers. The first group of registers is used to store the dot product (multiplication and accumulation of corresponding elements in matrix A and matrix B) operations in a row in the first factor matrix. The correlation matrix elements involved in . In one embodiment, the first set of registers and the second set of registers are located in the same thread, which can reduce power consumption and time of data transmission during the calculation process.

It can be understood that the first product register representation may correspond to one or more product registers. The number of one or more product registers is related to the merge calculation mode and the number of columns of the second tensor, as detailed below. The product registers of the different threads form a result tensor with the same number of rows as the first tensor and the same number of columns as the second tensor. For example, 256 threads can form a resulting tensor with 256 rows. Each thread's product register file includes part or all of each row of the result tensor. For example, each thread's product register file could correspond to one row of the result tensor. In merged computation mode, each thread's product register can correspond to part of a row of the result tensor.

Additionally, it will be appreciated that the number of product registers within the threads in the second thread set may be variable. The number of product registers depends on the execution conditions of the first tensor multiply instruction. The execution condition determines access to the columns in the second tensor. For example, in some situations, only a portion of all product registers within the threads in the second thread set may be used. In other cases, another portion or all of the product registers within the threads in the second thread set are used. If the first column in the second tensor is not accessed, the first column in the second tensor does not participate in the matrix multiplication calculation.

In one implementation, the first tensor multiply instruction may be issued two or more times. In the first issue, the first tensor instruction is issued to the memory system, and the matrix multiply (mm) instruction can be fetched from the cache or instruction section of the accelerator 200 and sent to the pipeline unit of the accelerator 200 , after decoding, it is issued as a regular access instruction, and its access address is seg:RF:imm such as ur4:rf290:0x00. In other words, the first tensor multiplication instruction is issued for the first time in the form of a storage instruction to obtain the column data or row data in the second tensor.

In response to obtaining the column data or row data in the second tensor, and the data of the first tensor has been stored in the first factor register, the first tensor multiplication instruction is issued two or more times in the form of a mathematical calculation instruction. , for performing the calculation of the results of each column in the row of the third tensor.

The accelerator 200 can read the data block registers corresponding to matrix C and matrix A, such as R0-R255 and R256-R257, and then read the data block of the second factor matrix B obtained during the first emission process and perform a dot product operation, And write the temporary calculation result into the corresponding register, such as a register in R0-R255. In this way, during the execution of matrix multiplication, the execution unit of the first thread can read the data in one row of the first factor matrix from the first set of registers only once, and reuse it during subsequent dot product operations. It is understood that in some cases, the range of product register usage for the third tensor may exceed the range of the register file within a single thread. For example, data block registers R0-R255 are not enough to store one row of product data in the third tensor. For example, one row of product data in the third tensor requires 300 data registers to store. In one embodiment, the accelerator 200 may determine whether the product register usage range for the third tensor exceeds the range of the register file within a single thread based on the first product register representation. If it is determined that the product register usage range for the third tensor exceeds the range of the register file within a single thread, calculation operations or memory access operations that exceed the range of the register file are ignored and an error is reported.

In some embodiments, before transmitting again, it is necessary to check the readiness status of the first factor register, specifically check its corresponding token status. If the token status indicates that the first factor is ready, transmit it in the form of a mathematical calculation instruction. , otherwise the transmit queue is blocked until the first factor register is ready. Specifically, before performing the second or more transmissions, the accelerator 200 may check the corresponding token status of the first factor register. If the token status indicates that the data of the first tensor has been stored in the first factor register, then issue a mathematical calculation instruction, otherwise block the emission queue until the data of the first tensor has been stored. in the first factor register.

Since each thread performing parallel mm calculations involves substantially the same matrix element data block of the second factor matrix B, each data block of the second factor matrix B is broadcast to all threads for parallel execution. In one embodiment, the calculation task of a piece of data can be completed in n steps. The calculation starts from the 0th column of the second factor matrix B and the product matrix C, and moves backward one column at a time until all columns are cycled. Each thread can specify an independent column address for the mm instruction, and the data retrieved from each column is broadcast to all threads for calculation.

In one embodiment, the data in a column of the second factor matrix B may come from L1 cache, L2 cache or off-chip memory. In this way, data in a column of the second factor matrix can be broadcast in parallel to execution units in multiple (eg, the same number or half the number of rows of the first factor matrix) threads and reused. In this way, data transmission between different storage devices can be reduced, thereby reducing the time caused by data transmission during matrix multiplication calculations.

At 804, the first set of threads broadcasts the second set of factors in the second tensor to the second set of threads based on the memory logical address for the second tensor, as described above.

At 806, the first thread in the second set of threads performs a dot product operation on the first set of factors and the second set of factors in the first row of the first tensor based on the first factor register representation to generate a first set of factors of the third tensor. Accumulate the first point in the row. Dot product budgeting can include multiplication and addition operations. The first factor register representation is for example R256 and the memory logical address is seg:RF:imm such as ur4:rf290:0x00. In some embodiments, the number of registers within each thread in the second thread set is variable, specifically controlled by the execution condition of the tensor multiplication instruction, which condition controls access to each column in the second tensor. If a certain If the column has not been accessed, the column does not participate in the matrix multiplication calculation, so the product register corresponding to the column does not exist.

It can be understood that in the embodiment of the present disclosure, the matrix multiplication is not completed at one time, but is based on the size of the register, the type of the matrix elements in the first factor matrix A, the second factor matrix B and the product matrix C, and the accelerator 200 The calculation capability of the computing unit and other factors are comprehensively considered and executed multiple times to complete the process. In other words, the first factor register set within a single thread includes at least part of the data in a single row of the first tensor, and the first factor register set includes one or more registers, the specific number of which can be supported by a single round of tensor multiply instructions. The length is determined by, for example, 2 registers, each register including one or more data elements; for example, for the int8 data type, 2 registers include 8 data elements. The number of threads involved in tensor multiplication is proportional to the number of rows of the first tensor. For example, the number of rows of the first tensor can be 256, and the number of threads participating in tensor multiplication can also be 256.

In one embodiment, the first tensor multiply instruction may further be, for example, @p1,mm8.sa.ub.R0,ur4:rf290:0x00,R256. It is the same or similar to @p1, mm.R0, ur4:rf290:0x00, R256 and will not be repeated here. Please refer to the relevant description above. mm8 indicates that the data type of the elements involved in matrix multiplication is 8 bits, sa indicates that the element data in the first factor matrix A associated with register R256 is signed int8, ub indicates that it is related to the logical address ur4:rf290:0x00 The element data in the connected second factor matrix B is unsigned uint8. It can be understood that the types of matrix elements in the first factor matrix A, the second factor matrix B and the product matrix C can also be other data types, and this disclosure is not limited thereto.

Since matrix multiplication involves multiple dot product operations of multiple matrix elements, in some embodiments multiple operations can be performed in segments, and the results of multiple dot product operations are accumulated to obtain the final mm result. In one embodiment, for example based on @p1,mm8.sa.sb.R0,ur4:rf290:0x00,R256, the first factor register representation R256 and the memory logical address ur4:rf290:0x00 can be determined. For the first thread in the second thread set, the first factor register representation R256 and the memory logical address ur4:rf290:0x00 may, for example, correspond to the first register in the first set of registers and the reference point of the tensor segment of matrix B data block. The first factor set is stored in the first register, such as A[1][1], and the data block of the reference point of the tensor segment of matrix B represents the data block of the reference point of the tensor segment of matrix B, such as B[ 1][1]. After matrix multiplication, the first point product set A[1][1]×B[1][1] in the first row of the third tensor of the product matrix C can be obtained. In another embodiment, the first set of factors may include A[1][1] and A[1][2], and the second set of factors may include B[1][1] and B[2][1] , so the first point product set can include A[1][1]×B[1][1]+A[1][2]×B[2][1]. In yet another embodiment, the first set of factors may be A[1][1], A[1][2], and A[1][3], and the second set of factors may include B[1][1] , B[2][1] and B[3][1], so the first point product set can include A[1][1]×B[1][1]+A[1][2]×B [2][1]+A[1][3]×B[3][1]. The present disclosure does not limit the range of the first factor set, the second factor set and the first point product set. The range can be flexibly adjusted by the programmer based on factors such as the data type of the matrix element and the register capacity when programming the matrix multiplication. Configuration, such as automatic configuration by setting the data type in the tensor multiply instruction.

Although an example of a single product element C[1][1] in the product matrix C is used to describe here, it can be understood that this is only illustrative and does not limit the scope of the present disclosure. In some embodiments, a single thread can perform parallel computations on multiple product elements in a row of the product matrix C. For example, the first thread in the second thread set can calculate the respective first point product sets A[1][1]×B[1][1] of C[1][1]-C[1][8] in parallel. , A[1][1]×B[1][2], A[1][1]×B[1][3], A[1][1]×B[1][4], A [1][1]×B[1][5], A[1][1]×B[1][6], A[1][1]×B[1][7] and A[1 ][1]×B[1][8]. In another embodiment, the first thread can also calculate the respective first point product sets A[1][1]×B[1][ of C[1][1]-C[1][8] in parallel. 1]+A[1][2]×B[2][1], A[1][1]×B[1][2]+A[1][2]×B[2][2] , A[1][1]×B[1][3]+A[1][2]×B[2][3], A[1][1]×B[1][4]+A [1][2]×B[2][4], A[1][1]×B[1][5]+A[1][2]×B[2][5], A[1 ][1]×B[1][6]+A[1][2] ×B[2][6], A[1][1]×B[1][7]+A[1][ 2]×B[2][7] and A[1][1]×B[1][8]+A[1][2]×B[2][8].

At 808, the first point product is accumulated by the first thread in the second set of threads into a first set of product registers corresponding to the first product register representation. For example, the first thread can accumulate the dot product result of the above calculation into the corresponding first set of product registers, such as R0-R7 registers. Similar to the above, the range of registers included in the first group of product registers can be flexibly configured by the mm instruction. By decomposing the matrix and allocating threads by rows, multiple threads can process multiple rows of the matrix tensor in parallel, thereby speeding up the processing efficiency of matrix multiplication. In addition, because programmers know the row-column structure of matrix tensors and the thread status in the accelerator when programming, they can flexibly use threads to process matrix multiplications in parallel, thereby improving programming flexibility.

In some embodiments, method 800 further includes, in response to receiving the second set of factors, a second thread in the second set of threads converting the third set of factors in the second row of the first tensor to the third set of factors based on the first factor register representation. performing a dot product operation on the two factor sets to generate a second dot product set in the second row of the third tensor; and accumulating the second dot product set by the second thread into the second set corresponding to the first product register representation in the product register. It can be understood that although the first thread and the second thread in the second thread set have the same first tensor multiplication instruction, for example, in one embodiment, the first tensor multiplication instruction may be represented as @p1,mm8.sa. sb.R0,ur4:rf290:0x00,R256, but since some other instructions such as load instructions can be used to load the first row of data of the first tensor to the first thread, and the second row of data to the second thread , so the first thread and the second thread can correctly perform dot product operations based on the loaded data of the first tensor.

Identical or similar to the first thread in the second thread set, the second thread in the second thread set also includes a first set of registers, such as R256-R257 for storing the third set of factors in the second row of the first factor matrix, And also includes a second set of registers, such as R0-R255, for storing the second dot product set of the second row of the third tensor. The first thread and the second thread actually perform parallel mm calculations respectively for the first and second rows of the first factor matrix A and respectively for the first and second rows of the first product matrix C, so by parallel Calculation can greatly save calculation time. In addition, since there is a fixed correspondence between each thread and each matrix row, it can also avoid multiple threads dynamically allocating matrix multiplication calculation tasks according to their busyness (for example, one thread can calculate two matrix rows, while another thread only The overhead caused by computing a portion of a matrix row).

In some situations, such as when the number of threads is much larger than the number of rows in the product matrix C, some threads may become idle. For example, when the PE unit includes 64 threads and the number of rows in the product matrix C is only 16, if each row is still assigned only one thread, 48 threads will be idle. In this case, multiple threads (eg, the first thread and the third thread in the second thread set) can be used by setting the first merge mode indication or the second merge mode indication in the tensor multiplication instruction. Calculation of a row in the product matrix C.

For example, in one embodiment, the first tensor multiply instruction also includes a first merge calculation mode indication, such as KA2. KA2 indicates that two threads participate in the calculation of a matrix row. In other embodiments, the first combined calculation mode indication may include other indications such as KA1, KA3, KA4, etc., and the only difference lies in the number following KA. KA1 means that a single thread participates in the calculation of one matrix row, KA3 means that three threads participate in the calculation of one matrix row, and so on. In some embodiments, in the absence of a first merge calculation mode indication, a single thread may default to performing the calculation of one matrix row. In the case where the first combined calculation mode is indicated as KA2, an illustrative example of the first tensor multiplication instruction received by the first thread and the third thread may be, for example, @p1,mm8.KA2.sa.sb.R0,ur4 :rf290:0x00,R256. It can be understood that KA1-KA4 only use one implementation method for representing the first combined calculation mode indication, and other characters or other representation methods may be used to express the first combined calculation mode indication.

It can be seen that by adding the first combined calculation mode indication KA2, the first thread and the third thread in the second thread set jointly calculate the product elements in the same row in the product matrix C. For example, the first thread is used to calculate the first set of product elements C[1][1]-C[1][127], while the third thread is used to calculate the second set of product elements C[1][128]-C [1][256], or the first thread is used to calculate the first set of product elements C[1][1], C[1][3], C[1][5]…C[1][255] , and the third thread is used to calculate the second set of product elements C[1][2], C[1][4], C[1][6]...C[1][256].

In this case, based on the first combined calculation mode indication and the first factor register representation, the first thread performs a dot product operation on the first factor set in the first row and the second factor set in the second tensor to generate the The first point product in the first row of the three tensors is accumulated, and the first point product is accumulated into the second set of registers in the first thread. The third thread performs a dot product operation on the first factor set and the fourth factor set of the second tensor based on the first merge calculation mode indication and the first factor register representation to generate the first factor set in the first row of the third tensor. In the three-point product set, the fourth factor set is different from the second factor set, and the third point product set is different from the first point product set. The third thread further accumulates the third point product into a third set of product registers corresponding to the first product register representation, the third set of product registers being located in the third thread. It can be understood that the first combined calculation mode indication calculation mode indication can be used in conjunction with the above embodiment with respect to FIG. 8 , so various aspects described with respect to FIG. 8 will not be described again here.

In another embodiment, the first tensor multiply instruction also includes a second merge calculation mode indication, such as KB2. KB2 indicates that two threads jointly participate in the calculation of each product element in the product matrix. In other embodiments, the second combined calculation mode indication may include other indications such as KB1, KB3, KB4, etc., and the only difference lies in the number following KB. KB1 indicates that a single thread participates in the calculation of each product element in the product matrix, KB3 indicates that three threads jointly participate in the calculation of each product element in the product matrix, and so on. In some embodiments, in the absence of a second merge calculation mode indication, a single thread may default to performing the calculation of one matrix row. In the case where the second merge calculation mode is indicated as KB2, an illustrative example of the first tensor multiplication instruction received by the first thread and the fourth thread in the second thread set may be, for example, @p1,mm8.KB2.sa .sb.R0,ur4:rf290:0x00,R256. It can be understood that KB1-KB4 only use one implementation method for expressing the second combined calculation mode indication, and other characters or other expression methods may be used to express the second combined calculation mode indication.

It can be seen that by adding the second combined calculation mode indication KB2, the first thread and the fourth thread in the second thread set jointly participate in the calculation of each product element in the product matrix. Specifically, for example, for the dot product A[1][1]×B[1][1]+A[1][2]×B[2][1], the first thread can calculate A[1][ 1]×B[1][1], the fourth thread can compute A[1][2]×B[2][1] in parallel with the first thread, and the first and fourth threads then add. For example, the fourth thread sends the product of to the first thread, and the first thread performs an addition operation to get the dot product result. The first thread accumulates the dot product result into the product register. For A[1][1]×B[1][2]+A[1][2]×B[2][2], A[1][1]×B[1][3]+A [1][2]×B[2][3], A[1][1]×B[1][4]+A[1][2]×B[2][4], A[1 ][1]×B[1][5]+A[1][2]×B[2][5], A[1][1]×B[1][6]+A[1][ 2]×B[2][6], A[1][1]×B[1][7]+A[1][2]×B[2][7] and A[1][1] ×B[1][8]+A[1][2]×B[2][8], the first thread and the fourth thread can operate similarly to obtain the first point product set. In another embodiment, the first thread may default to sending the product to the fourth thread, and the fourth thread performs the addition of the products and accumulates the dot product result into the fourth thread's product register.

For another example, for the dot product A[1][1]×B[1][1]+A[1][2]×B[2][1]+A[1][3]×B[3] [1]+A[1][4]×B[4][1], the first thread can calculate A[1][1]×B[1][1]+A[1][2]×B [2][1], the fourth thread in the second thread set can calculate A[1][3]×B[3][1]+A[1][4]×B[4 in parallel with the first thread ][1], the first thread then performs addition processing. For example, the fourth thread sends the dot product to the first thread, and the first thread performs an addition operation to get the dot product result. The first thread then accumulates the dot product result into the product register. In another embodiment, the first thread may default to sending the dot product to the fourth thread, and the fourth thread performs the addition of the dot product and accumulates the dot product result into the product register of the fourth thread.

In this case, the first thread performs a dot product operation on the first factor set in the first row and the second factor set in the second tensor based on the second merge calculation mode indication and the first factor register representation, to generate the first point product in the first row of the third tensor and accumulate the first point product into the second set of registers of the first thread. Based on the second combined calculation mode indication and the first factor register representation, the fourth thread performs a dot product operation on the fifth factor set in the first row and the sixth factor set of the second tensor to generate a third The fourth point product set in the first row of the tensor, the fifth factor set is different from the first factor set, the sixth factor set is different from the second factor set, and the fourth point product set is different from the first point product set. The first thread further accumulates the fourth point product set to a third set of product registers corresponding to the first product register representation, the third set of product registers being located in the first thread. It can be understood that the second combined calculation mode indication calculation mode indication can be used in conjunction with the above embodiment with respect to FIG. 8 , so various aspects described with respect to FIG. 8 will not be described again here.

In addition, in some situations, for example, when the number of threads is much larger than the number of rows of the product matrix, the first merge calculation instruction can be used in combination with the second merge calculation instruction. That is, not only can each row of the product matrix be divided into different parts and calculated by different thread groups, but each dot product element within each row can also be calculated by a different thread. For example, for C[1][1]-C[1][8], C[1][1]-C[1][4] can be calculated by the first group of threads, while C[1][5 ]-C[1][8] can be calculated by the second set of threads. Further, for each dot product element, for example, C[1][1]=A[1][1]×B[1][1]+A[1][2]×B[2][1] +A[1][3]×B[3][1]+A[1][4]×B[4][1]+A[1][5]×B[5][1]+A [1][6]×B[6][1]+A[1][7]×B[7][1]+A[1][8]×B[8][1], the first The first thread in the group of threads calculates A[1][1]×B[1][1]+A[1][2]×B[2][1]+A[1][3]×B[ 3][1]+A[1][4]×B[4][1], while the second thread in the first set of threads computes A[1][5]×B[5][1]+A [1][6]×B[6][1]+A[1][7]×B[7][1]+A[1][8]×B[8][1], and so on .

In the calculation process of matrix multiplication, the second factor matrix is usually column-wise and the dot product operation is performed with the row elements of the first factor matrix. However, in some cases, the second factor matrix stored in a memory such as a DDR is typically physically stored row by row. Therefore, when a thread reads an element of the second factor matrix from the memory, such as B[1][1], based on the principle of spatial proximity, it usually also reads some elements that are physically close to the element to the L1 high speed at one time. In the cache, for example, B[1][2], B[1][3], B[1][4] and B[1][1] are read together into the L1 cache. However, during matrix multiplication, a thread may actually need elements of the same column, such as B[1][1] and B[2][1]. At this time, it takes several clock cycles to read B[2][1] from the memory as well as B[2][2], B[2][3] and B[ that are not needed in this calculation process. 2][4] into the L1 cache. In the normal case, B[1][2], B[1][3], B[1][4], B[2][2], B[2][3] and B[2][ 4] Typically discarded due to the dynamic flushing rules of the L1 cache. In the subsequent matrix calculation process, when B[1][2], B[1][3], B[1][4], B[2][2], B[2][3] or B [2][4], the thread re-reads the corresponding data from the memory into the L1 cache. It can be seen that such repeated reads greatly waste the time of transferring data from the memory to the L1 cache.

In some embodiments of the present disclosure, for a situation where, for example, the matrix elements of the second factor matrix B are stored in rows, a transpose indication is further set in the tensor multiplication instruction. In one embodiment, the first tensor multiply instruction also includes a transpose instruction. A further illustrative example of a first tensor multiplication instruction is @p1,mm8.KA1.T1.sa.sb R0,ur4:rf290:0x00,R256, where T1 indicates that the second factor matrix B needs to be transposed. In other embodiments, when the tensor multiplication instruction does not include a transpose indication, it may be assumed that the second factor matrix B does not need to be transposed. In yet other embodiments, T0 may be used in tensor multiplication instructions to indicate that the second factor matrix B does not need to be transposed.

The first thread in the second thread set can therefore dot product the first set of factors in the first row of the first tensor and the second set of factors in the second tensor based on the transpose indication and the first factor register representation to Generate the first point product in the first row of the third tensor. Specifically, the first set of threads loads the factors of multiple rows in the second tensor into the cache based on the transpose indication and the memory logical address. For example, the first thread set can divide B[1][1]-B[1][4], B[2][1]-B[2][4], B[3][1]-B[ 3][4] and B[4][1]-B[4][4] are both loaded into the L1 cache. The first set of threads then selects factors from multiple rows of factors by column, such as selecting B[1][1], B[2][1], B[3][1], and B[4][1] , to form a second set of factors and broadcast it to the second set of threads. The second set of threads then performs a dot product operation on the first set of factors and the second set of factors in the first row based on the first factor register representation to generate a first set of dot products in the first row of the third tensor. Note that since there is a transposition instruction T1 at this time, B[1][2]-B[1][4], B[2][2]-B[2][4], B[3][2 ]-B[3][4] and B[4][2]-B[4][4] are directly retained in the cache without being dynamically refreshed. In this way, the first thread in the second thread set does not need to read B[1][2]-B[1][4], B[2][2]-B[ from the memory again when performing subsequent matrix calculations. 2][4], B[3][2]-B[3][4] and B[4][2]-B[4][4], which can greatly save time.

Although here B[1][1]-B[1][4], B[2][1]-B[2][4], B[3][1]-B[3][4 ] and B[4][1]-B[4][4] are taken as examples to illustrate the transposition instruction, but it can be understood that this is only an illustration. The range of the second factor matrix B that can be used for transposition can vary. For example, when the number of rows of the second factor matrix B is another number of rows, such as 256 rows, the cache lines of all rows can be loaded into the high speed. Cache, and release from the cache until the data in the cache line has been used for matrix multiplication calculations. In this way, the time required to repeatedly read data from memory to the L1 cache can be greatly saved.

In the above, the principles and examples of matrix multiplication according to embodiments of the present disclosure are mainly described in the form of two-dimensional tensors. However, it can be understood that the present disclosure is not limited to matrix multiplication calculations in the form of two-dimensional tensors, but may include calculations of multiplications or convolutions of one-dimensional tensors or more-dimensional tensors. For a one-dimensional tensor, it is equivalent to one dimension of the two-dimensional tensor being 1, so I won’t go into details here.

For three-dimensional or higher-dimensional matrix calculations, the dimensions other than k dimensions in the first factor matrix A and the second factor matrix B can be reduced and decomposed to obtain an equivalent two-dimensional matrix. The k dimension is usually It is not decomposed because in order to perform matrix multiplication, the number of k columns in the first factor matrix A and the number of k rows in the second factor matrix B need to be equal.

In one embodiment, assume that the first factor tensor A is a three-dimensional tensor of m×x×k, and the second factor tensor B is a four-dimensional tensor of k×n×y×z, where k, m, n , x, y and z all represent positive integers. The first factor tensor A can be converted into a two-dimensional tensor of the form (m×x,k). That is, cut on the x dimension, and concatenate the cut x m×k two-dimensional tensors row by row to obtain the two-dimensional equivalent matrix A'. In this case, m×x threads can be used for parallel computation. In addition, similarly, the second factor tensor can be cut into y×z k×n two-dimensional matrices and spliced column-by-column in sequence to obtain a two-dimensional equivalent matrix B’. It can be understood that although the multiplication (convolution) of a three-dimensional tensor and a four-dimensional tensor is used as an example to illustrate matrix dimensionality reduction, this is only illustrative and does not limit the scope of the present disclosure. Dimensionality reduction of other multi-dimensional mm can be handled similarly and will not be repeated here. For mm after dimensionality reduction, please refer to the detailed description of mm mentioned above in Figure 8, which will not be described again here.

Figure 9 shows a schematic block diagram of an electronic device 900 according to one embodiment of the present disclosure. The electronic device 900 may be used to perform the method 800 shown in FIG. 8 , and therefore various aspects described with respect to FIG. 8 may be selectively applicable to the electronic device 900 . The electronic device 900 includes a receiving unit 902, a broadcast unit 903, a generating unit 904, and a storage unit 906.

The receiving unit 902 is configured to receive a first tensor multiplication instruction for a first thread set of the accelerator, the first tensor multiplication instruction including a first thread indication for the first thread set, a first factor register for the first tensor representation, the memory logical address for the second tensor, and the first product register representation for the third tensor. Broadcast unit 903 is configured to broadcast, by the first set of threads, a second set of factors in the second tensor to a second set of threads, the second set of threads being different from the first set of threads based on the memory logical address for the second tensor. The generation unit 904 is configured to perform a dot product operation on the first factor set and the second factor set in the first row of the first tensor based on the first factor register representation by the first thread in the second thread set to generate a third tensor. The first point in the first row of the quantity accumulates. Storage unit 906 is configured to accumulate the first point product by the first thread into the first set of product registers corresponding to the first product register representation. By decomposing the matrix and allocating threads by rows, multiple threads can process multiple rows of the matrix tensor in parallel, thereby speeding up the processing efficiency of matrix multiplication. In addition, because programmers know the row-column structure of matrix tensors and the thread status in the accelerator when programming, they can flexibly use threads to process matrix multiplications in parallel, thereby improving programming flexibility.

In one embodiment, each thread includes a first set of registers and a second set of registers, wherein the first set of registers is used to store at least a portion of the data in a row of the first factor matrix, and the second set of registers is used to store the data in a row of the product matrix. a row of data. The data in a column of the second factor matrix may come from on-chip memory, L1 cache, or off-chip memory. In this way, during the execution of matrix multiplication, the execution unit of the first thread can read the data in one row of the first factor matrix from the first set of registers only once, and reuse it during subsequent dot product operations. Furthermore, data in a column of the second factor matrix may be broadcast in parallel to execution units in multiple (eg, the same number or half the number of rows of the first factor matrix) threads and reused. In this way, data transmission between different storage devices can be reduced, thereby reducing the time caused by data transmission during matrix multiplication calculations.

In one embodiment, the generation unit 904 is further configured to, in response to receiving the second set of factors, a second thread in the second set of threads convert the third factor in the second row of the first tensor based on the first factor register representation. The dot product operation is performed on the set and the second set of factors to produce the second dot product set in the second row of the third tensor. Storage unit 908 is further configured to accumulate, by the second thread, the second dot product set into a second set of product registers corresponding to the first product register representation.

In one embodiment, the first tensor multiply instruction further includes a first merge calculation mode indication. The generation unit 904 is further configured to: based on the first combined calculation mode indication and the first factor register representation, perform a dot product operation on the first factor set and the second factor set in the first row by the first thread to generate a third The first point product in the first row of the tensor.

In one embodiment, the generation unit 904 is further configured to combine the first set of factors and the fourth factor of the second tensor by a third thread in the first set of threads based on the first combined calculation mode indication and the first factor register representation. The sets are subjected to a dot product operation to produce a third set of point products in the first row of the third tensor, the fourth set of factors is different from the set of second factors, and the third set of point products is different from the set of first point products. Storage unit 906 is further configured to accumulate, by the third thread, a third point product into a third set of product registers corresponding to the first product register representation.

In one embodiment, the first tensor multiply instruction further includes a second merge calculation mode indication. The generation unit 904 is further configured to perform a dot product operation on the first factor set in the first row and the second factor set in the second tensor by the first thread based on the second combined calculation mode indication and the first factor register representation, to Generate the first point product in the first row of the third tensor.

In one embodiment, the generation unit 904 is further configured to combine the fifth set of factors and the sixth factor of the second tensor by the fourth thread in the second set of threads based on the second combined calculation mode indication and the first factor register representation. Sets are subjected to a dot product operation to produce a fourth dot product set in the first row of the third tensor, the fifth factor set is different from the first factor set, the sixth factor set is different from the second factor set, and the fourth dot product The set is different from the first point product set. Storage unit 906 is further configured to accumulate, by the fourth thread, a fourth point product to the first set of product registers corresponding to the first product register representation.

In one embodiment, the first tensor multiply instruction also includes a transpose instruction. The generation unit 904 is further configured to: based on the transposition indication and the first factor register representation, perform a dot product operation on the first factor set in the first row and the second factor set in the second tensor by the first thread to generate the The first point product in the first row of three tensors.

In one embodiment, the generation unit 904 is further configured to: load the factors of the plurality of rows in the second tensor to the cache based on the transposition indication and the memory logical address; select the factors from the factors of the plurality of rows by columns to forming a second factor set; and based on the first factor register representation, performing a dot product operation on the first factor set and the second factor set in the first row by the first thread to generate the third tensor in the first row. A little accumulation. In one embodiment, unselected factors in multiple rows are retained in the L1 cache until the unselected factors are selected for calculation of matrix multiplication.

In one embodiment, the first thread set provides the second factor set corresponding to the memory logical address in a broadcast form in parallel to all threads in the second thread set.

In one embodiment, the memory logical address includes segment reference data and offset data, the segment reference data represents the starting address of the second tensor, and the offset data represents the offset of the second tensor in each of the multiple dimensions. Shift amount.

Furthermore, although operations are depicted in a specific order, this should be understood to require that such operations be performed in the specific order shown or in sequential order, or that all illustrated operations should be performed to achieve desirable results. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims.

Claims

A method performed by an accelerator that includes:

receiving a first tensor multiply instruction for a first thread set of the accelerator, the first tensor multiply instruction including a first thread indication for the first thread set, a first factor register representation for a first tensor, a memory logical address for the second tensor, and a first product register representation for the third tensor;

The first set of threads broadcasts a second set of factors in the second tensor to a second set of threads, the second set of threads being different from the first set of threads based on the memory logical address for the second tensor ;

The first thread in the second thread set performs a dot product operation on the first factor set in the first row of the first tensor and the second factor set based on the first factor register representation to generate the the first point product in the first row of the third tensor; and

The first point product is accumulated by the first thread into a first set of product registers corresponding to a first product register representation.
The method of claim 1, further comprising:

In response to receiving the second set of factors, a second thread in the second set of threads combines a third set of factors in a second row of the first tensor with the third set of factors based on the first factor register representation. perform a dot product operation on the set of two factors to generate a second set of dot products in the second row of the third tensor; and

The second set of point products is accumulated by the second thread into a second set of product registers corresponding to the first product register representation.

Provide execution conditions for each thread in the first thread set. For threads that do not meet the execution conditions, their memory access operations are regarded as exceeding the tensor address range and are ignored.
The method of claim 1, wherein the first tensor multiplication instruction further includes a first merge calculation mode indication;

The first point product in the first row that generates the third tensor consists of:

Based on the first combined calculation mode indication and the first factor register representation, the first thread performs a dot product operation on the first factor set and the second factor set in the first row to generate The first point product in the first row of the third tensor.
The method of claim 3, further comprising:

Based on the first merge calculation mode indication and the first factor register representation, the first factor set and a fourth factor set of the second tensor are merged by a third thread in the first thread set product operation to generate a third point product set in the first row of the third tensor, the fourth factor set is different from the second factor set, and the third point product set is different from the One point accumulation; and

The third set of point products is accumulated by the third thread into a third set of product registers corresponding to the first product register representation.
The method of claim 1, wherein the first tensor multiply instruction further includes a transpose instruction;

The first point product in the first row that generates the third tensor consists of:

Based on the transpose indication and the first factor register representation, the first thread performs a dot product operation on the first factor set in the first row and the second factor set in the second tensor to obtain Generate the first point product in the first row of the third tensor.
The method of claim 5, wherein based on the transpose indication and the first factor register representation, the first set of factors in the first row and the first set of factors in the second tensor are combined by the first thread. The second set of factors are dot producted to generate the first dot product set in the first row of the third tensor including:

Loading factors of a plurality of rows in the second tensor into a cache based on the transpose indication and the memory logical address;

selecting factors from the factors of the plurality of rows by column to form the second set of factors; and

Based on the first factor register representation, the first thread performs a dot product operation on the first set of factors in the first row and the second set of factors to generate the first row of the third tensor The first point product in .
The method according to any one of claims 1-6, wherein the first thread set provides the second factor set corresponding to the memory logical address to the second thread set in parallel in the form of broadcast Computational units in all threads without providing registers in all threads.
The method of claim 7, wherein the memory logical address includes segment reference data representing a starting address of the second tensor and offset data representing the first tensor. The offset of the two tensors in each of multiple dimensions.
The method of claim 1 or 3, wherein the first tensor multiplication instruction further includes a second merge calculation mode indication;

The first point product in the first row that generates the third tensor consists of:

Based on the second merge calculation mode indication and the first factor register representation, the first thread performs a dot product of the first set of factors in the first row and the second set of factors in the second tensor. Operation to generate the first point product in the first row of the third tensor.
The method of claim 9, further comprising:

Based on the second merge calculation mode indication and the first factor register representation, the fifth factor set and the sixth factor set of the second tensor are combined by a fourth thread in the second thread set. product operation to generate a fourth point product set in the first row of the third tensor, the fifth factor set is different from the first factor set, and the sixth factor set is different from the a second set of factors, the fourth point product set being different from the first point product set; and

The fourth set of point products is accumulated by the fourth thread to the first set of product registers corresponding to the first product register representation.
The method of claim 1, wherein

The first product register represents one or more product registers. The number of the one or more product registers is related to the combined calculation mode and the number of columns of the second tensor. The product registers of different threads constitute the result tensor. quantities, each thread's product register contains part or all of each row of the result tensor; and

The number of rows of the result tensor is the same as the number of rows of the first tensor, and the number of columns of the result tensor is the same as the number of columns of the second tensor.
The method of claim 11, wherein

The number of product registers within the threads in the second thread set is variable, and the number of product registers depends on the execution conditions of the first tensor multiplication instruction, and the execution conditions determine the number of multiplication instructions in the second tensor. column access; and

If the first column in the second tensor is not accessed, the first column in the second tensor does not participate in the matrix multiplication calculation.
The method of claim 1, wherein

The first tensor multiplication instruction is issued multiple times during a complete execution process, wherein the first tensor multiplication instruction is issued for the first time as a storage instruction to obtain the second tensor. Column data or row data; and

In response to obtaining the column data or row data in the second tensor, and the data of the first tensor has been stored in the first factor register, a first tensor multiplication instruction is issued in the form of a mathematical calculation instruction. Two or more emissions are used to perform the calculation of the results of each column within the row of the third tensor.
The method of claim 13, wherein

Before performing the second or multiple transmissions, check the corresponding token status of the first factor register;

If the token status indicates that the data of the first tensor has been stored in the first factor register, then a mathematical calculation instruction is issued, otherwise the emission queue is blocked until the data of the first tensor has been stored. is stored in the first factor register.
The method of claim 11, further comprising:

determining whether the range of product register usage for the third tensor exceeds the range of a register file within a single thread based on the first product register representation; and

If it is determined that the usage range of the product register for the third tensor exceeds the range of the register file within a single thread, the calculation operation or memory access operation that exceeds the range of the register file is ignored and an error is reported.
An electronic device including:

stream processor;

A page table device coupled to the stream processor;

memory;

A processing engine unit, coupled to the stream processor, the memory and the page table device, is configured to perform the method of any one of claims 1-15.
An accelerator that includes:

a receiving unit configured to receive a first tensor multiplication instruction for a first thread set of the accelerator, the first tensor multiplication instruction including a first thread indication for the first thread set, a first tensor multiplication instruction for a first tensor a first factor register representation, a memory logical address for the second tensor, and a first product register representation for the third tensor;

a broadcast unit configured to broadcast, by the first set of threads, a second set of factors in the second tensor to a second set of threads based on a memory logical address for the second tensor, the second set of threads being different on the first thread set;

a generation unit configured to dot, by a first thread in the second set of threads, a first set of factors in a first row in the first tensor and the second set of factors based on the first factor register representation. a product operation to generate the first set of point products in the first row of the third tensor; and

A storage unit configured to accumulate, by the first thread, the first point product into a first set of product registers corresponding to a first product register representation.
The accelerator of claim 17, wherein

The generation unit is further configured to: in response to receiving the second set of factors, a second thread in the second set of threads generate a second row of the first tensor based on the first factor register representation. perform a dot product operation on a third set of factors and the second set of factors to generate a second set of dot products in the second row of the third tensor; and

The storage unit is further configured to accumulate, by the second thread, the second set of point products into a second set of product registers corresponding to the first product register representation.
The accelerator of claim 18, wherein the first tensor multiply instruction further includes a first merge calculation mode indication;

The generation unit is further configured to:

Based on the first combined calculation mode indication and the first factor register representation, the first thread performs a dot product operation on the first factor set and the second factor set in the first row to generate The first point product in the first row of the third tensor.
The accelerator of claim 19, wherein

The generating unit is further configured to: based on the first combined calculation mode indication and the first factor register representation, a third thread in the first thread set combines the first set of factors and the second set of factors. A fourth set of factors of the tensor is subjected to a dot product operation to generate a third set of dot products in the first row of the third tensor, the fourth set of factors being different from the second set of factors, the The three-point product set is different from the first point product set; and

The storage unit is further configured to accumulate, by the third thread, the third point product set into a third set of product registers corresponding to the first product register representation.
The accelerator of claim 17, wherein the first tensor multiply instruction further includes a transpose instruction;

The generation unit is further configured to:

Based on the transpose indication and the first factor register representation, the first thread performs a dot product operation on the first factor set in the first row and the second factor set in the second tensor to obtain Generate the first point product in the first row of the third tensor.
The accelerator of claim 21, wherein the generation unit is further configured to:

Loading factors of a plurality of rows in the second tensor into a cache based on the transpose indication and the memory logical address;

selecting factors from the factors of the plurality of rows by column to form the second set of factors; and

Based on the first factor register representation, the first thread performs a dot product operation on the first set of factors in the first row and the second set of factors to generate the first row of the third tensor The first point product in .
The accelerator of claim 17 or 19, wherein the first tensor multiplication instruction further includes a second merge calculation mode indication;

The generation unit is further configured to:

Based on the second merge calculation mode indication and the first factor register representation, the first thread performs a dot product of the first set of factors in the first row and the second set of factors in the second tensor. Operation to generate the first point product in the first row of the third tensor.
The accelerator of claim 23, wherein the generation unit is further configured to:

Based on the second merge calculation mode indication and the first factor register representation, the fifth factor set and the sixth factor set of the second tensor are combined by a fourth thread in the second thread set. product operation to generate a fourth point product set in the first row of the third tensor, the fifth factor set is different from the first factor set, and the sixth factor set is different from the a second set of factors, the fourth point product set being different from the first point product set; and

The fourth set of point products is accumulated by the fourth thread to the first set of product registers corresponding to the first product register representation.
The accelerator of claim 17, wherein

The first product register represents one or more product registers. The number of the one or more product registers is related to the combined calculation mode and the number of columns of the second tensor. The product registers of different threads constitute the result tensor. quantities, each thread's product register contains part or all of each row of the result tensor; and

The number of rows of the result tensor is the same as the number of rows of the first tensor, and the number of columns of the result tensor is the same as the number of columns of the second tensor.
The accelerator of claim 25, wherein

The number of product registers within the threads in the second thread set is variable, and the number of product registers depends on the execution conditions of the first tensor multiplication instruction, and the execution conditions determine the number of multiplication instructions in the second tensor. column access; and

If the first column in the second tensor is not accessed, the first column in the second tensor does not participate in the matrix multiplication calculation.
The accelerator of claim 17, wherein

The first tensor multiplication instruction will be issued multiple times during a complete execution process, wherein the first tensor multiplication instruction is issued for the first time as a storage instruction to obtain the second tensor. column data or row data; and

In response to obtaining the column data or row data in the second tensor, and the data of the first tensor has been stored in the first factor register, the first tensor multiplication instruction uses a mathematical calculation instruction The method is emitted two or more times to perform the calculation of the results of each column within the row of the third tensor.
The accelerator according to claim 27, further comprising a checking unit configured to check the corresponding token status of the first factor register before performing two or more transmissions;

If the token status indicates that the data of the first tensor has been stored in the first factor register, then a mathematical calculation instruction is issued, otherwise the emission queue is blocked until the data of the first tensor has been stored. is stored in the first factor register.
The accelerator of claim 25, further comprising an out-of-bounds checking unit configured to determine, based on the first product register representation, whether the product register usage range for the third tensor exceeds the range of a register file within a single thread. ;as well as

If it is determined that the usage range of the product register for the third tensor exceeds the range of the register file within a single thread, the calculation operation or memory access operation that exceeds the range of the register file is ignored and an error is reported.
The accelerator according to any one of claims 17-22, wherein the first set of threads provides the second set of factors corresponding to the memory logical address to the second set of threads in parallel in the form of a broadcast Computational units in all threads without providing registers in all threads.