WO2023173639A1 - Method executed by accelerator, and electronic device - Google Patents


Info

Publication number: WO2023173639A1
Authority: WO (WIPO, PCT)
Prior art keywords: tensor, product, thread, factor, row
Application number: PCT/CN2022/107061
Other languages: French (fr), Chinese (zh)
Inventors: 杨经纬; 葛建明; 李甲; 桑永奇; 谢钢锋; 姚飞; 仇小钢
Original Assignee: 海飞科(南京)信息技术有限公司
Application filed by 海飞科(南京)信息技术有限公司
Publication of WO2023173639A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10: Complex mathematical operations
    • G06F 17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00: Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/38: Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F 7/48: Methods or arrangements for performing computations using exclusively denominational number representation, using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F 7/52: Multiplying; Dividing
    • G06F 7/523: Multiplying only
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • Embodiments of the present disclosure relate generally to the field of electronics, and more specifically to a method performed by an accelerator, and an electronic device.
  • Parallel high-performance multi-threaded multi-core processing systems such as graphics processing units (GPUs) process data much faster than in the past. These processing systems can break down complex calculations into smaller tasks and process them in parallel across multiple cores to increase processing efficiency and reduce processing time.
  • Tensor data usually represents one-dimensional or multi-dimensional array data in the computer field.
  • Image data is a typical two-dimensional tensor, which can be represented by a two-dimensional array.
  • A color image is three-dimensional array data: in addition to the height and width dimensions, it also includes a red, green, and blue (RGB) channel dimension.
  • Processing tensors such as two-dimensional arrays can include matrix multiplication, for example.
  • Embodiments of the present disclosure provide a method performed by an accelerator, and an electronic device.
  • In a first aspect, a method performed by an accelerator includes: receiving a first tensor multiplication instruction for a first thread set of the accelerator, the first tensor multiplication instruction including a first thread indication for the first thread set, a first factor register representation for a first tensor, a memory logical address for a second tensor, and a first product register representation for a third tensor; broadcasting, by the first thread set and based on the memory logical address for the second tensor, a second factor set in the second tensor to a second thread set, the second thread set being different from the first thread set; performing, by a first thread in the second thread set and based on the first factor register representation, a dot product operation on a first factor set in the first row of the first tensor and the second factor set to generate a first dot product set in the first row of the third tensor; and accumulating, by the first thread, the first dot product set into a first set of product registers corresponding to the first product register representation.
  • the first factor set includes at least part of the factor data in the first row of the first tensor.
  • the second factor set includes factor data for at least a portion of the second tensor.
  • the first dot product set includes at least a portion of the product data in the first row of the third tensor.
  • each thread includes a first set of registers and a second set of registers, wherein the first set of registers is used to store at least part of the data in a row of the first factor matrix, and the second set of registers is used to store a row of data in the product matrix.
  • Data in a column of the second factor matrix may be stored in on-chip memory, L1 cache, or off-chip memory.
  • In this way, the execution unit of the first thread needs to read the data in one row of the first factor matrix from the first set of registers only once, and that data is reused in the subsequent dot product operations with each column of the second factor matrix.
  • data in a column of the second factor matrix may be broadcast in parallel to the execution units in multiple threads (for example, as many threads as the first factor matrix has rows, or half that number) and reused. In this way, data transfer between different storage devices can be reduced, thereby reducing the latency caused by data transmission during matrix multiplication calculations.
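  • The row-per-thread layout and column broadcast described above can be modeled with a short sketch; plain Python stands in for the accelerator's threads, and all names are illustrative rather than taken from the patent:

```python
# Illustrative sketch (not the patented hardware): each "thread" holds one row
# of factor matrix A in its first register set and accumulates one row of the
# product C in its second register set. Each column of B is fetched once and
# broadcast to every thread, so B crosses the memory hierarchy once per column.
def broadcast_matmul(A, B):
    rows, inner = len(A), len(A[0])
    cols = len(B[0])
    C = [[0.0] * cols for _ in range(rows)]    # per-thread product registers
    for j in range(cols):
        # one column of B, broadcast to all threads and reused
        col = [B[k][j] for k in range(inner)]
        for t in range(rows):                  # conceptually parallel threads
            dot = sum(A[t][k] * col[k] for k in range(inner))
            C[t][j] += dot                     # accumulate into thread-local register
    return C
```

  • Each column of B is consumed by every thread after a single fetch, which is the data reuse this embodiment aims for.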
  • the method further includes: in response to receiving the second factor set, performing, by a second thread in the second thread set and based on the first factor register representation, a dot product operation on a third factor set in the second row of the first tensor and the second factor set to generate a second dot product set in the second row of the third tensor; and accumulating, by the second thread, the second dot product set into a second set of product registers corresponding to the first product register representation.
  • the first tensor multiplication instruction further includes a first merge calculation mode indication.
  • Generating the first dot product set in the first row of the third tensor includes: performing, by the first thread and based on the first merge calculation mode indication and the first factor register representation, a dot product operation on the first factor set in the first row and the second factor set to generate the first dot product set in the first row of the third tensor.
  • the method further includes: performing, by a third thread in the first thread set and based on the first merge calculation mode indication and the first factor register representation, a dot product operation on the first factor set and a fourth factor set of the second tensor to generate a third dot product set in the first row of the third tensor, the fourth factor set being different from the second factor set, and the third dot product set being different from the first dot product set; and accumulating, by the third thread, the third dot product set into a third set of product registers corresponding to the first product register representation.
  • the first tensor multiplication instruction further includes a second merge calculation mode indication.
  • Generating the first dot product set in the first row of the third tensor includes: performing, by the first thread and based on the second merge calculation mode indication and the first factor register representation, a dot product operation on the first factor set in the first row and the second factor set of the second tensor to generate the first dot product set in the first row of the third tensor.
  • the method further includes: performing, by a fourth thread in the second thread set and based on the second merge calculation mode indication and the first factor register representation, a dot product operation on a fifth factor set and a sixth factor set of the second tensor to generate a fourth dot product set in the first row of the third tensor, the fifth factor set being different from the first factor set, the sixth factor set being different from the second factor set, and the fourth dot product set being different from the first dot product set; and accumulating, by the fourth thread, the fourth dot product set into the first set of product registers corresponding to the first product register representation.
  • the first tensor multiplication instruction also includes a transpose indication.
  • Generating the first dot product set in the first row of the third tensor includes: performing, by the first thread and based on the transpose indication and the first factor register representation, a dot product operation on the first factor set in the first row and the second factor set in the second tensor to generate the first dot product set in the first row of the third tensor.
  • performing, by the first thread and based on the transpose indication and the first factor register representation, a dot product operation on the first factor set in the first row and the second factor set in the second tensor to generate the first dot product set in the first row of the third tensor includes: loading the factors of multiple rows in the second tensor into a cache based on the transpose indication and the memory logical address; selecting factors column by column from the factors of the multiple rows to form the second factor set; and performing, by the first thread and based on the first factor register representation, a dot product operation on the first factor set in the first row and the second factor set to generate the first dot product set in the first row of the third tensor.
  • unselected factors in the multiple rows are retained in the cache until they are selected for the matrix multiplication calculation.
  • the first thread set provides, by broadcast, the second factor set corresponding to the memory logical address in parallel to the computing units in all threads in the second thread set, but not to the registers of those threads.
  • the memory logical address includes segment reference data and offset data.
  • the segment reference data represents the starting address of the second tensor, and the offset data represents the offset in each of the multiple dimensions of the second tensor.
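  • As a rough model of such a segment-plus-offset logical address (the row-major stride order is an assumption for illustration, not taken from the patent):

```python
# The address is the segment's base (reference data) plus per-dimension
# offsets folded into one linear displacement via the tensor's dimension
# sizes. Names are illustrative.
def logical_to_linear(seg_base, offsets, dims):
    # dims: size of each dimension, innermost last (row-major strides)
    addr, stride = seg_base, 1
    for off, size in zip(reversed(offsets), reversed(dims)):
        addr += off * stride
        stride *= size
    return addr
```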
  • the first product register representation represents one or more product registers; the number of product registers is related to the merge calculation mode and the number of columns of the second tensor; the product registers of different threads constitute a result tensor, and each thread's product registers contain part or all of a row of the result tensor; and the result tensor has the same number of rows as the first tensor and the same number of columns as the second tensor.
  • the number of product registers in the threads in the second thread set is variable; it depends on the execution conditions of the first tensor multiplication instruction, which are determined by the second tensor (for example, its number of columns).
  • the first tensor multiplication instruction is issued multiple times: it is first issued in the form of a storage instruction to obtain the column data or row data of the second tensor; and in response to obtaining that column data or row data, and the data of the first tensor having been stored in the first factor register, the first tensor multiplication instruction is issued a second time, or multiple further times, in the form of a mathematical calculation instruction to compute the results of each column within a row of the third tensor.
  • before the second or subsequent issues, the corresponding token status of the first factor register is checked; if the token status indicates that the data of the first tensor has been stored in the first factor register, the instruction is issued in the form of a mathematical calculation instruction; otherwise the emission queue is blocked until the data of the first tensor has been stored in the first factor register.
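  • The two-phase issue with a token-gated math phase can be modeled as follows; `token_ready`, the queue entries, and the return values are illustrative names, not the patent's:

```python
# First issue: a storage instruction fetches the second tensor's data.
# Subsequent issues: math instructions, gated on a token that marks the first
# tensor's data as resident in the factor registers; otherwise the queue blocks.
def issue_tensor_mul(token_ready, queue):
    queue.append("load_B")                 # first issue, as a storage instruction
    if not token_ready():
        return "blocked"                   # emission queue blocks on the token
    queue.append("math_dot_product")       # later issues, as math instructions
    return "issued"
```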
  • it is determined, based on the first product register representation, whether the product register usage range for the third tensor exceeds the range of the register file within a single thread; and if it is determined that the product register usage range for the third tensor exceeds the range of the register file within a single thread, calculation operations or memory access operations that exceed the range of the register file are ignored and an error is reported.
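  • A minimal sketch of this bounds check, assuming an illustrative register-file size:

```python
# If the product registers named by the instruction run past the end of the
# thread's register file, the operation is dropped and an error is reported.
def check_product_registers(first_reg, count, regfile_size=256):
    if first_reg + count > regfile_size:
        return "error: exceeds register file; operation ignored"
    return "ok"
```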
  • In a second aspect, an electronic device includes: a stream processor; a page table device coupled to the stream processor; a memory; and a processing engine unit coupled to the stream processor, the memory, and the page table device, configured to perform the method according to the first aspect.
  • In a third aspect, an electronic device includes: a receiving unit configured to receive a first tensor multiplication instruction for a first thread set of the accelerator, the first tensor multiplication instruction including a first thread indication for the first thread set, a first factor register representation for a first tensor, a memory logical address for a second tensor, and a first product register representation for a third tensor; a broadcast unit configured to broadcast, by the first thread set and based on the memory logical address for the second tensor, a second factor set in the second tensor to a second thread set, the second thread set being different from the first thread set; a generation unit configured to perform, by a first thread in the second thread set and based on the first factor register representation, a dot product operation on a first factor set in the first row of the first tensor and the second factor set to generate a first dot product set in the first row of the third tensor; and a storage unit configured to accumulate, by the first thread, the first dot product set into a first set of product registers corresponding to the first product register representation.
  • each thread includes a first set of registers and a second set of registers, wherein the first set of registers is used to store at least part of the data in a row of the first factor matrix, and the second set of registers is used to store a row of data in the product matrix.
  • the data in a column of the second factor matrix may come from on-chip memory, L1 cache, or off-chip memory.
  • data in a column of the second factor matrix may be broadcast in parallel to the execution units in multiple threads (for example, as many threads as the first factor matrix has rows, or half that number) and reused. In this way, data transfer between different storage devices can be reduced, thereby reducing the latency caused by data transmission during matrix multiplication calculations.
  • the generation unit is further configured to, in response to receiving the second factor set, perform, by a second thread in the second thread set and based on the first factor register representation, a dot product operation on a third factor set in the second row of the first tensor and the second factor set to generate a second dot product set in the second row of the third tensor.
  • The storage unit 908 is further configured to accumulate, by the second thread, the second dot product set into a second set of product registers corresponding to the first product register representation.
  • the first tensor multiplication instruction further includes a first merge calculation mode indication.
  • the generation unit is further configured to perform, by the first thread and based on the first merge calculation mode indication and the first factor register representation, a dot product operation on the first factor set in the first row and the second factor set to generate the first dot product set in the first row of the third tensor.
  • the generation unit is further configured to perform, by a third thread in the first thread set and based on the first merge calculation mode indication and the first factor register representation, a dot product operation on the first factor set and a fourth factor set of the second tensor to generate a third dot product set in the first row of the third tensor, the fourth factor set being different from the second factor set, and the third dot product set being different from the first dot product set.
  • the storage unit is further configured to accumulate, by the third thread, the third dot product set into a third set of product registers corresponding to the first product register representation.
  • the first tensor multiplication instruction also includes a second merge calculation mode indication.
  • the generation unit is further configured to perform, by the first thread and based on the second merge calculation mode indication and the first factor register representation, a dot product operation on the first factor set in the first row and the second factor set in the second tensor to generate the first dot product set in the first row of the third tensor.
  • the generation unit is further configured to perform, by a fourth thread in the second thread set and based on the second merge calculation mode indication and the first factor register representation, a dot product operation on a fifth factor set and a sixth factor set of the second tensor to generate a fourth dot product set in the first row of the third tensor, the fifth factor set being different from the first factor set, the sixth factor set being different from the second factor set, and the fourth dot product set being different from the first dot product set.
  • the storage unit is further configured to accumulate, by the fourth thread, the fourth dot product set into the first set of product registers corresponding to the first product register representation.
  • the first tensor multiplication instruction also includes a transpose indication.
  • the generation unit is further configured to perform, by the first thread and based on the transpose indication and the first factor register representation, a dot product operation on the first factor set in the first row and the second factor set in the second tensor to generate the first dot product set in the first row of the third tensor.
  • the generation unit is further configured to: load the factors of the multiple rows in the second tensor into the cache based on the transpose indication and the memory logical address; select factors column by column from the factors of the multiple rows to form the second factor set; and perform, by the first thread and based on the first factor register representation, a dot product operation on the first factor set in the first row and the second factor set to generate the first dot product set in the first row of the third tensor.
  • multiple unselected factors in multiple rows are retained in the first-level cache until the multiple unselected factors are selected for calculation of matrix multiplication.
  • the first thread set provides the second factor set corresponding to the memory logical address to all threads in the second thread set in parallel in the form of broadcast.
  • the memory logical address includes segment reference data and offset data.
  • the segment reference data represents the starting address of the second tensor, and the offset data represents the offset in each of the multiple dimensions of the second tensor.
  • the first product register representation represents one or more product registers; the number of product registers is related to the merge calculation mode and the number of columns of the second tensor; the product registers of different threads constitute a result tensor, and each thread's product registers contain part or all of a row of the result tensor; and the result tensor has the same number of rows as the first tensor and the same number of columns as the second tensor.
  • the number of product registers in the threads in the second thread set is variable; it depends on the execution conditions of the first tensor multiplication instruction, which are determined by the second tensor (for example, its number of columns).
  • the first tensor multiplication instruction is issued multiple times: it is first issued in the form of a storage instruction to obtain the column data or row data of the second tensor; and in response to obtaining that column data or row data, and the data of the first tensor having been stored in the first factor register, the first tensor multiplication instruction is issued a second time, or multiple further times, in the form of a mathematical calculation instruction to compute the results of each column within a row of the third tensor.
  • the accelerator further includes a checking unit configured to check the corresponding token status of the first factor register before the second or subsequent issues; if the token status indicates that the data of the first tensor has been stored in the first factor register, the instruction is issued in the form of a mathematical calculation instruction; otherwise the emission queue is blocked until the data of the first tensor has been stored in the first factor register.
  • the accelerator further includes an out-of-bounds checking unit.
  • the out-of-bounds checking unit is configured to determine, based on the first product register representation, whether the product register usage range for the third tensor exceeds the range of the register file within a single thread; and if it is determined that it does, calculation operations or memory access operations that exceed the range of the register file are ignored and an error is reported.
  • the first thread set provides, by broadcast, the second factor set corresponding to the memory logical address in parallel to the computing units in all threads in the second thread set, but not to the registers of those threads.
  • By using embodiments of the present disclosure, programmers can plan thread task allocation from a matrix perspective, so that one or more threads can be used to calculate the dot product of one row of the first factor matrix and the second factor matrix, and the corresponding results are accumulated into product registers within the same thread, thereby increasing the programming flexibility of matrix multiplication and improving its execution efficiency.
  • FIG. 1 illustrates a schematic diagram of an example environment in which various embodiments of the present disclosure can be implemented
  • Figure 2 shows a schematic block diagram of a chip according to an embodiment of the present disclosure
  • Figure 3 shows a schematic block diagram of a three-dimensional tensor according to one embodiment of the present disclosure
  • Figure 4 shows a schematic diagram of page allocation of image data according to one embodiment of the present disclosure
  • Figure 5 shows a schematic diagram of matrix multiplication according to one embodiment of the present disclosure
  • Figure 6 shows a schematic diagram of a portion of matrix multiplication according to one embodiment of the present disclosure
  • FIG. 7 shows a schematic diagram of a portion of matrix multiplication according to another embodiment of the present disclosure.
  • FIG. 8 shows a schematic flow diagram of a method performed by an accelerator according to one embodiment of the present disclosure.
  • Figure 9 shows a schematic block diagram of an electronic device according to one embodiment of the present disclosure.
  • the term “include” and its variations mean an open inclusion, ie, "including but not limited to.” Unless otherwise stated, the term “or” means “and/or”. The term “based on” means “based at least in part on.” The terms “one example embodiment” and “an embodiment” mean “at least one example embodiment.” The term “another embodiment” means “at least one additional embodiment”. The terms “first,” “second,” etc. may refer to different or the same object. Other explicit and implicit definitions may be included below.
  • programmers can plan thread task allocation from the perspective of the row-column structure of the matrix, so that one or more threads can be used to calculate the dot product of a row of the first factor matrix and the second factor matrix, and the corresponding results are accumulated into product registers within the same thread, thereby increasing the programming flexibility of matrix multiplication and improving its execution efficiency.
  • FIG. 1 shows a schematic diagram of an example environment 100 in which various embodiments of the present disclosure can be implemented.
  • Example environment 100 may be, for example, an electronic device with computing capabilities, such as a computer.
  • example environment 100 includes, for example, central processing unit (CPU) 20, system memory 10, northbridge/memory bridge 30, accelerator 40, device memory 50, and southbridge/input-output (IO) bridge 60.
  • System memory 10 may be, for example, volatile memory such as dynamic random access memory (DRAM).
  • the northbridge/memory bridge 30, for example, integrates a memory controller, a PCIe controller, etc., and is responsible for data exchange between the CPU 20 and the high-speed interface and bridging the CPU 20 and the southbridge/IO bridge 60.
  • Southbridge/IO bridge 60 is used for low-speed interfaces of computers, such as Serial Advanced Technology Interface (SATA) controllers, etc.
  • the accelerator 40 may include, for example, a device or chip such as a graphics processing unit (GPU) and/or an artificial intelligence (AI) accelerator for accelerating processing of graphics, video, and other data.
  • accelerator 40 may be a GPU.
  • accelerator 40 may be an AI chip.
  • Device memory 50 may be, for example, volatile memory external to accelerator 40 such as DRAM. In this disclosure, device memory 50 is also referred to as off-chip memory, i.e., memory located outside the chip of accelerator 40.
  • the accelerator 40 also has volatile memory inside the chip, such as a level one (L1) cache (cache) and an optional level two (L2) cache.
  • FIG. 2 shows a schematic block diagram of an accelerator 200 according to one embodiment of the present disclosure.
  • the accelerator 200 may be, for example, a specific implementation of the chip of the accelerator 40 in FIG. 1 .
  • the accelerator 200 is, for example, an accelerator chip such as a GPU.
  • accelerator 200 includes a stream processor (SP) 210, a page table device 220, a processing engine (PE) unit 230, a direct memory access (DMA) controller 240, an L1 cache 260, and an L2 cache 250.
  • the accelerator 200 is controlled by a host device such as the CPU 20 and receives instructions from the CPU 20.
  • the SP 210 analyzes instructions from the CPU 20 and assigns the analyzed operations to the PE unit 230, the page table device 220, and the DMA controller 240 for processing.
  • the page table device 220 is used to manage the on-chip virtual storage of the accelerator 200 .
  • L2 cache 250 and off-chip memory such as device memory 50 in FIG. 1 constitute a virtual storage system.
  • Page table device 220 is jointly maintained by SP 210, PE unit 230 and DMA controller 240.
  • the PE unit 230 includes a plurality of processing engines (PE) PE_1, PE_2...PE_N, where N represents an integer greater than 1.
  • Each PE in PE unit 230 may be a single instruction multi-thread (SIMT) device.
  • each thread can have its own register file, and all threads of each PE also share a unified register file.
  • Multiple PEs can perform the same or different processing work in parallel, and can perform, in parallel, the address translation and the access to target data in memory described below, thereby reducing processing time. It can be understood that the target elements processed by multiple PEs are not the same, and the segments, pages, and cache lines where the target elements are located, as well as the attributes, sizes, and dimension ordering of the elements, may differ, as described in detail below.
  • the logical address of the target element can be expressed as seg:RF:imm, where seg represents the segment base address register, RF represents the offset register, and imm represents the offset immediate value.
  • the logical address can include the reference data and offset data of the target element in each dimension of the first tensor.
  • the offset data represents the offset of the target element on each of the multiple dimensions of the first segment, and the segment reference data is the address of the segment starting point.
  • the first segment includes at least one page
  • the accelerator 200 may convert the logical address into a linear address based at least on dimensions of each dimension of the target element page.
  • the linear address includes the one-dimensional page identifier of the target element page and the one-dimensional offset value of the target element within the target element page.
  • the accelerator 200 can obtain the page number offset of the target element in each dimension according to the page size of the page in each dimension in the first segment, thereby obtaining the one-dimensional identification of the page where the target element is located.
  • the target element is located at the top level of the tensor in Figure 3, and the page identifier of the target element can be determined to be P[1] through the above method.
  • the accelerator can also obtain the relative offset of the target element in each dimension within the page, and based on this, determine the one-dimensional linear offset of the target element relative to the starting position of the page.
  • the one-dimensional identifier of the page and the one-dimensional linear offset within the page together constitute the linear address of the target element.
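  • For a two-dimensional segment, the page-identifier and in-page-offset computation described above can be sketched as follows; the two-dimensional restriction and all names are illustrative:

```python
# From the page size in each dimension we get the page-number offset per
# dimension, fold those into a one-dimensional page identifier, and keep the
# in-page remainders as a one-dimensional linear offset within the page.
def to_linear_address(off_y, off_x, page_h, page_w, pages_per_row):
    page_id = (off_y // page_h) * pages_per_row + (off_x // page_w)
    in_page = (off_y % page_h) * page_w + (off_x % page_w)
    return page_id, in_page                  # the element's linear address
```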
  • the accelerator 200 converts the linear address into a physical address based on the page table entry for the target element page, which page table entry includes the page physical address of each page in at least one page. Specifically, in one embodiment, after obtaining the page identifier of the target element, the accelerator 200 can search the corresponding item in the page table device 220 according to the page identifier to obtain the physical address of the page. This physical address plus the one-dimensional linear offset of the target element on the target element page is the physical address of the target element.
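  • A minimal model of this final lookup, with a Python dictionary standing in for the page table device 220 (names are illustrative):

```python
# The page identifier indexes the page table to get the page's physical base
# address; adding the one-dimensional in-page offset yields the target
# element's physical address.
def to_physical(page_table, page_id, in_page_offset):
    entry = page_table[page_id]            # page table entry for this page
    return entry["phys_base"] + in_page_offset
```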
  • the physical address may represent the storage address of the target element on the off-chip device memory 50 or on-chip memory, such as the L2 cache 250 .
  • alternatively, the page table entry of the target element page can store an address relative to another page, and the physical address of the target element is then obtained based on the offset of the target element page relative to the other page, the physical address of the other page, and the one-dimensional linear offset.
  • page table entries can also include other attributes, such as a status indicating whether the page has been loaded, that is, whether it is available; this disclosure does not limit this. Although a two-level translation of addresses is shown here, the disclosure is not limited thereto, and more stages of conversion are possible. For example, the page offset, cache line offset, and element offset can be calculated hierarchically and added in turn to the physical address to obtain the final physical address of the target element.
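The page-table lookup can be sketched in the same spirit. The entry layout, field names, and addresses below are illustrative assumptions, not the disclosure's actual page table device 220 format.

```python
class PageTableEntry:
    """Minimal illustrative page table entry: a page's physical start
    address plus a 'loaded' status attribute, as described in the text."""
    def __init__(self, phys_addr, loaded=True):
        self.phys_addr = phys_addr
        self.loaded = loaded

def linear_to_physical(page_table, page_id, linear_off):
    entry = page_table[page_id]
    if not entry.loaded:
        # The page must first be moved into on-chip memory.
        raise RuntimeError("page not loaded")
    # Physical address of the page plus the one-dimensional in-page offset.
    return entry.phys_addr + linear_off

page_table = {2: PageTableEntry(phys_addr=0x4000)}
addr = linear_to_physical(page_table, page_id=2, linear_off=6)
```

With these assumed values, the target element's physical address is 0x4000 + 6.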
  • the accelerator 200 moves the first page of the plurality of pages from the off-chip memory into the on-chip memory, and establishes a first page table entry corresponding to the first page.
  • the first page table entry stores the physical address of the first page in the on-chip memory. If the first page of the plurality of pages is moved from the on-chip memory back to the off-chip memory, the accelerator 200 may delete the first page table entry corresponding to the first page.
  • the accelerator converts the logical address of the target element in the first segment S1 to a physical address in on-chip virtual memory.
  • On-chip virtual memory may include on-chip L2 cache 250 and off-chip device memory 50 .
  • the logical address includes the segment reference data and offset data of the first segment in the tensor.
  • the segment reference data and offset data respectively represent the base address and offset of the target element in each of the multiple dimensions of the first segment.
  • Each thread can perform thread-level data exchange between its own register file and memory subsystem.
  • Each thread has its own arithmetic logic execution unit and uses its own storage address, which uses a typical load-store architecture.
  • Each execution unit includes a floating-point/fixed-point unit that supports multiple data types and an arithmetic logic unit.
  • Most instructions perform arithmetic and logical operations, such as addition, subtraction, multiplication, and division of floating-point and fixed-point numbers, or logical AND, OR, NOT, etc.
  • the operands come from registers.
  • Memory read and write instructions can provide data exchange between registers and on-chip/off-chip memory.
  • all execution units in a PE can execute the same instruction synchronously. By using the predicate register, some of the execution units can be masked to implement the function of a branch instruction.
  • the accelerator 200 of FIG. 2 may, for example, perform the following operations: 1) assemble the page table entry content and initial state; 2) transfer data from an off-chip memory, such as the device memory 50 in FIG. 1, to on-chip memory, such as the L2 cache 250; 3) start and execute the program; 4) define each segment, describing the tensor and its storage attributes; 5) when the program execution is completed, write the execution result data to off-chip memory.
  • the data processed by the accelerator 200 is mainly aimed at multi-dimensional tensors.
  • the tensor may be a four-dimensional tensor with four dimensions D1, D2, D3, and D4, and the tensor may have a different size in each dimension.
  • the tensor may be a one-dimensional, two-dimensional, three-dimensional or more dimensional tensor, which is not limited by this disclosure.
  • the tensor can internally support such as uint8, int8, bfloat16, float16, uint16, int16, float32, int32, uint32 and other custom element types, and the present disclosure does not limit this.
  • the basic unit of addressing is the element. For example, if the element type is int8, the basic unit is one byte; for another example, if the element type is int16, the basic unit of addressing is two bytes, and so on.
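As a small illustration, the element types listed above can be mapped to addressing-unit sizes in bytes. The int8 and int16 values follow the examples in the text; the remaining sizes are the standard widths of those types, and the table itself is only an illustrative aid, not part of the disclosure.

```python
# Addressing-unit size in bytes for each supported element type.
ELEMENT_SIZE = {
    "uint8": 1, "int8": 1,
    "bfloat16": 2, "float16": 2, "uint16": 2, "int16": 2,
    "float32": 4, "int32": 4, "uint32": 4,
}
```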
  • the amount of data contained in the tensor may be large, and the capacity of the L2 cache 250 is limited, so the entire tensor cannot be loaded into the on-chip L2 cache 250 .
  • the tensor may be divided into at least one segment. In the case where a tensor consists of only one segment, the tensor is a segment. In the case of a tensor containing multiple segments, the segments are part of the tensor.
  • the CPU 20 can specify which PE to process each part of the segment through instructions.
  • Figure 3 shows a schematic block diagram of a three-dimensional tensor 300 according to one embodiment of the present disclosure.
  • the three-dimensional tensor 300 has three dimensions D1, D2, and D3, and includes a first segment S1, a second segment S2, and a third segment S3.
  • CPU 20 may specify that the tensor elements of segment S1 are processed by PE_1, PE_2, PE_3, PE_4, PE_5, PE_6, PE_7, and PE_8.
  • CPU 20 may also specify that the tensor elements of the second segment S2 are processed by PE_1-PE_4.
  • each segment may have different sizes, so programmers can flexibly configure the segments based on design needs.
  • page division can be implemented in any one or more dimensions, and the number of pages divided in each dimension is independent of each other.
  • tensor data may be stored in on-chip high-speed memory, such as L2 cache 250.
  • programmers can divide the tensor into multiple segments, and each segment describes a part of the tensor.
  • the core program (kernel) can be started multiple times. Each time, the DMA controller 240 transfers a segment of the tensor from off-chip storage to on-chip storage in advance and makes it available for kernel operations. After starting the kernel multiple times, all segments contained in the tensor are processed and the entire running process ends.
  • if the on-chip high-speed memory is sufficient to accommodate all the tensors that the kernel needs to access, a tensor needs only one segment description, and the kernel needs to be started only once.
  • At least one page can also be set to further subdivide the tensor.
  • in the first segment S1 there are 4 pages P[1], P[2], P[3] and P[4].
  • the second segment S2 has only one page.
  • each segment may have a different number of pages, so programmers can flexibly configure the size of pages within a segment based on design needs.
  • the page is configured to fit into L2 cache 250 in its entirety.
  • a page can usually contain multiple elements.
  • the page on which the target element is located is referred to herein as the "target element page".
  • a page may include multiple cache lines. While the target element page may be located in the L2 cache 250, if a PE reads the target element via the L1 cache 260, the L2 cache 250 needs to transfer, in its entirety, a small portion of the data in the L2 cache 250 that includes the target element (located by physical address) to the L1 cache 260. This small portion of data is also called cache line data, and this caching mechanism is based on the principle of spatial locality.
  • it only takes a few clock cycles for a PE to read data from the L1 cache 260, while it may take dozens or even hundreds of clock cycles for the L1 cache 260 to read data from the L2 cache 250. It is therefore desirable to reduce the number of times the L1 cache 260 reads data from the L2 cache 250.
  • although a "cache line" is used here to describe the smallest unit of data transferred from the L2 cache 250 to the L1 cache 260, in this disclosure this portion of data is not necessarily arranged in rows or columns; the data inside a "cache line" may be distributed across multiple dimensions, and the amount of data distributed in each dimension is not limited to 1.
  • PE performs parallel processing on the data within a segment. The allocation of PE is expanded in the logical address space of the data and is independent of the physical storage structure of the segment, as described below.
  • the first set of cache lines in the first page P[1] is designated to be processed by PE_1, and the second set of cache lines is designated to be processed by PE_2.
  • the tensor is shown to be processed by multiple PEs in sequence here, it can be understood that the processing of tensor data is independent of the order of the PEs, and the present disclosure is not limited to this.
  • the portion of the tensor data labeled PE_2 in Figure 3 may in general be processed by PE_M, where M represents any integer not greater than N.
  • FIG. 4 shows a schematic diagram of page allocation of image data 400 according to one embodiment of the present disclosure.
  • Image data is typically a two-dimensional tensor.
  • the image data 400 is, for example, 8*8 pixels.
  • image data 400 has 8 pixels in the first dimension D1 and also has 8 pixels in the second dimension D2. Therefore, image data 400 has pixels P00, P01...P77.
  • the image data 400 has only one segment, but is divided into 4 pages P[1], P[2], P[3] and P[4] in two dimensions.
  • the four pages can be divided according to the second dimension D2 to be allocated to PE_1 and PE_2 for processing, or they can be divided according to the first dimension D1 to be allocated to PE_1 and PE_2 for processing. In addition, it can also be divided diagonally. This disclosure does not limit this.
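For the quadrant-style division of Figure 4 (four 4×4 pages covering the 8×8 image), the mapping from a pixel's coordinates to its page can be sketched as follows. The 4×4 page shape, the row-major page numbering, and the function name are assumptions for illustration; as noted above, division along only D1, only D2, or diagonally is equally possible.

```python
def page_of_pixel(d1, d2, page_d1=4, page_d2=4, pages_per_row=2):
    """Return the 1-based page number P[...] containing pixel (d1, d2),
    assuming the 8x8 image is split into 4x4 quadrant pages."""
    return (d1 // page_d1) * pages_per_row + (d2 // page_d2) + 1

# Pixel P00 falls in P[1]; pixel P77 falls in P[4].
corner_pages = [page_of_pixel(0, 0), page_of_pixel(7, 7)]
```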
  • Figure 5 shows a schematic diagram of matrix multiplication 500 according to one embodiment of the present disclosure.
  • Tensors can typically contain one or more dimensions. Two-dimensional tensors can be thought of as matrices. In some situations, it may be necessary to perform matrix multiplication of two two-dimensional matrices to obtain the product matrix.
  • matrix A represents a first factor matrix
  • matrix B represents a second factor matrix.
  • the first factor matrix A 502 is multiplied by the second factor matrix B 504 to obtain the product matrix C 506.
  • a "dot product operation" may include a multiplication operation of corresponding matrix elements and optionally a product addition operation.
  • the first factor matrix 502 may be an m ⁇ k matrix
  • the second factor matrix 504 may be a k ⁇ n matrix, where m, k and n all represent positive integers.
  • the product matrix is therefore an m ⁇ n matrix. It can be seen that the first factor matrix 502 includes m rows and k columns, the second factor matrix 504 includes k rows and n columns, and the product matrix therefore includes m rows and n columns.
  • C[1][1] can be expressed by the following formula (1):

    C[1][1] = A[1][1]×B[1][1] + A[1][2]×B[2][1] + … + A[1][k]×B[k][1]    (1)
  • similarly, C[m][1] and C[m][n] can be expressed by the following formulas (2) and (3):

    C[m][1] = A[m][1]×B[1][1] + A[m][2]×B[2][1] + … + A[m][k]×B[k][1]    (2)

    C[m][n] = A[m][1]×B[1][n] + A[m][2]×B[2][n] + … + A[m][k]×B[k][n]    (3)
  • matrix C includes m ⁇ n matrix elements, and each matrix element is formed by adding k product results.
  • the product result represents the result of multiplying one matrix element of matrix A and one matrix element of matrix B
  • the dot product result represents multiple matrix elements of matrix A. The result obtained by multiplying the corresponding multiple matrix elements in matrix B and adding the multiple product results.
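The relations above amount to the standard rule that each element C[i][j] accumulates k product results A[i][t]×B[t][j]. A minimal sketch in plain Python (not the accelerator's instruction set) makes this concrete:

```python
def matmul(A, B):
    """Reference matrix multiplication: C[i][j] = sum over t of A[i][t]*B[t][j]."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            for t in range(k):        # k products accumulated per element
                C[i][j] += A[i][t] * B[t][j]
    return C

A = [[1, 2], [3, 4]]    # m=2 rows, k=2 columns
B = [[5, 6], [7, 8]]    # k=2 rows, n=2 columns
C = matmul(A, B)        # [[19, 22], [43, 50]]
```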
  • Figure 6 shows a schematic diagram of matrix multiplication 600 according to one embodiment of the present disclosure.
  • product matrix C 602 may include m rows and n columns, with each row corresponding to a thread. Each thread includes n registers for storing n dot product results for each row.
  • m threads can be executed in parallel to improve execution efficiency.
  • all registers corresponding to matrix C can be initialized to 0 first.
  • the calculation of C[1][1] includes k multiplication operations and k-1 addition operations (in fact equivalent to k accumulations: because the matrix elements are initialized to 0, adding the first product result is equivalent to accumulating with 0).
  • the first thread first calculates the first product result A[1][1]×B[1][1] of matrix element C[1][1], and in parallel the second thread first calculates the first product result A[2][1]×B[1][1] of matrix element C[2][1], and so on. That is, all m threads first calculate the first product result of the first matrix element of the corresponding row of matrix C. It can be understood that at this time the complete result of the first column of the product matrix C 602 has not yet been obtained, nor has the calculation of the columns other than the first column of each row of the product matrix C 602 been carried out.
  • the first thread then computes the first product result A[1][1]×B[1][2] of the second-column element C[1][2], and in parallel the second thread computes the first product result A[2][1]×B[1][2] of matrix element C[2][2], and so on. That is, the m threads calculate the first product result of the second matrix element of the corresponding row of matrix C. At this time, the complete results of the first and second columns of the product matrix C 602 have not yet been obtained, and the calculation of the columns other than the first and second columns of each row of the product matrix C 602 has not been carried out.
  • in this way, the first product result of every column's matrix element in each row of the product matrix C 602 is obtained.
  • the first thread then computes the second product result A[1][2]×B[2][1] of matrix element C[1][1] and adds it to the first product result A[1][1]×B[1][1]; in parallel, the second thread computes the second product result A[2][2]×B[2][1] of matrix element C[2][1] and adds it to the first product result A[2][1]×B[1][1], and so on.
  • all columns of matrix C 602 are calculated in this way. That is, all m threads compute, for each element of the corresponding row of matrix C, the sum of the second product result and the first product result.
  • the calculation of matrix C 604 actually includes k rounds. Each round calculates a part of each matrix element of matrix C and accumulates the result with the results of previous rounds in the corresponding register. As shown in Figure 6, each matrix element of matrix C 602 has the same color pattern, which indicates that each matrix element has undergone the same number of rounds of multiply-accumulate. Each matrix element of matrix C 604 is the final result obtained after k rounds of accumulation, so the color of each matrix element is darker than that of matrix C 602.
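The k-round schedule of Figure 6 can be sketched as follows. The thread parallelism is only simulated here by the loop over rows, and all names are illustrative; what matters is the loop order: in each of the k rounds, every "thread" sweeps all n columns of its row and accumulates one more product result into its registers.

```python
def matmul_rounds(A, B):
    """Figure-6 style schedule: k outer rounds; in round t every thread
    (one per row of C) accumulates its t-th product for every column."""
    m, k, n = len(A), len(B), len(B[0])
    regs = [[0] * n for _ in range(m)]       # per-thread product registers, initialized to 0
    for t in range(k):                       # k rounds of accumulation
        for i in range(m):                   # each "thread" i, conceptually in parallel
            for j in range(n):               # sweep all columns of the thread's row
                regs[i][j] += A[i][t] * B[t][j]
    return regs

C6 = matmul_rounds([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```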
  • Figure 7 shows a schematic diagram of matrix multiplication 700 according to another embodiment of the present disclosure.
  • multiple threads can first calculate in parallel the accumulation of all product results of each matrix element of matrix C, and then calculate the matrix elements of the next column of matrix C column by column.
  • the matrix elements in the first column of matrix C 702 have a darker color than those in the nth column, which indicates that the matrix elements in the first column have undergone the same number of rounds of multiply-accumulate, while the matrix elements in the last column have not been calculated at this time and, for example, still have the initial value 0.
  • Each matrix element of matrix C 704 is the final result obtained after k rounds of accumulation.
  • the color of the matrix elements in the first column of matrix C 704 is the same as that of the first column of matrix C 702. This shows that the first column of matrix C 702 is calculated first, after which the next round of calculation is performed. Similar to the embodiment of Figure 6, the k dimension can also be divided into s segments, with the accumulation of the product results within one of the s segments calculated each time.
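The column-by-column schedule of Figure 7 differs from Figure 6 only in loop order: each column of C is fully accumulated over all k products before any thread moves to the next column. A minimal sketch, with thread parallelism again simulated by the row loop:

```python
def matmul_by_columns(A, B):
    """Figure-7 style schedule: finish one column of C completely
    (all k accumulations) before moving on to the next column."""
    m, k, n = len(A), len(B), len(B[0])
    regs = [[0] * n for _ in range(m)]       # per-thread product registers
    for j in range(n):                       # one column of C at a time
        for i in range(m):                   # each "thread" i, conceptually in parallel
            for t in range(k):               # full accumulation for C[i][j]
                regs[i][j] += A[i][t] * B[t][j]
    return regs

C7 = matmul_by_columns([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Both schedules produce the same result tensor; they differ only in which partial sums exist at intermediate points.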
  • although each row of the product matrix C is described above as obtained by one thread performing the matrix calculation, this is only for illustration and does not limit the scope of the present disclosure.
  • when the number of threads is significantly greater than the number of matrix rows, for example 2 times, 3 times, or more the number of rows of the product matrix, 2, 3, or more threads can be used to calculate each row of the product matrix C, as described below.
  • each thread can be assigned corresponding information about the first factor matrix A, the second factor matrix B, and the product matrix C to perform part of the matrix multiplication task. To flexibly and efficiently utilize the computing resources in the PE unit.
  • the general concept of matrix multiplication is described above in Figures 5 to 7. Some embodiments of matrix multiplication will be described in detail below in conjunction with Figure 8.
  • Figure 8 shows a schematic flow diagram of a method 800 performed by an accelerator according to one embodiment of the present disclosure.
  • Method 800 is used to perform matrix multiplication as shown above in conjunction with Figures 5-7.
  • receive a first tensor multiplication instruction for a first thread set of the accelerator, the first tensor multiplication instruction including a first thread indication for the first thread set, a first factor register representation for a first tensor, a memory logical address for a second tensor, and a first product register representation for a third tensor.
  • the electronic device may have two thread sets, wherein the first thread set is used to broadcast the data of matrix B to the computing units of the threads in the second thread set.
  • the first thread set provides the second factor set corresponding to the memory logical address in parallel to all threads or part of the threads in the second thread set in the form of broadcast.
  • the first set of threads is configured to broadcast data for matrix B
  • the first set of threads is configured to perform A ⁇ B in response to receiving data for matrix A.
  • each thread in the second thread set includes a first set of registers for storing at least a portion of the data in a row of the first factor matrix, and a second set of registers for storing at least a portion of a row of dot product results of the third tensor.
  • a first tensor multiply instruction is for example @p1,mm.R0,ur4:rf290:0x00,R256, where @p1 represents the guard predicate operand associated with the first thread.
  • @p1 can for example be a boolean predicate variable for the first thread. If the predicate value is false, the data load operation for this instruction is not performed.
  • ur4:rf290:0x00 is used to access memory in the normal way, such as the L1 cache 260, the L2 cache 250, or dynamic random access memory (DRAM) such as DDR (Double Data Rate) memory controlled by the DMA controller 240, and the first thread set broadcasts the obtained data content to all threads in the second thread set.
  • execution conditions for each thread can be provided. For threads that do not meet the execution conditions, their memory access is regarded as exceeding the tensor address range and is ignored, or the tensor multiplication operation to be performed by the corresponding thread of the second thread set is abandoned.
  • R0 represents the starting register in the second set of registers used to store each product element of a row in the product matrix C.
  • the registers R0-R255 are used to store each product element of a row in the product matrix C.
  • ur4:rf290:0x00 represents the logical address of the second factor matrix, such as a specific example of the logical address seg:RF:imm of the target element mentioned above.
  • R256 represents the starting register in the first group of registers.
  • the first group of registers is used to store the matrix elements of a row of the first factor matrix that are involved in the dot product operation (the multiplication and accumulation of corresponding elements in matrix A and matrix B). In one embodiment, the first set of registers and the second set of registers are located in the same thread, which can reduce the power consumption and time of data transmission during the calculation process.
  • the first product register representation may correspond to one or more product registers.
  • the number of one or more product registers is related to the merge calculation mode and the number of columns of the second tensor, as detailed below.
  • the product registers of the different threads form a result tensor with the same number of rows as the first tensor and the same number of columns as the second tensor. For example, 256 threads can form a resulting tensor with 256 rows.
  • Each thread's product register file includes part or all of each row of the result tensor.
  • each thread's product register file could correspond to one row of the result tensor.
  • each thread's product register can correspond to part of a row of the result tensor.
  • the number of product registers within the threads in the second thread set may be variable.
  • the number of product registers depends on the execution conditions of the first tensor multiply instruction.
  • the execution condition determines access to the columns in the second tensor. For example, in some situations, only a portion of all product registers within the threads in the second thread set may be used. In other cases, another portion or all of the product registers within the threads in the second thread set are used. If the first column in the second tensor is not accessed, the first column in the second tensor does not participate in the matrix multiplication calculation.
  • the first tensor multiply instruction may be issued two or more times.
  • the first time the first tensor instruction is issued, it is issued to the memory system: the matrix multiply (mm) instruction can be fetched from the instruction cache or instruction section of the accelerator 200 and sent to the pipeline unit of the accelerator 200; after decoding, it is issued as a regular memory access instruction whose access address is seg:RF:imm, such as ur4:rf290:0x00.
  • the first tensor multiplication instruction is issued the first time in the form of a memory access instruction, to obtain the column data or row data of the second tensor; the second and subsequent times it is issued in the form of a mathematical calculation instruction, for performing the calculation of the results of each column in the row of the third tensor.
  • the accelerator 200 can read the registers corresponding to the data blocks of matrix C and matrix A, such as R0-R255 and R256-R257, then read the data block of the second factor matrix B obtained during the first issue, perform a dot product operation, and write the intermediate calculation result into the corresponding register, such as one of R0-R255.
  • the execution unit of the first thread can read the data in one row of the first factor matrix from the first set of registers only once, and reuse it during subsequent dot product operations. It is understood that in some cases, the range of product register usage for the third tensor may exceed the range of the register file within a single thread.
  • data block registers R0-R255 are not enough to store one row of product data in the third tensor.
  • one row of product data in the third tensor requires 300 data registers to store.
  • the accelerator 200 may determine whether the product register usage range for the third tensor exceeds the range of the register file within a single thread based on the first product register representation. If it is determined that the product register usage range for the third tensor exceeds the range of the register file within a single thread, calculation operations or memory access operations that exceed the range of the register file are ignored and an error is reported.
  • the accelerator 200 may check the token status corresponding to the first factor register. If the token status indicates that the data of the first tensor has been stored in the first factor register, a mathematical calculation instruction is issued; otherwise the issue queue is stalled until the data of the first tensor has been stored in the first factor register.
  • each thread performing parallel mm calculations involves substantially the same matrix element data block of the second factor matrix B
  • each data block of the second factor matrix B is broadcast to all threads for parallel execution.
  • the calculation task for a piece of data can be completed in n steps. The calculation starts from the 0th column of the second factor matrix B and of the product matrix C, and moves back one column at a time until all columns have been cycled through.
  • Each thread can specify an independent column address for the mm instruction, and the data retrieved from each column is broadcast to all threads for calculation.
  • the data in a column of the second factor matrix B may come from L1 cache, L2 cache or off-chip memory.
  • data in a column of the second factor matrix can be broadcast in parallel to execution units in multiple (eg, the same number or half the number of rows of the first factor matrix) threads and reused. In this way, data transmission between different storage devices can be reduced, thereby reducing the time caused by data transmission during matrix multiplication calculations.
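The broadcast-and-reuse pattern described above can be sketched as follows: one column of B is fetched once per step and consumed by every simulated thread, each of which holds its own row of A in "registers". The sketch shows only the data-flow structure, not the hardware broadcast mechanism; a fetch counter is added to make the reuse visible.

```python
def broadcast_column_matmul(A, B):
    """Each column of B is fetched once and broadcast to all 'threads'
    (one per row of A), which reuse it for their dot products."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0] * n for _ in range(m)]
    fetches = 0
    for j in range(n):
        column = [B[t][j] for t in range(k)]   # fetched once from memory...
        fetches += 1
        for i in range(m):                     # ...and consumed by every thread
            C[i][j] = sum(a * b for a, b in zip(A[i], column))
    return C, fetches

Cb, fetches = broadcast_column_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Without the broadcast, each of the m threads would fetch every column itself (m×n fetches); with it, only n column fetches are needed.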
  • the first set of threads broadcasts the second set of factors in the second tensor to the second set of threads based on the memory logical address for the second tensor, as described above.
  • in response to receiving the second factor set, the first thread in the second thread set performs a dot product operation on the first factor set in the first row of the first tensor and the second factor set based on the first factor register representation, to generate a first dot product set in the first row of the third tensor.
  • the dot product operation can include multiplication and addition operations.
  • the first factor register representation is for example R256 and the memory logical address is seg:RF:imm such as ur4:rf290:0x00.
  • the number of product registers within each thread in the second thread set is variable and is controlled specifically by the execution condition of the tensor multiplication instruction, which controls access to each column in the second tensor. If a certain column has not been accessed, that column does not participate in the matrix multiplication calculation, so the product register corresponding to that column does not exist.
  • the matrix multiplication is not completed in a single pass; rather, the size of the registers, the types of the matrix elements in the first factor matrix A, the second factor matrix B, and the product matrix C, the calculation capability of the computing units of the accelerator 200, and other factors are comprehensively considered, and the multiplication is executed multiple times to completion.
  • the first factor register set within a single thread includes at least part of the data in a single row of the first tensor; the first factor register set includes one or more registers, the specific number of which is determined by the length supported by a single round of the tensor multiplication instruction, for example 2 registers, each register including one or more data elements; for example, for the int8 data type, 2 registers include 8 data elements.
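The 2-registers/8-elements arithmetic can be made explicit. The 32-bit register width below is an assumption inferred from the int8 example (2 × 32 bits / 8 bits = 8 elements); it is not stated in the disclosure.

```python
REGISTER_BITS = 32          # assumed register width, not stated in the text

def elements_per_registers(num_registers, element_bits):
    """How many data elements fit in the given number of registers."""
    return num_registers * REGISTER_BITS // element_bits

n_int8 = elements_per_registers(2, 8)    # 8 int8 elements in 2 registers
```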
  • the number of threads involved in tensor multiplication is proportional to the number of rows of the first tensor. For example, the number of rows of the first tensor can be 256, and the number of threads participating in tensor multiplication can also be 256.
  • the first tensor multiplication instruction may further be, for example, @p1,mm8.sa.ub.R0,ur4:rf290:0x00,R256. Its fields are the same as or similar to those of @p1,mm.R0,ur4:rf290:0x00,R256 and will not be repeated here.
  • mm8 indicates that the data type of the elements involved in matrix multiplication is 8 bits
  • sa indicates that the element data in the first factor matrix A associated with register R256 is signed int8
  • ub indicates that the element data in the second factor matrix B associated with the logical address ur4:rf290:0x00 is unsigned uint8. It can be understood that the types of the matrix elements in the first factor matrix A, the second factor matrix B, and the product matrix C can also be other data types, and this disclosure is not limited thereto.
  • from the first tensor multiplication instruction, the first factor register representation R256 and the memory logical address ur4:rf290:0x00 can be determined; they may, for example, correspond to the first register in the first set of registers and to the reference point of the tensor segment of the matrix B data block, respectively.
  • the first factor set, such as A[1][1], is stored in the first register, and the reference point of the tensor segment of matrix B corresponds to the first data block of matrix B, such as B[1][1].
  • the first point product set A[1][1] ⁇ B[1][1] in the first row of the third tensor of the product matrix C can be obtained.
  • the first set of factors may include A[1][1] and A[1][2]
  • the second set of factors may include B[1][1] and B[2][1]
  • the first point product set can include A[1][1] ⁇ B[1][1]+A[1][2] ⁇ B[2][1].
  • the first set of factors may be A[1][1], A[1][2], and A[1][3]
  • the second set of factors may include B[1][1] , B[2][1] and B[3][1]
  • the first point product set can include A[1][1] ⁇ B[1][1]+A[1][2] ⁇ B [2][1]+A[1][3] ⁇ B[3][1].
  • the present disclosure does not limit the ranges of the first factor set, the second factor set, and the first dot product set. These ranges can be flexibly configured by the programmer, based on factors such as the data type of the matrix elements and the register capacity, when programming the matrix multiplication, for example configured automatically by setting the data type in the tensor multiplication instruction.
  • a single thread can perform parallel computations on multiple product elements in a row of the product matrix C.
  • the first thread in the second thread set can calculate in parallel the respective first product results A[1][1]×B[1][1] through A[1][1]×B[1][8] of C[1][1]-C[1][8].
  • the first point product is accumulated by the first thread in the second set of threads into a first set of product registers corresponding to the first product register representation.
  • the first thread can accumulate the dot product result of the above calculation into the corresponding first set of product registers, such as R0-R7 registers.
  • the range of registers included in the first group of product registers can be flexibly configured by the mm instruction. By decomposing the matrix and allocating threads by rows, multiple threads can process multiple rows of the matrix tensor in parallel, thereby speeding up the processing efficiency of matrix multiplication.
  • programmers know the row-column structure of matrix tensors and the thread status in the accelerator when programming, they can flexibly use threads to process matrix multiplications in parallel, thereby improving programming flexibility.
  • method 800 further includes, in response to receiving the second factor set, a second thread in the second thread set performing, based on the first factor register representation, a dot product operation on the third factor set in the second row of the first tensor and the second factor set to generate a second dot product set in the second row of the third tensor; and the second thread accumulating the second dot product set into a second set of product registers corresponding to the first product register representation.
  • the first tensor multiplication instruction may be represented as @p1,mm8.sa.
  • the second thread in the second thread set also includes a first set of registers, such as R256-R257, for storing the third factor set in the second row of the first factor matrix, and a second set of registers, such as R0-R255, for storing the second dot product set of the second row of the third tensor.
  • the first thread and the second thread actually perform mm calculations in parallel for the first and second rows of the first factor matrix A and, correspondingly, for the first and second rows of the first product matrix C, so parallel calculation can greatly reduce calculation time.
  • since there is a fixed correspondence between each thread and each matrix row, this also avoids the overhead caused by multiple threads dynamically allocating matrix multiplication tasks according to their load (for example, one thread calculating two matrix rows while another thread calculates only a portion of a matrix row).
  • some threads may become idle.
  • if the PE unit includes 64 threads and the product matrix C has only 16 rows, then assigning only one thread per row leaves 48 threads idle.
  • multiple threads (e.g., the first thread and the third thread in the second thread set) can be assigned to jointly calculate one matrix row.
  • the first tensor multiply instruction also includes a first merge calculation mode indication, such as KA2.
  • KA2 indicates that two threads participate in the calculation of a matrix row.
  • the first combined calculation mode indication may include other indications such as KA1, KA3, KA4, etc., and the only difference lies in the number following KA.
  • KA1 means that a single thread participates in the calculation of one matrix row
  • KA3 means that three threads participate in the calculation of one matrix row, and so on.
  • in the absence of a first merge calculation mode indication, a single thread may default to performing the calculation of one matrix row.
  • an illustrative example of the first tensor multiplication instruction received by the first thread and the third thread may be, for example, @p1,mm8.KA2.sa.sb.R0,ur4:rf290:0x00,R256. It can be understood that KA1-KA4 are only one way of representing the first merge calculation mode indication, and other characters or other representations may be used to express it.
  • the first thread and the third thread in the second thread set jointly calculate the product elements in the same row in the product matrix C.
  • the first thread is used to calculate the first set of product elements C[1][1]-C[1][127]
  • the third thread is used to calculate the second set of product elements C[1][128]-C[1][256]
  • the first thread is used to calculate the first set of product elements C[1][1], C[1][3], C[1][5]...C[1][255]
  • the third thread is used to calculate the second set of product elements C[1][2], C[1][4], C[1][6]...C[1][256].
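The two KA2-style partitions described above (contiguous halves versus interleaved columns) can be sketched as follows. The helper names are hypothetical, and 0-based indices are used here rather than the text's 1-based matrix notation.

```python
def split_columns_contiguous(num_cols, num_threads, thread_id):
    """Each cooperating thread takes one contiguous block of product columns."""
    block = num_cols // num_threads
    return list(range(thread_id * block, (thread_id + 1) * block))

def split_columns_interleaved(num_cols, num_threads, thread_id):
    """Each cooperating thread takes every num_threads-th product column."""
    return list(range(thread_id, num_cols, num_threads))
```

With `num_threads=2`, thread 0 and thread 1 stand in for the text's first and third threads jointly producing one row of C.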
  • the first thread performs a dot product operation on the first factor set in the first row and the second factor set in the second tensor to generate the first dot product set in the first row of the third tensor, and accumulates the first dot product set into the second set of registers in the first thread.
  • the third thread performs a dot product operation on the first factor set and the fourth factor set of the second tensor, based on the first merge calculation mode indication and the first factor register representation, to generate the third dot product set in the first row of the third tensor.
  • the fourth factor set is different from the second factor set
  • the third point product set is different from the first point product set.
  • the third thread further accumulates the third point product into a third set of product registers corresponding to the first product register representation, the third set of product registers being located in the third thread.
  • the first merge calculation mode indication can be used in conjunction with the embodiment described above with respect to FIG. 8, so various aspects described with respect to FIG. 8 will not be repeated here.
  • the first tensor multiply instruction also includes a second merge calculation mode indication, such as KB2.
  • KB2 indicates that two threads jointly participate in the calculation of each product element in the product matrix.
  • the second combined calculation mode indication may include other indications such as KB1, KB3, KB4, etc., and the only difference lies in the number following KB.
  • KB1 indicates that a single thread participates in the calculation of each product element in the product matrix
  • KB3 indicates that three threads jointly participate in the calculation of each product element in the product matrix, and so on.
  • in the absence of a second merge calculation mode indication, a single thread may default to performing the calculation of one matrix row.
  • an illustrative example of the first tensor multiplication instruction received by the first thread and the fourth thread in the second thread set may be, for example, @p1,mm8.KB2.sa.sb.R0,ur4:rf290:0x00,R256. It can be understood that KB1-KB4 are only one way of representing the second merge calculation mode indication, and other characters or other representations may be used to express it.
  • the first thread and the fourth thread in the second thread set jointly participate in the calculation of each product element in the product matrix.
  • the first thread can calculate A[1][1]×B[1][1]
  • the fourth thread can compute A[1][2]×B[2][1] in parallel with the first thread, and the first and fourth threads then add.
  • the fourth thread sends its product to the first thread, and the first thread performs an addition operation to obtain the dot product result.
  • the first thread accumulates the dot product result into the product register.
  • the first thread and the fourth thread can operate similarly to obtain the first point product set.
  • the first thread may default to sending the product to the fourth thread.
  • the first thread can calculate A[1][1]×B[1][1]+A[1][2]×B[2][1]+A[1][3]×B[3][1]+A[1][4]×B[4][1]
  • the first thread can calculate A[1][1]×B[1][1]+A[1][2]×B[2][1]
  • the fourth thread in the second thread set can calculate A[1][3]×B[3][1]+A[1][4]×B[4][1] in parallel with the first thread
  • the first thread then performs addition processing.
  • the fourth thread sends the dot product to the first thread, and the first thread performs an addition operation to get the dot product result.
  • the first thread then accumulates the dot product result into the product register.
  • the first thread may default to sending the dot product to the fourth thread, and the fourth thread performs the addition of the dot product and accumulates the dot product result into the product register of the fourth thread.
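The KB2-style cooperation in the bullets above splits the k dimension of a single dot product between two threads, each computing a partial sum before one designated thread adds them. The sketch below models only the data flow, under the assumption of an even two-way split; it does not model the hardware's inter-thread messaging.

```python
def kb2_dot_product(a_row, b_col):
    """Two conceptual threads split the k dimension of one dot product:
    'thread 1' takes the first half, 'thread 4' the second half, then the
    partials are added (modeling thread 4 sending its partial to thread 1)."""
    k = len(a_row)
    partial_t1 = sum(a * b for a, b in zip(a_row[:k // 2], b_col[:k // 2]))
    partial_t4 = sum(a * b for a, b in zip(a_row[k // 2:], b_col[k // 2:]))
    return partial_t1 + partial_t4
```

For k=4 this reproduces the example in the text: A[1][1]×B[1][1]+A[1][2]×B[2][1] computed by one thread and A[1][3]×B[3][1]+A[1][4]×B[4][1] by the other.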
  • the first thread performs a dot product operation on the first factor set in the first row and the second factor set in the second tensor based on the second merge calculation mode indication and the first factor register representation, to generate the first point product in the first row of the third tensor and accumulate the first point product into the second set of registers of the first thread.
  • based on the second merge calculation mode indication and the first factor register representation, the fourth thread performs a dot product operation on the fifth factor set in the first row and the sixth factor set of the second tensor to generate the fourth dot product set in the first row of the third tensor; the fifth factor set is different from the first factor set, the sixth factor set is different from the second factor set, and the fourth dot product set is different from the first dot product set.
  • the fourth thread further accumulates the fourth dot product set into the first set of product registers corresponding to the first product register representation.
  • the first merge calculation mode indication can be used in combination with the second merge calculation mode indication. That is, not only can each row of the product matrix be divided into different parts calculated by different thread groups, but each dot product element within each row can also be calculated by different threads. For example, for C[1][1]-C[1][8], C[1][1]-C[1][4] can be calculated by the first group of threads, while C[1][5]-C[1][8] can be calculated by the second group of threads.
  • the second factor matrix is usually read column by column, and the dot product operation is performed with the row elements of the first factor matrix.
  • the second factor matrix stored in a memory such as a DDR is typically physically stored row by row. Therefore, when a thread reads an element of the second factor matrix, such as B[1][1], from the memory, it usually, based on the principle of spatial locality, also reads some physically adjacent elements into the L1 cache at the same time.
  • B[1][2], B[1][3], B[1][4] and B[1][1] are read together into the L1 cache.
  • the calculation may actually need elements of the same column, such as B[1][1] and B[2][1].
  • it then takes several more clock cycles to read B[2][1] from the memory, together with B[2][2], B[2][3] and B[2][4], which are not needed in this calculation, into the L1 cache.
  • B[1][2], B[1][3], B[1][4], B[2][2], B[2][3] and B[2][4] are typically discarded due to the dynamic flushing rules of the L1 cache.
  • a transpose indication is further set in the tensor multiplication instruction.
  • the first tensor multiply instruction also includes a transpose indication.
  • a further illustrative example of a first tensor multiplication instruction is @p1,mm8.KA1.T1.sa.sb R0,ur4:rf290:0x00,R256, where T1 indicates that the second factor matrix B needs to be transposed.
  • T0 may be used in tensor multiplication instructions to indicate that the second factor matrix B does not need to be transposed.
  • the first thread in the second thread set can therefore perform, based on the transpose indication and the first factor register representation, a dot product operation on the first factor set in the first row of the first tensor and the second factor set in the second tensor to generate the first dot product set in the first row of the third tensor.
  • the first set of threads loads the factors of multiple rows in the second tensor into the cache based on the transpose indication and the memory logical address.
  • the first thread set can load B[1][1]-B[1][4], B[2][1]-B[2][4], B[3][1]-B[3][4] and B[4][1]-B[4][4] all into the L1 cache.
  • the first set of threads selects factors from the multiple rows of factors by column, such as selecting B[1][1], B[2][1], B[3][1] and B[4][1], to form a second factor set and broadcast it to the second set of threads.
  • the second set of threads then performs a dot product operation on the first set of factors and the second set of factors in the first row based on the first factor register representation to generate a first set of dot products in the first row of the third tensor.
  • B[1][2]-B[1][4], B[2][2]-B[2][4], B[3][2]-B[3][4] and B[4][2]-B[4][4] are directly retained in the cache without being dynamically flushed.
  • the first thread in the second thread set therefore does not need to read B[1][2]-B[1][4], B[2][2]-B[2][4], B[3][2]-B[3][4] and B[4][2]-B[4][4] from the memory again when performing subsequent matrix calculations.
  • B[1][1]-B[1][4], B[2][1]-B[2][4], B[3][1]-B[3][4] and B[4][1]-B[4][4] are taken as examples to illustrate the transpose indication, but it can be understood that this is only an illustration.
  • the range of the second factor matrix B used for transposition can vary. For example, when the second factor matrix B has another number of rows, such as 256 rows, the cache lines of all rows can be loaded into the cache and released only after the data in each cache line has been used for matrix multiplication calculations. In this way, the time spent repeatedly reading data from memory into the L1 cache can be greatly reduced.
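The transpose-indication behavior described above can be illustrated with a minimal sketch: rows of a row-major B are fetched once into a cache-like buffer, and column slices are then selected and broadcast, so the off-column elements are reused rather than re-fetched and discarded. A dict stands in for the L1 cache here, and all names are illustrative assumptions.

```python
def load_rows_to_cache(B, row_ids):
    """Model one memory fetch per row: the whole cache line for each row
    of B is kept resident instead of being dynamically flushed."""
    return {r: list(B[r]) for r in row_ids}

def select_column(cache, row_ids, col):
    """Gather one column (the 'second factor set') purely from the cache,
    with no additional memory traffic."""
    return [cache[r][col] for r in row_ids]
```

After `load_rows_to_cache`, every subsequent `select_column` call for a different column hits only the cached rows, mirroring the reuse argument in the text.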
  • the principles and examples of matrix multiplication according to embodiments of the present disclosure are mainly described in the form of two-dimensional tensors.
  • the present disclosure is not limited to matrix multiplication calculations in the form of two-dimensional tensors, but may include calculations of multiplications or convolutions of one-dimensional tensors or more-dimensional tensors.
  • a one-dimensional tensor is equivalent to a two-dimensional tensor with one dimension equal to 1, so the details are not repeated here.
  • the dimensions other than k dimensions in the first factor matrix A and the second factor matrix B can be reduced and decomposed to obtain an equivalent two-dimensional matrix.
  • the k dimension is usually not decomposed, because in order to perform matrix multiplication, the number of columns k in the first factor matrix A and the number of rows k in the second factor matrix B need to be equal.
  • the first factor tensor A is a three-dimensional tensor of m×x×k
  • the second factor tensor B is a four-dimensional tensor of k×n×y×z, where k, m, n, x, y and z all represent positive integers.
  • the first factor tensor A can be converted into a two-dimensional tensor of the form (m×x, k). That is, the tensor is cut along the x dimension, and the resulting x two-dimensional tensors of size m×k are concatenated row by row to obtain the two-dimensional equivalent matrix A'.
  • m×x threads can be used for parallel computation.
  • the second factor tensor can be cut into y×z two-dimensional matrices of size k×n, which are spliced column by column in sequence to obtain a two-dimensional equivalent matrix B'.
  • the multiplication (convolution) of a three-dimensional tensor and a four-dimensional tensor is used as an example to illustrate matrix dimensionality reduction, this is only illustrative and does not limit the scope of the present disclosure. Dimensionality reduction of other multi-dimensional mm can be handled similarly and will not be repeated here.
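The dimensionality reduction of the first factor tensor described above amounts to a row-major flatten of all dimensions except k. The sketch below assumes the 3-D tensor is laid out as x slices of m×k matrices; that layout is an assumption for illustration, not a statement about the patent's storage order.

```python
def reduce_to_2d(tensor_3d):
    """Flatten tensor_3d[x][m][k] into an (m*x, k) equivalent matrix A'
    by concatenating the x slices of m rows each, row by row, so that
    each of the m*x rows can be assigned to one thread."""
    return [row for slice_2d in tensor_3d for row in slice_2d]
```

The same idea applied column-wise to the y×z slices of B yields the two-dimensional equivalent matrix B' mentioned in the text.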
  • mm after dimensionality reduction please refer to the detailed description of mm mentioned above in Figure 8, which will not be described again here.
  • Figure 9 shows a schematic block diagram of an electronic device 900 according to one embodiment of the present disclosure.
  • the electronic device 900 may be used to perform the method 800 shown in FIG. 8 , and therefore various aspects described with respect to FIG. 8 may be selectively applicable to the electronic device 900 .
  • the electronic device 900 includes a receiving unit 902, a broadcast unit 903, a generating unit 904, and a storage unit 906.
  • the receiving unit 902 is configured to receive a first tensor multiplication instruction for a first thread set of the accelerator, the first tensor multiplication instruction including a first thread indication for the first thread set, a first factor register representation for the first tensor, a memory logical address for the second tensor, and a first product register representation for the third tensor.
  • the broadcast unit 903 is configured such that the first thread set broadcasts, based on the memory logical address for the second tensor, a second factor set in the second tensor to a second thread set, the second thread set being different from the first thread set.
  • the generation unit 904 is configured such that the first thread in the second thread set performs, based on the first factor register representation, a dot product operation on the first factor set in the first row of the first tensor and the second factor set, to generate the first dot product set in the first row of the third tensor.
  • Storage unit 906 is configured to accumulate the first point product by the first thread into the first set of product registers corresponding to the first product register representation.
  • each thread includes a first set of registers and a second set of registers, wherein the first set of registers is used to store at least a portion of the data in a row of the first factor matrix, and the second set of registers is used to store a row of data of the product matrix.
  • the data in a column of the second factor matrix may come from on-chip memory, L1 cache, or off-chip memory.
  • data in a column of the second factor matrix may be broadcast in parallel to execution units in multiple (eg, the same number or half the number of rows of the first factor matrix) threads and reused. In this way, data transmission between different storage devices can be reduced, thereby reducing the time caused by data transmission during matrix multiplication calculations.
  • the generation unit 904 is further configured such that, in response to receiving the second factor set, a second thread in the second thread set performs, based on the first factor register representation, a dot product operation on the third factor set in the second row of the first tensor and the second factor set, to generate the second dot product set in the second row of the third tensor.
  • the storage unit 906 is further configured to accumulate, by the second thread, the second dot product set into a second set of product registers corresponding to the first product register representation.
  • the first tensor multiply instruction further includes a first merge calculation mode indication.
  • the generation unit 904 is further configured to: based on the first merge calculation mode indication and the first factor register representation, perform, by the first thread, a dot product operation on the first factor set and the second factor set in the first row, to generate the first dot product set in the first row of the third tensor.
  • the generation unit 904 is further configured such that a third thread in the first thread set performs, based on the first merge calculation mode indication and the first factor register representation, a dot product operation on the first factor set and the fourth factor set of the second tensor, to generate a third dot product set in the first row of the third tensor; the fourth factor set is different from the second factor set, and the third dot product set is different from the first dot product set.
  • Storage unit 906 is further configured to accumulate, by the third thread, a third point product into a third set of product registers corresponding to the first product register representation.
  • the first tensor multiply instruction further includes a second merge calculation mode indication.
  • the generation unit 904 is further configured to perform, by the first thread and based on the second merge calculation mode indication and the first factor register representation, a dot product operation on the first factor set in the first row and the second factor set in the second tensor, to generate the first dot product set in the first row of the third tensor.
  • the generation unit 904 is further configured such that the fourth thread in the second thread set performs, based on the second merge calculation mode indication and the first factor register representation, a dot product operation on the fifth factor set and the sixth factor set of the second tensor, to generate a fourth dot product set in the first row of the third tensor; the fifth factor set is different from the first factor set, the sixth factor set is different from the second factor set, and the fourth dot product set is different from the first dot product set.
  • Storage unit 906 is further configured to accumulate, by the fourth thread, a fourth point product to the first set of product registers corresponding to the first product register representation.
  • the first tensor multiply instruction also includes a transpose indication.
  • the generation unit 904 is further configured to: based on the transpose indication and the first factor register representation, perform, by the first thread, a dot product operation on the first factor set in the first row and the second factor set in the second tensor, to generate the first dot product set in the first row of the third tensor.
  • the generation unit 904 is further configured to: load the factors of multiple rows in the second tensor into the cache based on the transpose indication and the memory logical address; select factors by column from the factors of the multiple rows to form a second factor set; and perform, by the first thread and based on the first factor register representation, a dot product operation on the first factor set in the first row and the second factor set, to generate the first dot product set in the first row of the third tensor. In one embodiment, unselected factors in the multiple rows are retained in the L1 cache until they are selected for a matrix multiplication calculation.
  • the first thread set provides the second factor set corresponding to the memory logical address in a broadcast form in parallel to all threads in the second thread set.
  • the memory logical address includes segment reference data and offset data
  • the segment reference data represents the starting address of the second tensor
  • the offset data represents the offset amount of the second tensor in each of multiple dimensions.
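The segment-plus-offset logical address described above can be illustrated with a minimal sketch. The field names and the stride-based combination are invented for illustration; the patent only states that a segment reference gives the tensor's start and offset data gives per-dimension offsets.

```python
def logical_to_linear(segment_base, offsets, strides):
    """Combine a segment base (start of the second tensor) with
    per-dimension offsets and assumed row-major strides to locate
    one element's linear address."""
    return segment_base + sum(o * s for o, s in zip(offsets, strides))
```

For example, with a base of 100, offsets (2, 3), and strides (8, 1), the element's linear address is 100 + 2*8 + 3*1 = 119.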


Abstract

A method executed by an accelerator, and an electronic device. The method comprises: receiving a first tensor multiplication instruction for a first thread set of an accelerator (802); a first thread set broadcasting a second factor set in a second tensor to a second thread set on the basis of a logical memory address for the second tensor; and a first thread in the second thread set performing a dot product operation on a first factor set and the second factor set on the basis of a first factor register representation, so as to generate a first dot product set in a first row of a third tensor. A matrix is decomposed and threads are allocated by row, such that a plurality of threads can process a plurality of rows of a matrix tensor in parallel, thereby improving the processing efficiency of matrix multiplication. In addition, a programmer knows the row-column structure of a matrix tensor and the thread conditions in an accelerator during programming, and can therefore flexibly use the threads to process matrix multiplication in parallel, thereby improving programming flexibility.

Description

Method Executed by Accelerator, and Electronic Device

Technical Field

Embodiments of the present disclosure relate generally to the field of electronics, and more specifically to a method performed by an accelerator and to an accelerator.

Background

Parallel high-performance multi-threaded multi-core processing systems, such as graphics processing units (GPUs), process data much faster than in the past. These processing systems can break complex calculations down into smaller tasks and process them in parallel across multiple cores to increase processing efficiency and reduce processing time.

In some situations, multi-core processors such as GPUs are particularly advantageous for processing tensors containing large amounts of data of the same or similar form. In the computer field, tensor data usually denotes one-dimensional or multi-dimensional array data; for example, image data is conventional two-dimensional tensor data that can be represented by a two-dimensional array. As another example, a color image is three-dimensional array data: in addition to a two-dimensional pixel array of width and height, a color image also includes a red-green-blue (RGB) channel dimension. Processing a tensor such as a two-dimensional array can include, for example, matrix multiplication. Conventional matrix multiplication based on an internal accelerator such as a GPU is usually opaque to programmers, so programmers generally do not understand how the hardware performs matrix multiplication and therefore cannot optimize the matrix multiplication calculation for the hardware, which generally lowers the efficiency of program execution and of tensor processing.

Summary

Embodiments of the present disclosure provide a method for execution by an accelerator and an electronic device.

In a first aspect, a method performed by an accelerator is provided. The method includes: receiving a first tensor multiplication instruction for a first thread set of the accelerator, the first tensor multiplication instruction including a first thread indication for the first thread set, a first factor register representation for a first tensor, a memory logical address for a second tensor, and a first product register representation for a third tensor; the first thread set broadcasting, based on the memory logical address for the second tensor, a second factor set in the second tensor to a second thread set, the second thread set being different from the first thread set; a first thread in the second thread set performing, based on the first factor register representation, a dot product operation on a first factor set in a first row of the first tensor and the second factor set, to generate a first dot product set in a first row of the third tensor; and accumulating, by the first thread, the first dot product set into a first set of product registers corresponding to the first product register representation. By decomposing the matrix and allocating threads by row, multiple threads can process multiple rows of the matrix tensor in parallel, thereby speeding up matrix multiplication. In addition, because programmers know the row-column structure of the matrix tensors and the thread status in the accelerator when programming, they can flexibly use threads to process matrix multiplication in parallel, thereby improving programming flexibility.

In a possible implementation, the first factor set includes at least part of the factor data in the first row of the first tensor. The second factor set includes at least part of the factor data in the second tensor. The first dot product set includes at least part of the product data in the first row of the third tensor.

In a possible implementation, each thread includes a first set of registers and a second set of registers, where the first set of registers is used to store at least part of the data in a row of the first factor matrix, and the second set of registers is used to store a row of data of the product matrix. The data in a column of the second factor matrix may be stored in on-chip memory, an L1 cache, or off-chip memory. In this way, during the execution of the matrix multiplication, the execution unit of the first thread can read the data of a row of the first factor matrix from the first set of registers only once, and reuse it in subsequent dot product operations on each column of the second factor matrix. In addition, the data in a column of the second factor matrix can be broadcast in parallel to the execution units of multiple threads (for example, the same number as, or half the number of, the rows of the first factor matrix) and reused. In this way, data transfer between different storage devices can be reduced, thereby reducing the time delay caused by data transfer during matrix multiplication calculations.

In a possible implementation, the method further includes: in response to receiving the second factor set, a second thread in the second thread set performing, based on the first factor register representation, a dot product operation on a third factor set in a second row of the first tensor and the second factor set, to generate a second dot product set in a second row of the third tensor; and accumulating, by the second thread, the second dot product set into a second set of product registers corresponding to the first product register representation.

In a possible implementation, the first tensor multiplication instruction further includes a first merge calculation mode indication. Generating the first dot product set in the first row of the third tensor includes: based on the first merge calculation mode indication and the first factor register representation, performing, by the first thread, a dot product operation on the first factor set in the first row and the second factor set, to generate the first dot product set in the first row of the third tensor.

In a possible implementation, the method further includes: based on the first merge calculation mode indication and the first factor register representation, performing, by a third thread in the first thread set, a dot product operation on the first factor set and a fourth factor set of the second tensor, to generate a third dot product set in the first row of the third tensor, the fourth factor set being different from the second factor set and the third dot product set being different from the first dot product set; and accumulating, by the third thread, the third dot product set into a third set of product registers corresponding to the first product register representation.

In a possible implementation, the first tensor multiplication instruction further includes a second merge calculation mode indication. Generating the first dot product set in the first row of the third tensor includes: based on the second merge calculation mode indication and the first factor register representation, performing, by the first thread, a dot product operation on the first factor set in the first row and the second factor set in the second tensor, to generate the first dot product set in the first row of the third tensor.

In a possible implementation, the method further includes: based on the second merge calculation mode indication and the first factor register representation, performing, by a fourth thread in the second thread set, a dot product operation on a fifth factor set and a sixth factor set of the second tensor, to generate a fourth dot product set in the first row of the third tensor, the fifth factor set being different from the first factor set, the sixth factor set being different from the second factor set, and the fourth dot product set being different from the first dot product set; and accumulating, by the fourth thread, the fourth dot product set into the first set of product registers corresponding to the first product register representation.

In a possible implementation, the first tensor multiplication instruction further includes a transpose indication. Generating the first dot product set in the first row of the third tensor includes: based on the transpose indication and the first factor register representation, performing, by the first thread, a dot product operation on the first factor set in the first row and the second factor set in the second tensor, to generate the first dot product set in the first row of the third tensor.
In a possible implementation, performing, by the first thread and based on the transpose indication and the first factor register representation, a dot product operation on the first factor set in the first row and the second factor set in the second tensor to generate the first dot product set in the first row of the third tensor includes: loading factors of multiple rows of the second tensor into a cache based on the transpose indication and the memory logical address; selecting factors column-wise from the factors of the multiple rows to form the second factor set; and performing, by the first thread and based on the first factor register representation, a dot product operation on the first factor set in the first row and the second factor set, to generate the first dot product set in the first row of the third tensor.
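The load-rows-then-select-columns behavior under the transpose indication can be sketched as follows. This is a minimal Python model, not the accelerator's actual cache mechanism: the cache is a plain dictionary keyed by row index, and all names are illustrative.

```python
def load_rows_to_cache(tensor_b, row_indices):
    # Load whole rows of the second tensor into a cache keyed by row index.
    return {i: list(tensor_b[i]) for i in row_indices}

def select_column(cache, col):
    # Select one factor per cached row, column-wise, to form a factor set.
    # Factors not selected this time simply stay in the cache for later use.
    return [cache[i][col] for i in sorted(cache)]

b = [[1, 2, 3],
     [4, 5, 6]]
cache = load_rows_to_cache(b, [0, 1])
assert select_column(cache, 0) == [1, 4]   # column 0 of b
assert select_column(cache, 2) == [3, 6]   # later columns reuse the cache
```

This mirrors the described flow: rows are fetched once, and each column of the transposed operand is assembled from the cached rows without reloading them.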
In a possible implementation, unselected factors of the multiple rows are retained in the cache until those unselected factors are selected for the matrix multiplication computation.
In a possible implementation, the first thread set provides the second factor set corresponding to the memory logical address, by broadcast and in parallel, to the compute units of all threads in the second thread set, without providing it to the registers of those threads.
In a possible implementation, the memory logical address includes segment base data and offset data, the segment base data representing a starting address in the second tensor, and the offset data representing offsets along each of the multiple dimensions of the second tensor.
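A logical address of this shape can be modeled as a segment start address plus per-dimension offsets, each scaled by the stride of its dimension. The sketch below assumes a row-major layout with the last dimension contiguous; the function and parameter names are illustrative, not the instruction's actual fields.

```python
def logical_to_flat(segment_base, offsets, dim_sizes):
    """Flatten per-dimension offsets against a segment start address.

    `dim_sizes` lists the extent of each dimension; the last dimension is
    assumed contiguous (row-major). Returns a flat element index.
    """
    assert len(offsets) == len(dim_sizes)
    addr = segment_base
    stride = 1
    for off, size in zip(reversed(offsets), reversed(dim_sizes)):
        assert 0 <= off < size
        addr += off * stride
        stride *= size
    return addr

# A 4x8 tensor starting at element 100: offset (2, 3) lands at 100 + 2*8 + 3.
assert logical_to_flat(100, (2, 3), (4, 8)) == 119
```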
In a possible implementation, the first product register representation corresponds to one or more product registers, the number of which is related to the merged-computation mode and the number of columns of the second tensor. The product registers of different threads constitute a result tensor, with each thread's product registers holding part or all of a row of the result tensor; the result tensor has the same number of rows as the first tensor and the same number of columns as the second tensor.
In a possible implementation, the number of product registers within each thread of the second thread set is variable and depends on the execution condition of the first tensor multiplication instruction, the execution condition determining access to the columns of the second tensor; if a first column of the second tensor is not accessed, that column does not participate in the matrix multiplication computation.
In a possible implementation, the first tensor multiplication instruction is issued multiple times: it is first issued as a memory instruction to fetch column data or row data of the second tensor; and, in response to the column data or row data of the second tensor having been fetched and the data of the first tensor having been stored in the first factor register, the first tensor multiplication instruction is issued a second or further time as a mathematical computation instruction to compute the per-column results within a row of the third tensor.
In a possible implementation, before the second or subsequent issue, the token status corresponding to the first factor register is checked; if the token status indicates that the data of the first tensor has been stored in the first factor register, the instruction is issued as a mathematical computation instruction; otherwise the issue queue is blocked until the data of the first tensor has been stored in the first factor register.
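The token-gated second issue can be modeled as a simple state check: the math-mode issue only proceeds once the factor register's token marks the first tensor's data as present, otherwise the queue blocks. This is a hypothetical model of the described behavior, not the hardware's actual protocol; all names are illustrative.

```python
class IssueQueue:
    def __init__(self):
        self.token_ready = False   # set when the first tensor's data lands
        self.issued = []

    def store_complete(self):
        # The store path signals that the factor registers are populated.
        self.token_ready = True

    def try_issue_math(self, instr):
        # Second (or later) issue: only proceeds when the token is ready.
        if not self.token_ready:
            return "blocked"       # queue stalls until the data arrives
        self.issued.append(instr)
        return "issued"

q = IssueQueue()
assert q.try_issue_math("tmul") == "blocked"   # data not yet stored
q.store_complete()
assert q.try_issue_math("tmul") == "issued"    # token ready, math issue ok
```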
In a possible implementation, based on the first product register representation, it is determined whether the product register usage range for the third tensor exceeds the range of the register file within a single thread; and if it is determined that the product register usage range for the third tensor exceeds the range of the register file within a single thread, the computation or memory-access operations that exceed the register file range are ignored and an error is reported.
According to a second aspect of the present disclosure, an electronic device is provided. The electronic device includes: a stream processor; a page table device coupled to the stream processor; a memory; and a processing engine unit coupled to the stream processor, the memory, and the page table device and configured to perform the method according to the first aspect.
According to a third aspect of the present disclosure, an electronic device is provided. The electronic device includes: a receiving unit configured to receive a first tensor multiplication instruction for a first thread set of an accelerator, the first tensor multiplication instruction including a first thread indication for the first thread set, a first factor register representation for a first tensor, a memory logical address for a second tensor, and a first product register representation for a third tensor; a broadcast unit configured to broadcast, by the first thread set and based on the memory logical address for the second tensor, a second factor set in the second tensor to a second thread set, the second thread set being different from the first thread set; a generation unit configured to perform, by a first thread in the second thread set and based on the first factor register representation, a dot product operation on a first factor set in a first row of the first tensor and the second factor set, to generate a first dot product set in a first row of the third tensor; and a storage unit configured to accumulate, by the first thread, the first dot product set into a first group of product registers corresponding to the first product register representation.
By decomposing the matrix and assigning threads by row, multiple threads can process multiple rows of a matrix tensor in parallel, which speeds up matrix multiplication. Moreover, because programmers know the row-column structure of the matrix tensors and the thread situation in the accelerator at programming time, they can flexibly use threads to process matrix multiplication in parallel, improving programming flexibility.
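The row-per-thread decomposition described above can be illustrated with a short sketch. This is a hypothetical Python model, not the accelerator's actual instruction semantics: each simulated thread holds one row of the first factor matrix in its "factor registers", each column of the second factor matrix is "broadcast" to all threads once, and every thread accumulates dot products into its own "product registers".

```python
def dot(u, v):
    # Dot product of two equal-length vectors.
    return sum(x * y for x, y in zip(u, v))

def row_per_thread_matmul(a, b):
    """Model of the row-per-thread scheme: thread i owns row i of `a`.

    Each column of `b` is broadcast once to every thread; each thread
    multiplies the column with its locally held row and accumulates the
    result into its own product-register slot. Names are illustrative.
    """
    m, k = len(a), len(a[0])
    n = len(b[0])
    assert len(b) == k
    factor_regs = [list(a[i]) for i in range(m)]    # loaded once, reused
    product_regs = [[0] * n for _ in range(m)]      # per-thread results
    for j in range(n):                 # broadcast column j to all threads
        col = [b[i][j] for i in range(k)]
        for tid in range(m):           # threads run in parallel on hardware
            product_regs[tid][j] += dot(factor_regs[tid], col)
    return product_regs

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
assert row_per_thread_matmul(a, b) == [[19, 22], [43, 50]]
```

Note how each row of `a` is read into a thread's local state once and reused for every column, while each column of `b` is fetched once and shared by all threads — the data-reuse pattern the broadcast mechanism is designed to exploit.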
In a possible implementation, each thread includes a first group of registers and a second group of registers, where the first group of registers stores at least part of the data of one row of the first factor matrix and the second group of registers stores the data of one row of the product matrix. The data of a column of the second factor matrix may come from on-chip memory, the L1 cache, or off-chip memory. In this way, during execution of the matrix multiplication, the execution unit of the first thread can read the data of one row of the first factor matrix from the first group of registers only once and reuse it in subsequent dot product operations. In addition, the data of a column of the second factor matrix can be broadcast in parallel to the execution units of multiple threads (for example, as many threads as the first factor matrix has rows, or half that number) and reused. In this way, data transfers between different storage devices can be reduced, thereby reducing the time spent on data transfer during the matrix multiplication computation.
In a possible implementation, the generation unit is further configured to, in response to receiving the second factor set, perform, by a second thread in the second thread set and based on the first factor register representation, a dot product operation on a third factor set in a second row of the first tensor and the second factor set, to generate a second dot product set in a second row of the third tensor. The storage unit 908 is further configured to accumulate, by the second thread, the second dot product set into a second group of product registers corresponding to the first product register representation.
In a possible implementation, the first tensor multiplication instruction further includes a first merged-computation mode indication. The generation unit is further configured to perform, by the first thread and based on the first merged-computation mode indication and the first factor register representation, a dot product operation on the first factor set in the first row and the second factor set, to generate the first dot product set in the first row of the third tensor.
In a possible implementation, the generation unit is further configured to perform, by a third thread in the first thread set and based on the first merged-computation mode indication and the first factor register representation, a dot product operation on the first factor set and a fourth factor set of the second tensor, to generate a third dot product set in the first row of the third tensor, the fourth factor set being different from the second factor set and the third dot product set being different from the first dot product set. The storage unit is further configured to accumulate, by the third thread, the third dot product set into a third group of product registers corresponding to the first product register representation.
In a possible implementation, the first tensor multiplication instruction further includes a second merged-computation mode indication. The generation unit is further configured to perform, by the first thread and based on the second merged-computation mode indication and the first factor register representation, a dot product operation on the first factor set in the first row and the second factor set in the second tensor, to generate the first dot product set in the first row of the third tensor.
In a possible implementation, the generation unit is further configured to perform, by a fourth thread in the second thread set and based on the second merged-computation mode indication and the first factor register representation, a dot product operation on a fifth factor set and a sixth factor set of the second tensor, to generate a fourth dot product set in the first row of the third tensor, the fifth factor set being different from the first factor set, the sixth factor set being different from the second factor set, and the fourth dot product set being different from the first dot product set. The storage unit is further configured to accumulate, by the fourth thread, the fourth dot product set into the first group of product registers corresponding to the first product register representation.
In a possible implementation, the first tensor multiplication instruction further includes a transpose indication. The generation unit is further configured to perform, by the first thread and based on the transpose indication and the first factor register representation, a dot product operation on the first factor set in the first row and the second factor set in the second tensor, to generate the first dot product set in the first row of the third tensor.
In a possible implementation, the generation unit is further configured to: load factors of multiple rows of the second tensor into a cache based on the transpose indication and the memory logical address; select factors column-wise from the factors of the multiple rows to form the second factor set; and perform, by the first thread and based on the first factor register representation, a dot product operation on the first factor set in the first row and the second factor set, to generate the first dot product set in the first row of the third tensor. In a possible implementation, unselected factors of the multiple rows are retained in the L1 cache until those unselected factors are selected for the matrix multiplication computation.
In a possible implementation, the first thread set provides the second factor set corresponding to the memory logical address, by broadcast and in parallel, to all threads in the second thread set.
In a possible implementation, the memory logical address includes segment base data and offset data, the segment base data representing the starting address of the second tensor, and the offset data representing the offsets of the second tensor along each of its multiple dimensions.
In a possible implementation, the first product register representation corresponds to one or more product registers, the number of which is related to the merged-computation mode and the number of columns of the second tensor. The product registers of different threads constitute a result tensor, with each thread's product registers holding part or all of a row of the result tensor; the result tensor has the same number of rows as the first tensor and the same number of columns as the second tensor.
In a possible implementation, the number of product registers within each thread of the second thread set is variable and depends on the execution condition of the first tensor multiplication instruction, the execution condition determining access to the columns of the second tensor; if a first column of the second tensor is not accessed, that column does not participate in the matrix multiplication computation.
In a possible implementation, the first tensor multiplication instruction is issued multiple times: it is first issued as a memory instruction to fetch column data or row data of the second tensor; and, in response to the column data or row data of the second tensor having been fetched and the data of the first tensor having been stored in the first factor register, the first tensor multiplication instruction is issued a second or further time as a mathematical computation instruction to compute the per-column results within a row of the third tensor.
In a possible implementation, the accelerator further includes a checking unit configured to check the token status corresponding to the first factor register before the second or subsequent issue; if the token status indicates that the data of the first tensor has been stored in the first factor register, the instruction is issued as a mathematical computation instruction; otherwise the issue queue is blocked until the data of the first tensor has been stored in the first factor register.
In a possible implementation, the accelerator further includes an out-of-bounds checking unit configured to determine, based on the first product register representation, whether the product register usage range for the third tensor exceeds the range of the register file within a single thread; and if it is determined that the product register usage range for the third tensor exceeds the range of the register file within a single thread, to ignore the computation or memory-access operations that exceed the register file range and report an error.
In a possible implementation, the first thread set provides the second factor set corresponding to the memory logical address, by broadcast and in parallel, to the compute units of all threads in the second thread set, without providing it to the registers of those threads.
According to the methods and electronic devices of embodiments of the present disclosure, programmers can consider thread task allocation from a matrix perspective, so that one or more threads can be used to compute the dot product of one row of the first factor matrix with the second factor matrix and accumulate the corresponding results into product registers within the same thread, thereby increasing programming flexibility for matrix multiplication and improving the execution efficiency of matrix multiplication.
Description of the Drawings
The above and other objects, features, and advantages of the present disclosure will become more apparent from the following more detailed description of exemplary embodiments of the present disclosure taken in conjunction with the accompanying drawings, in which the same reference numerals generally denote the same components.
Fig. 1 shows a schematic diagram of an example environment in which various embodiments of the present disclosure can be implemented;
Fig. 2 shows a schematic block diagram of a chip according to an embodiment of the present disclosure;
Fig. 3 shows a schematic block diagram of a three-dimensional tensor according to an embodiment of the present disclosure;
Fig. 4 shows a schematic diagram of page allocation of image data according to an embodiment of the present disclosure;
Fig. 5 shows a schematic diagram of matrix multiplication according to an embodiment of the present disclosure;
Fig. 6 shows a schematic diagram of a portion of a matrix multiplication according to an embodiment of the present disclosure;
Fig. 7 shows a schematic diagram of a portion of a matrix multiplication according to another embodiment of the present disclosure;
Fig. 8 shows a schematic flowchart of a method performed by an accelerator according to an embodiment of the present disclosure; and
Fig. 9 shows a schematic block diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be implemented in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As used herein, the term "include" and its variants denote open inclusion, i.e., "including but not limited to". Unless otherwise stated, the term "or" means "and/or". The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first", "second", and the like may refer to different or identical objects. Other explicit and implicit definitions may also be included below.
As mentioned above, conventional matrix multiplication based on internal hardware accelerators such as GPUs is usually opaque to programmers. Programmers therefore generally do not know how the hardware performs matrix multiplication and thus cannot optimize the matrix multiplication computation for the hardware, which typically leads to low program execution efficiency and low tensor processing efficiency.
In some embodiments of the present disclosure, programmers can consider thread task allocation from the perspective of the row-column structure of the matrices, so that one or more threads can be used to compute the dot product of one row of the first factor matrix with the second factor matrix and accumulate the corresponding results into product registers within the same thread, thereby increasing programming flexibility for matrix multiplication and improving the execution efficiency of matrix multiplication.
Fig. 1 shows a schematic diagram of an example environment 100 in which various embodiments of the present disclosure can be implemented. The example environment 100 may be, for example, an electronic device with computing capability, such as a computer. In one embodiment, the example environment 100 includes, for example, a central processing unit (CPU) 20, a system memory 10, a northbridge/memory bridge 30, an accelerator 40, a device memory 50, and a southbridge/input-output (IO) bridge 60. The system memory 10 may be, for example, a volatile memory such as a dynamic random access memory (DRAM). The northbridge/memory bridge 30 integrates, for example, a memory controller and a PCIe controller, and is responsible for data exchange between the CPU 20 and high-speed interfaces as well as for bridging the CPU 20 and the southbridge/IO bridge 60. The southbridge/IO bridge 60 is used for the computer's low-speed interfaces, such as a Serial Advanced Technology Attachment (SATA) controller. The accelerator 40 may include, for example, a device or chip such as a graphics processing unit (GPU) and/or an artificial intelligence (AI) accelerator for accelerated processing of data such as graphics and video. In one embodiment, the accelerator 40 may be a GPU. In another embodiment, the accelerator 40 may be an AI chip. The device memory 50 may be, for example, a volatile memory such as a DRAM located outside the accelerator 40. In the present disclosure, the device memory 50 is also referred to as off-chip memory, i.e., memory located outside the chip of the accelerator 40. In contrast, the accelerator 40 also has volatile memory inside its chip, such as a level-one (L1) cache and optionally a level-two (L2) cache. This will be described in detail below in connection with some embodiments of the present disclosure. Although Fig. 1 shows one example environment 100 in which various embodiments of the present disclosure can be implemented, the present disclosure is not limited thereto. Some embodiments of the present disclosure may also be used in application environments having accelerators such as GPUs, for example under the ARM architecture and the RISC-V architecture.
Fig. 2 shows a schematic block diagram of an accelerator 200 according to an embodiment of the present disclosure. The accelerator 200 may be, for example, one specific implementation of the chip of the accelerator 40 in Fig. 1. The accelerator 200 is, for example, an accelerator chip such as a GPU. In one embodiment, the accelerator 200 includes a stream processor (SP) 210, a page table device 220, a processing engine (PE) unit 230, a direct memory access (DMA) controller 240, an L1 cache 260, and an L2 cache 250.
The accelerator 200 is controlled by a host device such as the CPU 20 and receives instructions from the CPU 20. The SP 210 analyzes the instructions from the CPU 20 and assigns the analyzed operations to the PE unit 230, the page table device 220, and the DMA controller 240 for processing. The page table device 220 is used to manage the on-chip virtual storage of the accelerator 200. In the present disclosure, the L2 cache 250 and off-chip memory such as the device memory 50 in Fig. 1 constitute a virtual storage system. The page table device 220 is jointly maintained by the SP 210, the PE unit 230, and the DMA controller 240.
The PE unit 230 includes a plurality of processing engines (PEs) PE_1, PE_2, ..., PE_N, where N represents an integer greater than 1. Each PE in the PE unit 230 may be a single-instruction multiple-thread (SIMT) device. Within a PE, each thread can have its own register file, and all threads of each PE also share a uniform register file. Multiple PEs can perform the same or different processing work in parallel, and can perform the address translation and the access to target data in memory described below in parallel, thereby reducing processing time. It can be understood that the target elements processed by the multiple PEs are not the same, and the segments, pages, and cache lines where the target elements are located, as well as the attributes, sizes, and dimension ordering of the elements, may differ, as described in detail below.
In one embodiment, the logical address of a target element can be expressed as seg:RF:imm, where seg represents a segment base-address register, RF represents an offset register, and imm represents an immediate offset. From the tensor perspective, the logical address can include reference data and offset data of the target element in each dimension of a first segment of the tensor. The offset data represents the offset of the target element in each of the multiple dimensions of the first segment, and the segment reference data is the address of the segment starting point.
In one embodiment, the first segment includes at least one page, and the accelerator 200 may convert the logical address into a linear address based at least on the size of each dimension of the target element page. The linear address includes a one-dimensional page identifier of the target element page and a one-dimensional offset value of the target element within the target element page. Specifically, the accelerator 200 can obtain the page-number offset of the target element in each dimension according to the page size in each dimension within the first segment, thereby obtaining the one-dimensional identifier of the page where the target element is located. For example, if the target element is located at the top level of the tensor in Figure 3, the page identifier of the target element can be determined to be P[1] in the above manner.
In addition, the accelerator can also obtain the relative offset of the target element in each dimension within that page and, on this basis, determine the one-dimensional linear offset of the target element relative to the starting position of the page. The one-dimensional identifier of the page and the one-dimensional linear offset within the page together constitute the linear address of the target element.
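The logical-to-linear conversion described above can be sketched in Python. This is an illustrative model only, not the accelerator's hardware logic; the function name, the row-major flattening order, and the assumption that page sizes evenly divide the segment are all choices made for the example.

```python
def logical_to_linear(coords, seg_shape, page_shape):
    """Convert per-dimension element offsets within a segment into a
    linear address: (one-dimensional page id, one-dimensional in-page offset).

    coords     : element offsets in each dimension of the segment
    seg_shape  : segment size in elements, per dimension
    page_shape : page size in elements, per dimension (assumed to divide seg_shape)
    """
    # Number of pages along each dimension of the segment.
    pages_per_dim = [s // p for s, p in zip(seg_shape, page_shape)]
    # Page-number offset of the element in each dimension.
    page_idx = [c // p for c, p in zip(coords, page_shape)]
    # Element offset inside the page, in each dimension.
    in_page = [c % p for c, p in zip(coords, page_shape)]

    # Flatten the per-dimension page indices into a one-dimensional page id.
    page_id = 0
    for idx, n in zip(page_idx, pages_per_dim):
        page_id = page_id * n + idx
    # Flatten the per-dimension in-page offsets into a linear offset.
    offset = 0
    for idx, n in zip(in_page, page_shape):
        offset = offset * n + idx
    return page_id, offset
```

For a two-dimensional 8x8 segment divided into four 4x4 pages, as in the image example of Figure 4, the element at offsets (5, 2) falls in page 2 at in-page offset 6.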
The accelerator 200 converts the linear address into a physical address based on the page table entry for the target element page; the page table entries include the physical address of each page of the at least one page. Specifically, in one embodiment, after obtaining the page identifier of the target element, the accelerator 200 can look up the corresponding entry in the page table device 220 according to the page identifier to obtain the physical address of the page. This physical address, plus the one-dimensional linear offset of the target element within the target element page, is the physical address of the target element. The physical address may represent the storage address of the target element in the off-chip device memory 50 or in on-chip memory, such as the L2 cache 250. Alternatively, the page table entry of the target element page may store a physical address relative to another page, in which case the physical address of the target element is obtained based on the offset of the target element page relative to the other page, the physical address of the other page, and the one-dimensional linear offset.
In addition to the physical address, a page table entry can also include other attributes, for example a status indicating whether the page has finished loading, that is, whether it is available. The present disclosure does not limit this. Although a two-level address translation is shown here, the present disclosure is not limited thereto. Alternatively, more levels of translation may be used. For example, the page offset, the cache-line offset, and the element offset may be computed hierarchically and added in turn to the physical address to obtain the final physical address of the target element.
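The page-table lookup step can be sketched as follows. The dictionary-based page table, the field names `phys_base` and `loaded`, and the error handling are assumptions made for this example; the disclosure only specifies that an entry holds a page's physical address and optional attributes such as a load status.

```python
def linear_to_physical(page_table, page_id, offset):
    """Resolve a linear address (page_id, in-page offset) to a physical address.

    page_table maps page_id -> {"phys_base": int, "loaded": bool}, where
    "loaded" models the status attribute indicating page availability.
    """
    entry = page_table[page_id]
    if not entry["loaded"]:
        # The page has not been moved on-chip yet, so it is unavailable.
        raise RuntimeError(f"page {page_id} is not loaded")
    return entry["phys_base"] + offset

# Example: two pages resident on-chip at different physical base addresses.
page_table = {
    0: {"phys_base": 0x1000, "loaded": True},
    1: {"phys_base": 0x2000, "loaded": True},
}
addr = linear_to_physical(page_table, 1, 6)  # 0x2006
```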
In one embodiment, the accelerator 200 moves a first page of the plurality of pages from the off-chip memory into the on-chip memory and establishes a first page table entry corresponding to the first page; the first page table entry stores the physical address of the first page in the memory. If the first page of the plurality of pages is moved from the memory back to the off-chip memory, the accelerator 200 may delete the first page table entry corresponding to the first page.
The accelerator converts the logical address of the target element in the first segment S1 into a physical address in the on-chip virtual memory. The on-chip virtual memory may include the on-chip L2 cache 250 and the off-chip device memory 50. The logical address includes segment reference data and offset data of the first segment within the tensor; the segment reference data and the offset data respectively represent the base address and the offset of the target element in each of the multiple dimensions of the first segment.
Each thread can perform thread-level data exchange between its own register file and the memory subsystem. Each thread has its own arithmetic-logic execution unit and uses its own storage addresses, following a typical load-store architecture. Each execution unit includes a floating-point/fixed-point unit supporting multiple data types, as well as an arithmetic logic unit.
Most instructions perform arithmetic and logical operations, for example, addition, subtraction, multiplication, and division of floating-point and fixed-point numbers, or logical AND, OR, NOT, and so on. The operands come from registers. Memory read and write instructions provide data exchange between the registers and on-chip/off-chip memory. In general, all execution units in a PE execute the same instruction synchronously. By using predicate registers, some of the execution units can be masked, thereby implementing the function of branch instructions.
In one embodiment, the accelerator 200 of FIG. 2 may, for example, perform the following operations: 1) assemble the page table entry contents and initial states; 2) move data from an off-chip memory, such as the device memory 50 in FIG. 1, to an on-chip memory, such as the L2 cache 250; 3) start and execute a program; 4) define each segment and describe the tensor and storage attributes; 5) when program execution is complete, write the data of the execution result to the off-chip memory.
It can be understood that in the disclosed embodiments, the data processed by the accelerator 200 is mainly multi-dimensional tensors. For example, in one embodiment, the tensor may be a four-dimensional tensor having four dimensions D1, D2, D3, and D4, and the sizes of the tensor in the respective dimensions may differ. In other embodiments, the tensor may be a one-dimensional, two-dimensional, three-dimensional, or higher-dimensional tensor, which is not limited by the present disclosure.
In addition, in embodiments of the present disclosure, a tensor may internally support element types such as uint8, int8, bfloat16, float16, uint16, int16, float32, int32, uint32, and other custom element types, which the present disclosure also does not limit. For tensor addressing, the element is the basic unit. For example, if the element type is int8, the basic addressing unit is one byte. As another example, if the element type is int16, the basic addressing unit is two bytes, and so on.
In some cases, the amount of data contained in a tensor may be large, while the capacity of the L2 cache 250 is limited, so the tensor as a whole cannot be loaded into the on-chip L2 cache 250. In some embodiments of the present disclosure, to facilitate parallel processing of the tensor, the tensor may be divided into at least one segment. When the tensor includes only one segment, the tensor is the segment. When the tensor includes multiple segments, each segment is a part of the tensor. The CPU 20 can specify, via instructions, which PE processes each part of a segment.
Figure 3 shows a schematic block diagram of a three-dimensional tensor 300 according to one embodiment of the present disclosure. The three-dimensional tensor 300 has three dimensions D1, D2, and D3, and includes a first segment S1, a second segment S2, and a third segment S3. The CPU 20 may specify that the tensor elements of segment S1 are processed by PE_1, PE_2, PE_3, PE_4, PE_5, PE_6, PE_7, and PE_8. In addition, the CPU 20 also specifies that the tensor elements of the second segment S2 are processed by PE_1 through PE_4. In embodiments of the present disclosure, each segment may have a different size, so programmers can flexibly configure the segments based on design needs. In fact, the division into pages can be implemented in any one or more dimensions, and the numbers of pages in the respective dimensions are independent of one another.
In one embodiment, the tensor data may be stored in an on-chip high-speed memory, such as the L2 cache 250. Because the capacity of the on-chip high-speed memory is small, when the tensor is large, programmers can divide the tensor into multiple segments, each segment describing a part of the tensor. The kernel (core program) can be launched multiple times; each time, the DMA controller 240 moves one segment of the tensor from off-chip storage to on-chip storage in advance, for use by the kernel operations. After multiple kernel launches, all segments contained in the tensor have been processed, and the entire run ends. When the on-chip high-speed memory is sufficient to hold all the tensors that the kernel needs to access, one tensor needs only one segment description, and the kernel needs to be launched only once.
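The segment-at-a-time execution model just described can be sketched abstractly as follows. The function names `dma_load` and `kernel` are placeholders standing in for the DMA transfer and the kernel launch; they are assumptions for the example, not interfaces defined in the disclosure.

```python
def run_kernel_over_segments(segments, dma_load, kernel):
    """Process a tensor one segment at a time.

    segments : list of segment descriptors covering the tensor
    dma_load : moves one segment from off-chip to on-chip storage
    kernel   : operates on the on-chip copy of one segment
    """
    results = []
    for seg in segments:
        on_chip = dma_load(seg)          # DMA moves the segment on-chip first
        results.append(kernel(on_chip))  # then the kernel is launched on it
    return results
```

With a tensor described by a single segment, the loop body runs once, matching the single-launch case described above.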
Further, in some embodiments of the present disclosure, at least one page can be set within a segment to further subdivide the tensor. For example, the first segment S1 has four pages P[1], P[2], P[3], and P[4], while the second segment S2 has only one page. In embodiments of the present disclosure, each segment may have a different number of pages, so programmers can flexibly configure the page size within a segment based on design needs, for example, configuring the pages so that each fits entirely into the L2 cache 250.
As mentioned above, when addressing a tensor, the smallest addressing unit is the element, and a page can usually include multiple elements. The page where a target element is located is referred to herein as the "target element page." In some embodiments of the present disclosure, a page may include multiple cache lines. When the target element page is located in the L2 cache 250 and a PE reads the target element via the L1 cache 260, the L2 cache 250 needs to transfer, as a whole, a small portion of physically contiguous data in the L2 cache 250 that includes the target element to the L1 cache 260. This small portion of data is also called cache-line data, and this caching mechanism is based on the principle of spatial locality. It takes a PE only a few clock cycles to read data from the L1 cache 260, whereas it may take the L1 cache 260 dozens or even hundreds of clock cycles to read data from the L2 cache 250. Therefore, it is desirable to reduce the number of times the L1 cache 260 reads data from the L2 cache 250. Although "cache line" is used here to describe the minimum unit of data transferred from the L2 cache 250 to the L1 cache 260, in the present disclosure this portion of data is not necessarily arranged in a row or column; the data within one "cache line" may be distributed across multiple dimensions, and the size of the data in each dimension is not limited to 1. The PEs process the data within a segment in parallel; the allocation of PEs unfolds in the logical address space of the data and is independent of the physical storage structure of the segment, as described below.
In Figure 3, a first group of cache lines in the first page P[1] is designated to be processed by PE_1, and a second group of cache lines is designated to be processed by PE_2. Although the tensor is shown here as being processed by multiple PEs in sequence, it can be understood that the processing of the tensor data is independent of the order of the PEs, and the present disclosure does not limit this. For example, the portion of the tensor data shown as PE_2 in Figure 3 may be processed by PE_M, where M represents any integer not greater than N.
Figure 4 shows a schematic diagram of the page allocation of image data 400 according to one embodiment of the present disclosure. Image data is a typical two-dimensional tensor. In one embodiment, the image data 400 is, for example, 8x8 pixels. In other words, the image data 400 has 8 pixels in the first dimension D1 and also 8 pixels in the second dimension D2, and therefore has pixels P00, P01, ..., P77. In the embodiment of Figure 4, the image data 400 has only one segment but is divided along two dimensions into four pages P[1], P[2], P[3], and P[4]. The four pages can be divided along the second dimension D2 for allocation to PE_1 and PE_2 for processing, or divided along the first dimension D1 for allocation to PE_1 and PE_2 for processing. In addition, the division can also be diagonal. The present disclosure does not limit this.
Figure 5 shows a schematic diagram of a matrix multiplication 500 according to one embodiment of the present disclosure. A tensor can generally include one or more dimensions, and a two-dimensional tensor can be regarded as a matrix. In some situations, it may be necessary to multiply two two-dimensional matrices to obtain a product matrix. In the present disclosure, for the matrix multiplication C = A x B, matrix C represents the product matrix, matrix A represents the first factor matrix, and matrix B represents the second factor matrix. In Figure 5, the first factor matrix A 502 is multiplied by the second factor matrix B 504 to obtain the product matrix C 506. In the present disclosure, a "dot product operation" may include the multiplication of corresponding matrix elements and, optionally, the addition of the products. Specifically, the first factor matrix 502 may be an m x k matrix and the second factor matrix 504 may be a k x n matrix, where m, k, and n all represent positive integers. According to the rules of matrix multiplication, the product matrix is therefore an m x n matrix. It follows that the first factor matrix 502 includes m rows and k columns, the second factor matrix 504 includes k rows and n columns, and the product matrix therefore includes m rows and n columns.
When performing the matrix multiplication, a dot product operation can be performed on the first row A[1][1] ... A[1][k] and B[1][1] ... B[k][1] to obtain C[1][1]. Specifically, C[1][1] can be expressed by the following formula (1):
C[1][1] = A[1][1]×B[1][1] + A[1][2]×B[2][1] + ... + A[1][k]×B[k][1]   (1)
Similarly, dot product operations can be performed to obtain C[m][1] and C[m][n], which can be expressed by the following formulas (2) and (3):
C[m][1] = A[m][1]×B[1][1] + A[m][2]×B[2][1] + ... + A[m][k]×B[k][1]   (2)
C[m][n] = A[m][1]×B[1][n] + A[m][2]×B[2][n] + ... + A[m][k]×B[k][n]   (3)
It can be seen that matrix C includes m x n matrix elements, and each matrix element is the sum of k product results. In the present disclosure, for the above product matrix C = A x B, a "product result" refers to the result of multiplying one matrix element of matrix A by one matrix element of matrix B, while a "dot product result" refers to the result of multiplying multiple matrix elements of matrix A by the corresponding matrix elements of matrix B and adding the multiple product results together.
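Formulas (1) through (3) amount to the standard triple loop below, written as a Python reference model to make the product-result/dot-product terminology concrete. The function name is illustrative.

```python
def matmul(A, B):
    """Reference m x k by k x n matrix multiplication.

    Each element C[i][j] is a dot product result: the sum of k
    individual product results A[i][t] * B[t][j] for t = 0..k-1.
    """
    m, k, n = len(A), len(B), len(B[0])
    C = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            for t in range(k):
                C[i][j] += A[i][t] * B[t][j]  # accumulate one product result
    return C
```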
Figure 6 shows a schematic diagram of a matrix multiplication 600 according to one embodiment of the present disclosure. In one embodiment, the product matrix C 602 may include m rows and n columns, with each row corresponding to one thread. Each thread includes n registers for storing the n dot product results of its row. When the PE executes, the m threads can run in parallel to improve execution efficiency. In the specific execution process, all registers corresponding to matrix C can first be initialized to 0. Taking C[1][1] as an example, as shown in formula (1) above, the computation of C[1][1] includes k multiplications and k-1 additions (in practice this is equivalent to k accumulations, because the matrix elements are initialized to 0 and the first product is accumulated onto 0). The computation then proceeds in order: for example, the first thread first computes the first product result A[1][1]×B[1][1] of matrix element C[1][1], while in parallel the second thread first computes the first product result A[2][1]×B[1][1] of matrix element C[2][1], and so on. That is, all m threads first compute the first product result of the first matrix element of their corresponding rows of matrix C. It can be understood that at this point, neither has the complete result of the first column of the product matrix C 602 been obtained, nor has the computation of the columns other than the first column of each row of the product matrix C 602 begun.
The first thread then computes the first product result A[1][1]×B[1][2] of the second-column element C[1][2], while in parallel the second thread computes the first product result A[2][1]×B[1][2] of matrix element C[2][2], and so on. That is, the m threads compute the first product result of the second matrix element of their corresponding rows of matrix C. At this point, the complete results of the first and second columns of the product matrix C 602 have not been obtained, and the computation of the columns other than the first and second columns of each row of the product matrix C 602 has not begun. After the m threads have computed in parallel through the n-th round, the first product result of every matrix element in every column of every row of the product matrix C 602 has been obtained. The first thread then computes the second product result A[1][2]×B[2][1] of matrix element C[1][1] and adds it to the first product A[1][1]×B[1][1]; in parallel, the second thread first computes the second product result A[2][2]×B[2][1] of matrix element C[2][1] and adds it to the first product A[2][1]×B[1][1]; and so on. After the m threads have again computed in parallel through the n-th round, all columns of matrix C 602 have been computed. That is, the m threads each first compute, for every element of their corresponding rows of matrix C, the result of adding the second product to the first product.
This continues until the k-th product result of each matrix element has been computed and added to the sum of the preceding k-1 product results, yielding the final matrix C 604. In other words, the computation of matrix C 604 actually includes k rounds. Each round computes a part of each matrix element of matrix C and accumulates the computed result, in the corresponding register, with the results of the previous rounds. As shown in Figure 6, each matrix element of matrix C 602 has the same color pattern, indicating that each matrix element has undergone the same number of rounds of multiply-accumulate. Each matrix element of matrix C 604 is the final result obtained after k rounds of accumulation, so each matrix element is darker in color than in matrix C 602.
Although in the embodiment of Figure 6 only one product result is computed at a time and accumulated with the previous results in the register, this is only illustrative and does not limit the scope of the present disclosure. In other embodiments, multiple product results may be computed and accumulated in each round. For example, the k dimension may be divided into s segments, and each round computes the accumulation of the product results within one segment. For example, in the case of s = k/2, for C[1][1] the first round can compute A[1][1]×B[1][1] + A[1][2]×B[2][1]. After s rounds, the complete value of C[1][1] is obtained. In this way, the computing resources of the PE unit can be used more flexibly based on how they are allocated, giving programmers greater programming flexibility.
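The accumulation order of Figure 6, including the variant that splits the k dimension into s segments, can be modeled as follows. This is a sequential sketch: the inner loop over rows stands in for the m threads running in parallel, and the function name and the requirement that s divide k evenly are assumptions for the example.

```python
def matmul_rounds(A, B, s):
    """Accumulate C = A @ B in the round order of Figure 6.

    The k dimension is split into s equal slices. In each outer round,
    every row ("thread") visits columns 0..n-1 in turn and adds one
    slice's worth of product results into its register for that column.
    """
    m, k, n = len(A), len(B), len(B[0])
    assert k % s == 0
    step = k // s                       # product results accumulated per visit
    C = [[0] * n for _ in range(m)]     # registers initialized to 0
    for r in range(s):                  # one round per k-slice
        lo, hi = r * step, (r + 1) * step
        for j in range(n):              # columns visited in order
            for i in range(m):          # all m "threads" in parallel
                C[i][j] += sum(A[i][t] * B[t][j] for t in range(lo, hi))
    return C
```

With s = k this reduces to one product result per visit, as in Figure 6 itself; with s < k each visit accumulates a partial dot product, as in the variant above.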
Figure 7 shows a schematic diagram of a matrix multiplication 700 according to another embodiment of the present disclosure. Unlike in Figure 6, in Figure 7 the multiple threads can first complete, in parallel, the accumulation of all product results of the matrix elements of one column of matrix C, and then compute the matrix elements of the next column of matrix C, proceeding column by column. As shown in Figure 7, the matrix elements of the first column of matrix C 702 have a darker color than those of the n-th column, indicating that the matrix elements of the first column have all undergone the same number of rounds of multiply-accumulate, while the matrix elements of the last column have not yet been computed at this point, for example, they still hold the initial value 0. Each matrix element of matrix C 704 is the final result obtained after k rounds of accumulation; the color of the matrix elements of the first column of matrix C 704 is the same as that of the first column of matrix C 702, indicating that the first column of matrix C 702 is computed to completion before the next round of computation proceeds. Similar to the embodiment of Figure 6, the k dimension can also be divided into s segments, with each round computing the accumulation of the product results within one segment.
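The column-first order described for Figure 7 differs from Figure 6 only in loop structure: each column of C is fully accumulated over the whole k dimension before the next column is started. A sequential Python sketch (the loop over rows again stands in for the parallel threads; the function name is illustrative):

```python
def matmul_column_first(A, B):
    """Accumulate C = A @ B in the order of Figure 7: finish each column
    of C completely before moving on to the next column."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0] * n for _ in range(m)]
    for j in range(n):                  # column j is completed first...
        for t in range(k):              # ...accumulating all k product results
            for i in range(m):          # all m threads in parallel
                C[i][j] += A[i][t] * B[t][j]
    return C
```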
Although in the embodiments of Figures 6 and 7 each row of the product matrix C is obtained by one thread performing the matrix computation, this is only illustrative and does not limit the scope of the present disclosure. When the number of threads is significantly greater than the number of matrix rows, for example, when the number of threads is two, three, or more times the number of rows of the product matrix, two, three, or more threads may be used for each row of the product matrix C to compute the product matrix C, as described below.
Since one row of the product matrix C can be obtained by one or more threads performing the matrix computation, a programmer can flexibly allocate threads according to the numbers of rows and columns of the first factor matrix A, the second factor matrix B, and the resulting product matrix C. Specifically, in some embodiments, a tensor multiplication instruction can assign each thread the relevant information of the first factor matrix A, the second factor matrix B, and the product matrix C for its part of the matrix multiplication task, so as to use the computing resources in the PE unit flexibly and efficiently. The general concept of matrix multiplication has been described above with reference to Figures 5-7; some embodiments of matrix multiplication will now be described in detail with reference to Figure 8.
Figure 8 shows a schematic flowchart of a method 800 performed by an accelerator according to one embodiment of the present disclosure. The method 800 is used to perform the matrix multiplication described above in connection with Figures 5-7. At 802, a first tensor multiplication instruction for a first thread set of the accelerator is received; the first tensor multiplication instruction includes a first thread indication for the first thread set, a first factor register representation for a first tensor, a memory logical address for a second tensor, and a first product register representation for a third tensor. In one embodiment, the electronic device may have two thread sets, where the first thread set is used to broadcast the data of matrix B to the computing units of the threads in the second thread set. For example, the first thread set provides the second factor set corresponding to the memory logical address, in the form of a broadcast, in parallel to all or some of the threads in the second thread set. In other words, the first thread set is configured to broadcast the data of matrix B, while the second thread set is configured to perform A x B in response to receiving the data of matrix A. Each thread in the second thread set includes a first group of registers and a second group of registers, where the first group of registers is used to store at least a part of the data of one row of the first factor matrix, and the second group of registers is used to store the data of one row of the product matrix.
第一张量乘法指令的一个示意性示例例如是@p1,mm.R0,ur4:rf290:0x00,R256,其中@p1表示与第一线程相关联的保护谓词操作数。@p1例如可以是第一线程的布尔谓词变量。如果谓词值为假,则不执行该条指令的数据加载操作。如果谓词值为真,则以ur4:rf290:0x00正常访问片上存储器,例如L1高速缓存260、L2高速缓存250或经由DMA 240控制的DDR(Double Data Rate)存储器之类的动态随机存取存储器(dynamic random access memory,DRAM),并第一线程集将得到的数据内容广播至第二线程集中的所有线程。换言之,可以提供针对各线程的执行条件,对于不满足执行条件的线程,其访存被视作超出张量地址范围而被忽略,或放弃第二线程集相应线程要执行的张量乘法操作。R0表示用于存储乘积矩阵C中的一行的各个乘积元素的第二组寄存器中的起始寄存器,例如寄存器R0-R255用于存储乘积矩阵C中的一行的各个乘积元素。ur4:rf290:0x00则表示第二因数矩阵的逻辑地址,例如是前面所述的目标元素的逻辑地址seg:RF:imm的一个具体示例。R256表示第一组寄存器中的起始寄存器,第一组寄存器用于存储第一因数矩阵中的一行中的、在一轮点积(矩阵A和矩阵B中的相应元素的乘法和累加)运算中所涉及的相关矩阵元素。在一个实施例中,第一组寄存器和第二组寄存器都位于相同的线程内,这样可以减少在计算过程中数据的传输的功耗和时间。An illustrative example of a first tensor multiply instruction is for example @p1,mm.R0,ur4:rf290:0x00,R256, where @p1 represents the guard predicate operand associated with the first thread. @p1 can for example be a boolean predicate variable for the first thread. If the predicate value is false, the data load operation for this instruction is not performed. If the predicate value is true, then ur4:rf290:0x00 is used to access the on-chip memory normally, such as L1 cache 260, L2 cache 250 or dynamic random access memory such as DDR (Double Data Rate) memory controlled by DMA 240 ( dynamic random access memory (DRAM), and the first thread set broadcasts the obtained data content to all threads in the second thread set. In other words, execution conditions for each thread can be provided. For threads that do not meet the execution conditions, their memory access is regarded as exceeding the tensor address range and is ignored, or the tensor multiplication operation to be performed by the corresponding thread of the second thread set is abandoned. R0 represents the starting register in the second set of registers used to store each product element of a row in the product matrix C. For example, the registers R0-R255 are used to store each product element of a row in the product matrix C. 
ur4:rf290:0x00 represents the logical address of the second factor matrix, which is, for example, a specific instance of the logical address seg:RF:imm of the target element described above. R256 represents the starting register of the first set of registers, which is used to store the matrix elements of a row of the first factor matrix that are involved in one round of dot product operations (multiplication and accumulation of corresponding elements of matrix A and matrix B). In one embodiment, the first set of registers and the second set of registers are located in the same thread, which reduces the power consumption and time of data transfer during the computation.
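For concreteness, the fields of the textual instruction form shown above can be pulled apart mechanically. This is only an illustrative decoder of the text notation used in this description (the field names are assumptions; actual hardware decodes a binary encoding):

```python
def parse_mm_text(instr):
    """Split a textual mm instruction such as
    '@p1,mm8.sa.ub.R0,ur4:rf290:0x00,R256' into its named fields."""
    pred, op_field, address, factor_start = instr.split(",")
    op_parts = op_field.split(".")
    seg, rf, imm = address.split(":")       # seg:RF:imm logical address
    return {
        "predicate": pred.lstrip("@"),      # guard predicate, e.g. 'p1'
        "opcode": op_parts[0],              # 'mm' or 'mm8'
        "modifiers": op_parts[1:-1],        # e.g. ['sa', 'ub'] or ['KA2', ...]
        "product_start": op_parts[-1],      # start of product registers, e.g. 'R0'
        "address": {"seg": seg, "rf": rf, "imm": int(imm, 16)},
        "factor_start": factor_start,       # start of factor registers, e.g. 'R256'
    }

f = parse_mm_text("@p1,mm8.sa.ub.R0,ur4:rf290:0x00,R256")
assert f["opcode"] == "mm8" and f["modifiers"] == ["sa", "ub"]
assert f["product_start"] == "R0" and f["factor_start"] == "R256"
assert f["address"] == {"seg": "ur4", "rf": "rf290", "imm": 0}
```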
It can be understood that the first product register representation may correspond to one or more product registers. The number of product registers is related to the merged computation mode and the number of columns of the second tensor, as detailed below. The product registers of the different threads form a result tensor whose number of rows equals that of the first tensor and whose number of columns equals that of the second tensor. For example, 256 threads can form a result tensor with 256 rows. The product register file of each thread holds part or all of a row of the result tensor. For example, the product register file of each thread may correspond to one row of the result tensor. In the merged computation mode, the product registers of each thread may correspond to part of a row of the result tensor.
Furthermore, it can be understood that the number of product registers within a thread of the second thread set is variable. The number of product registers depends on the execution condition of the first tensor multiplication instruction, which determines the access to the columns of the second tensor. For example, in some cases only a portion of the product registers within the threads of the second thread set may be used; in other cases another portion, or all, of the product registers are used. If the first column of the second tensor is not accessed, that column does not participate in the matrix multiplication.
In one specific implementation, the first tensor multiplication instruction may be issued two or more times. In the first issue, the first tensor instruction is issued to the memory system: the matrix multiply (mm) instruction may be fetched from the cache or instruction segment of the accelerator 200, sent to the pipeline unit of the accelerator 200, and, after decoding, issued as a regular memory access instruction whose access address is a seg:RF:imm such as ur4:rf290:0x00. In other words, the first tensor multiplication instruction is first issued as a memory access instruction in order to fetch the column data or row data of the second tensor.
In response to the column data or row data of the second tensor having been fetched, and the data of the first tensor having been stored in the first factor register, the first tensor multiplication instruction is issued a second or further time as a mathematical computation instruction, in order to compute the results of the columns within a row of the third tensor.
The accelerator 200 may read the data block registers corresponding to matrix C and matrix A, for example R0-R255 and R256-R257, then read the data block of the second factor matrix B obtained during the first issue, perform the dot product operation, and write the intermediate result into a corresponding register, for example one of R0-R255. In this way, during the matrix multiplication, the execution unit of the first thread can read the data of a row of the first factor matrix from the first set of registers only once and reuse it in subsequent dot product operations. It can be understood that in some cases the product register usage range for the third tensor may exceed the range of the register file within a single thread. For example, the data block registers R0-R255 may be insufficient to store one row of product data of the third tensor, e.g. when one row of product data of the third tensor requires 300 data registers. In one embodiment, the accelerator 200 may determine, based on the first product register representation, whether the product register usage range for the third tensor exceeds the range of the register file within a single thread. If it is determined that the product register usage range for the third tensor exceeds the range of the register file within a single thread, the computation or memory access operations that exceed the register file range are ignored and an error is reported.
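The reuse pattern just described — read the A-row factor registers once, then accumulate into the C-row product registers as each broadcast segment of B arrives — can be modeled as follows (a behavioral sketch only; registers are mapped onto NumPy arrays, and the segment sizes are illustrative):

```python
import numpy as np

def accumulate_row(a_row, b_segments, n_cols):
    """Model one thread: a_row is read from the factor registers once and
    reused as each broadcast segment of B arrives; partial dot products
    are accumulated into the product registers (one per column of C)."""
    product_regs = np.zeros(n_cols)          # models R0..R(n_cols-1)
    k_done = 0
    for b_seg in b_segments:                 # one broadcast per re-issue
        depth = b_seg.shape[0]               # rows of B covered this round
        a_seg = a_row[k_done:k_done + depth] # reused slice of the A row
        product_regs += a_seg @ b_seg        # accumulate partial dot products
        k_done += depth
    return product_regs

a_row = np.array([1.0, 2.0, 3.0, 4.0])
B = np.arange(8.0).reshape(4, 2)
segments = [B[:2, :], B[2:, :]]              # B delivered in two broadcasts
assert np.allclose(accumulate_row(a_row, segments, 2), a_row @ B)
```

The final register contents equal the full dot products regardless of how the accumulation dimension is segmented, which is what allows the instruction to be issued multiple times.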
In some embodiments, before re-issuing, the ready state of the first factor register needs to be checked, specifically by checking its corresponding token state: if the token state indicates that the first factor is ready, the instruction is issued as a mathematical computation instruction; otherwise the issue queue is blocked until the first factor register is ready. Specifically, before the second or subsequent issue, the accelerator 200 may check the token state corresponding to the first factor register. If the token state indicates that the data of the first tensor has been stored in the first factor register, the instruction is issued as a mathematical computation instruction; otherwise the issue queue is blocked until the data of the first tensor has been stored in the first factor register.
Since each thread performing the parallel mm computation involves essentially the same matrix element data blocks of the second factor matrix B, each data block of the second factor matrix B is broadcast to all threads for parallel execution. In one embodiment, the computation task for a piece of data can be completed in n steps. The computation starts from column 0 of the second factor matrix B and the product matrix C and moves backward one column at a time until all columns have been cycled through. Each thread can specify an independent column address for the mm instruction, and the data fetched for each column is broadcast to all threads for computation.
In one embodiment, the data in a column of the second factor matrix B may come from the L1 cache, the L2 cache, or off-chip memory. In this way, the data in a column of the second factor matrix can be broadcast in parallel to the execution units of multiple threads (for example, the same number as, or half the number of, the rows of the first factor matrix) and reused. This reduces data transfers between different storage devices, thereby reducing the time spent on data transfer during the matrix multiplication.
At 804, the first thread set broadcasts the second factor set in the second tensor to the second thread set based on the memory logical address for the second tensor, as described above.
At 806, the first thread in the second thread set performs, based on the first factor register representation, a dot product operation on the first factor set in the first row of the first tensor and the second factor set, to generate the first dot product set in the first row of the third tensor. The dot product operation may include multiplication operations and addition operations. The first factor register representation is, for example, R256, and the memory logical address is a seg:RF:imm such as ur4:rf290:0x00. In some embodiments, the number of product registers within each thread of the second thread set is variable and is controlled by the execution condition of the tensor multiplication instruction, which controls the access to the columns of the second tensor; if a column is not accessed, it does not participate in the matrix multiplication, and the product register corresponding to that column does not exist.
It can be understood that in embodiments of the present disclosure the matrix multiplication is not completed in one pass, but is performed over multiple executions, taking into account factors such as the register size, the data types of the matrix elements of the first factor matrix A, the second factor matrix B, and the product matrix C, and the computation capability of the computing units in the accelerator 200. In other words, the first factor register set within a single thread includes at least part of the data of a single row of the first tensor; the first factor register set includes one or more registers, the specific number of which may be determined by the data length supported by a single round of the tensor multiplication instruction, for example 2 registers, each including one or more data elements; for the int8 data type, for example, 2 registers include 8 data elements. The number of threads participating in the tensor multiplication is proportional to the number of rows of the first tensor. For example, the number of rows of the first tensor may be 256, and the number of threads participating in the tensor multiplication may also be 256.
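The remark that, for the int8 data type, 2 registers hold 8 data elements can be made concrete with a packing sketch (32-bit registers and little-endian packing order are illustrative assumptions):

```python
import struct

def pack_int8_row(elems):
    """Pack signed int8 elements into 32-bit register words, 4 per word,
    so a 2-register factor set holds 8 elements."""
    assert len(elems) % 4 == 0
    words = []
    for i in range(0, len(elems), 4):
        words.append(struct.unpack("<I", struct.pack("<4b", *elems[i:i+4]))[0])
    return words

def unpack_int8_row(words):
    """Recover the int8 elements from the packed register words."""
    out = []
    for w in words:
        out.extend(struct.unpack("<4b", struct.pack("<I", w)))
    return out

row = [1, -2, 3, -4, 5, -6, 7, -8]
regs = pack_int8_row(row)            # plays the role of R256 and R257
assert len(regs) == 2                # 8 int8 elements -> 2 registers
assert unpack_int8_row(regs) == row  # round trip preserves the data
```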
In one embodiment, the first tensor multiplication instruction may further be, for example, @p1,mm8.sa.ub.R0,ur4:rf290:0x00,R256. Its similarities to @p1,mm.R0,ur4:rf290:0x00,R256 are not repeated here; see the relevant description above. mm8 indicates that the data type of the elements involved in the matrix multiplication is 8 bits, sa indicates that the element data of the first factor matrix A associated with register R256 is signed int8, and ub indicates that the element data of the second factor matrix B associated with the logical address ur4:rf290:0x00 is unsigned uint8. It can be understood that the matrix elements of the first factor matrix A, the second factor matrix B, and the product matrix C may also be of other data types, which the present disclosure does not limit.
Since matrix multiplication involves multiple dot product operations over multiple matrix elements, in some embodiments the operations can be performed in segments, and the results of the multiple dot product operations accumulated to obtain the final mm result. In one embodiment, based for example on @p1,mm8.sa.sb.R0,ur4:rf290:0x00,R256, the first factor register representation R256 and the memory logical address ur4:rf290:0x00 can be determined. For the first thread in the second thread set, the first factor register representation R256 and the memory logical address ur4:rf290:0x00 may correspond, for example, to the first register in the first set of registers and to the data block at the base point of the tensor segment of matrix B. The first register stores the first factor set, for example A[1][1], and the data block at the base point of the tensor segment of matrix B is, for example, B[1][1]. After the matrix multiplication, the first dot product set A[1][1]×B[1][1] in the first row of the third tensor of the product matrix C can be obtained.
In another embodiment, the first factor set may include A[1][1] and A[1][2], and the second factor set may include B[1][1] and B[2][1], so the first dot product set may include A[1][1]×B[1][1]+A[1][2]×B[2][1]. In yet another embodiment, the first factor set may be A[1][1], A[1][2], and A[1][3], and the second factor set may include B[1][1], B[2][1], and B[3][1], so the first dot product set may include A[1][1]×B[1][1]+A[1][2]×B[2][1]+A[1][3]×B[3][1]. The present disclosure does not limit the ranges of the first factor set, the second factor set, and the first dot product set; these ranges may be configured flexibly by the programmer when programming the matrix multiplication, based on factors such as the data types of the matrix elements and the register capacity, for example automatically configured by setting the data type in the tensor multiplication instruction.
Although the description here uses the example of a single product element C[1][1] of the product matrix C, it can be understood that this is merely illustrative and does not limit the scope of the present disclosure. In some embodiments, a single thread can compute multiple product elements of a row of the product matrix C in parallel. For example, the first thread in the second thread set can compute in parallel the respective first dot product sets of C[1][1]-C[1][8]: A[1][1]×B[1][1], A[1][1]×B[1][2], A[1][1]×B[1][3], A[1][1]×B[1][4], A[1][1]×B[1][5], A[1][1]×B[1][6], A[1][1]×B[1][7], and A[1][1]×B[1][8]. In another embodiment, the first thread can also compute in parallel the respective first dot product sets of C[1][1]-C[1][8]: A[1][1]×B[1][1]+A[1][2]×B[2][1], A[1][1]×B[1][2]+A[1][2]×B[2][2], and so on through A[1][1]×B[1][8]+A[1][2]×B[2][8].
At 808, the first thread in the second thread set accumulates the first dot product set into the first set of product registers corresponding to the first product register representation. For example, the first thread may accumulate the dot product results computed above into the corresponding first set of product registers, for example registers R0-R7. As above, the range of registers included in the first set of product registers can be configured flexibly by the mm instruction. By decomposing the matrix and assigning threads by row, multiple threads can process multiple rows of the matrix tensor in parallel, speeding up matrix multiplication. Moreover, since the programmer knows the row-column structure of the matrix tensor and the thread layout of the accelerator at programming time, threads can be used flexibly to process the matrix multiplication in parallel, improving programming flexibility.
In some embodiments, method 800 further includes: in response to receiving the second factor set, a second thread in the second thread set performing, based on the first factor register representation, a dot product operation on the third factor set in the second row of the first tensor and the second factor set, to generate the second dot product set in the second row of the third tensor; and the second thread accumulating the second dot product set into the second set of product registers corresponding to the first product register representation. It can be understood that although the first thread and the second thread in the second thread set have the same first tensor multiplication instruction, for example @p1,mm8.sa.sb.R0,ur4:rf290:0x00,R256 in one embodiment, other instructions such as load instructions can be used to load the first row of data of the first tensor into the first thread and the second row of data into the second thread, so the first thread and the second thread can correctly perform the dot product operations based on the loaded data of the first tensor.
Identically or similarly to the first thread in the second thread set, the second thread in the second thread set also includes a first set of registers, for example R256-R257, for storing the third factor set in the second row of the first factor matrix, and further includes a second set of registers, for example R0-R255, for storing the second dot product set of the second row of the third tensor. The first thread and the second thread in effect perform parallel mm computations for the first and second rows of the first factor matrix A and for the first and second rows of the product matrix C, respectively, so the parallel computation can greatly reduce the computation time. Moreover, since there is a fixed correspondence between each thread and each matrix row, the overhead of multiple threads dynamically distributing matrix multiplication tasks according to how busy they are (for example, one thread computing two matrix rows while another computes only part of one row) can also be avoided.
In some cases, for example when the number of threads is much larger than the number of rows of the product matrix C, some threads may be left idle. For example, when a PE unit includes 64 threads and the product matrix C has only 16 rows, if each row is still assigned only one thread, 48 threads will be idle. In this case, multiple threads (for example the first thread and a third thread in the second thread set) can be used for the computation of one row of the product matrix C by setting a first merged computation mode indication or a second merged computation mode indication in the tensor multiplication instruction.
For example, in one embodiment the first tensor multiplication instruction further includes a first merged computation mode indication, for example KA2. KA2 indicates that two threads participate in the computation of one matrix row. In other embodiments, the first merged computation mode indication may include other indications such as KA1, KA3, KA4, etc., differing only in the number following KA. KA1 indicates that a single thread participates in the computation of one matrix row, KA3 indicates that three threads participate in the computation of one matrix row, and so on. In some embodiments, in the absence of a first merged computation mode indication, the default may be that a single thread performs the computation of one matrix row. When the first merged computation mode indication is KA2, an illustrative example of the first tensor multiplication instruction received by the first thread and the third thread may be @p1,mm8.KA2.sa.sb.R0,ur4:rf290:0x00,R256. It can be understood that KA1-KA4 are merely one way of representing the first merged computation mode indication; other characters or other representations may also be used.
It can be seen that by adding the first merged computation mode indication KA2, the first thread and the third thread in the second thread set jointly compute the product elements of the same row of the product matrix C. For example, the first thread may compute the first group of product elements C[1][1]-C[1][127] while the third thread computes the second group of product elements C[1][128]-C[1][256]; or the first thread may compute the first group of product elements C[1][1], C[1][3], C[1][5], ..., C[1][255] while the third thread computes the second group of product elements C[1][2], C[1][4], C[1][6], ..., C[1][256].
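The KA-style split can be sketched as several simulated threads sharing the same row of A while each producing a disjoint slice of the corresponding row of C (a contiguous, even split is shown for illustration; the interleaved split described above works analogously):

```python
import numpy as np

def ka_split_row(a_row, B, k_a):
    """KA-style merged mode sketch: k_a simulated threads share the same
    row of A, and each computes a contiguous slice of that row of C."""
    n_cols = B.shape[1]
    assert n_cols % k_a == 0
    per_thread = n_cols // k_a
    slices = []
    for t in range(k_a):                       # each loop body = one thread
        cols = slice(t * per_thread, (t + 1) * per_thread)
        slices.append(a_row @ B[:, cols])      # partial C row held in thread t
    return np.concatenate(slices)

a_row = np.array([1.0, 2.0])
B = np.arange(8.0).reshape(2, 4)
# KA2: two threads each produce half of the C row; together they match A x B
assert np.allclose(ka_split_row(a_row, B, 2), a_row @ B)
```

Concatenating the per-thread slices reproduces the full row, which is why the mode only changes which thread holds which product registers, not the result.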
In this case, based on the first merged computation mode indication and the first factor register representation, the first thread performs a dot product operation on the first factor set in the first row and the second factor set in the second tensor, to generate the first dot product set in the first row of the third tensor, and accumulates the first dot product set into the second set of registers of the first thread. Based on the first merged computation mode indication and the first factor register representation, the third thread performs a dot product operation on the first factor set and a fourth factor set of the second tensor, to generate a third dot product set in the first row of the third tensor, where the fourth factor set differs from the second factor set and the third dot product set differs from the first dot product set. The third thread further accumulates the third dot product set into a third set of product registers corresponding to the first product register representation, which third set of product registers is located in the third thread. It can be understood that the first merged computation mode indication may be used in conjunction with the embodiment described above for Figure 8, so the aspects described for Figure 8 are not repeated here.
In another embodiment, the first tensor multiplication instruction further includes a second merged computation mode indication, for example KB2. KB2 indicates that two threads jointly participate in the computation of each product element of the product matrix. In other embodiments, the second merged computation mode indication may include other indications such as KB1, KB3, KB4, etc., differing only in the number following KB. KB1 indicates that a single thread participates in the computation of each product element of the product matrix, KB3 indicates that three threads jointly participate in the computation of each product element of the product matrix, and so on. In some embodiments, in the absence of a second merged computation mode indication, the default may be that each product element is computed by a single thread. When the second merged computation mode indication is KB2, an illustrative example of the first tensor multiplication instruction received by the first thread and a fourth thread in the second thread set may be @p1,mm8.KB2.sa.sb.R0,ur4:rf290:0x00,R256. It can be understood that KB1-KB4 are merely one way of representing the second merged computation mode indication; other characters or other representations may also be used.
It can be seen that by adding the second merged computation mode indication KB2, the first thread and the fourth thread in the second thread set jointly participate in the computation of each product element of the product matrix. Specifically, for the dot product A[1][1]×B[1][1]+A[1][2]×B[2][1], for example, the first thread may compute A[1][1]×B[1][1] and the fourth thread may compute A[1][2]×B[2][1] in parallel with the first thread, after which the first thread and the fourth thread sum the results. For example, the fourth thread sends its product to the first thread, and the first thread performs the addition to obtain the dot product result. The first thread then accumulates the dot product result into the product register. For A[1][1]×B[1][2]+A[1][2]×B[2][2] through A[1][1]×B[1][8]+A[1][2]×B[2][8], the first thread and the fourth thread may operate similarly to obtain the first dot product set. In another embodiment, the default may instead be that the first thread sends its product to the fourth thread, which performs the addition and accumulates the dot product result into its own product register.
As another example, for the dot product A[1][1]×B[1][1]+A[1][2]×B[2][1]+A[1][3]×B[3][1]+A[1][4]×B[4][1], the first thread may compute A[1][1]×B[1][1]+A[1][2]×B[2][1], the fourth thread in the second thread set may compute A[1][3]×B[3][1]+A[1][4]×B[4][1] in parallel with the first thread, and the first thread then performs the summation. For example, the fourth thread sends its partial dot product to the first thread, and the first thread performs the addition to obtain the dot product result. The first thread then accumulates the dot product result into the product register. In another embodiment, the default may instead be that the first thread sends its partial dot product to the fourth thread, which performs the addition and accumulates the dot product result into its own product register.
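The KB-style cooperation on a single product element can be sketched as a split of the accumulation dimension, with one thread summing the partial results (the even split and the choice of which thread combines are illustrative assumptions):

```python
import numpy as np

def kb_split_element(a_row, b_col, k_b):
    """KB-style merged mode sketch: k_b simulated threads split the dot
    product for one product element along the accumulation dimension;
    one thread then combines the partial sums."""
    assert len(a_row) % k_b == 0
    depth = len(a_row) // k_b
    partials = []
    for t in range(k_b):                       # each loop body = one thread
        s = slice(t * depth, (t + 1) * depth)
        partials.append(float(np.dot(a_row[s], b_col[s])))
    return sum(partials)                       # the combining thread adds them

a_row = np.array([1.0, 2.0, 3.0, 4.0])
b_col = np.array([5.0, 6.0, 7.0, 8.0])
# KB2: one thread computes a[0:2].b[0:2], the other a[2:4].b[2:4]
assert kb_split_element(a_row, b_col, 2) == float(np.dot(a_row, b_col))
```

Because addition is associative here, splitting the accumulation dimension among threads leaves the product element unchanged.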
In this case, based on the second merged computation mode indication and the first factor register representation, the first thread performs a dot product operation on the first factor set in the first row and the second factor set in the second tensor, to generate the first dot product set in the first row of the third tensor, and accumulates the first dot product set into the second set of registers of the first thread. Based on the second merged computation mode indication and the first factor register representation, the fourth thread performs a dot product operation on a fifth factor set in the first row and a sixth factor set of the second tensor, to generate a fourth dot product set in the first row of the third tensor, where the fifth factor set differs from the first factor set, the sixth factor set differs from the second factor set, and the fourth dot product set differs from the first dot product set. The first thread further accumulates the fourth dot product set into a third set of product registers corresponding to the first product register representation, which third set of product registers is located in the first thread. It can be understood that the second merged computation mode indication may be used in conjunction with the embodiment described above for Figure 8, so the aspects described for Figure 8 are not repeated here.
Furthermore, in some cases, for example when the number of threads is much greater than the number of rows of the product matrix, the first merge calculation indication can be used in combination with the second merge calculation indication. That is, not only can each row of the product matrix be divided into different parts computed by different thread groups, but each dot product element within a row can also be computed by different threads. For example, for C[1][1]-C[1][8], C[1][1]-C[1][4] may be computed by a first group of threads, while C[1][5]-C[1][8] may be computed by a second group of threads. Further, for each dot product element, for example C[1][1]=A[1][1]×B[1][1]+A[1][2]×B[2][1]+A[1][3]×B[3][1]+A[1][4]×B[4][1]+A[1][5]×B[5][1]+A[1][6]×B[6][1]+A[1][7]×B[7][1]+A[1][8]×B[8][1], the first thread in the first group of threads computes A[1][1]×B[1][1]+A[1][2]×B[2][1]+A[1][3]×B[3][1]+A[1][4]×B[4][1], while the second thread in the first group of threads computes A[1][5]×B[5][1]+A[1][6]×B[6][1]+A[1][7]×B[7][1]+A[1][8]×B[8][1], and so on.
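The two-level split can be sketched as follows, with hypothetical 8×8 data and 0-based indices: thread groups divide the output row into column ranges, and within a group each element's 8-term dot product is split into two 4-term partial sums computed by different threads.

```python
# Sketch: combining both merge modes. Columns 0-3 of C's first row go to
# a first thread group, columns 4-7 to a second group; within a group,
# each dot product is split into two partial sums ("threads").
k = 8
A_row = [float(i + 1) for i in range(k)]                           # A[1][*]
B = [[float((r + 1) * (c + 1)) for c in range(k)] for r in range(k)]

def thread_partial(col, lo, hi):
    """Partial sum over factor indices [lo, hi) for output column col."""
    return sum(A_row[i] * B[i][col] for i in range(lo, hi))

C_row = []
for col in range(k):
    p1 = thread_partial(col, 0, 4)    # one thread of the group
    p2 = thread_partial(col, 4, 8)    # another thread, in parallel
    C_row.append(p1 + p2)

# The split result matches the direct row-times-matrix computation.
reference = [sum(A_row[i] * B[i][c] for i in range(k)) for c in range(k)]
assert C_row == reference
```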
In the computation of matrix multiplication, the second factor matrix is usually processed column by column, each column undergoing a dot product operation with the row elements of the first factor matrix. However, in some cases the second factor matrix stored in a memory such as DDR is physically stored row by row. Therefore, when a thread reads an element of the second factor matrix, for example B[1][1], from memory, based on the principle of spatial locality it typically also reads some physically adjacent elements into the L1 cache at the same time; for example, B[1][2], B[1][3] and B[1][4] are read into the L1 cache together with B[1][1]. However, during matrix multiplication a thread may actually need elements from the same column, for example B[1][1] and B[2][1]. In that case, several more clock cycles must be spent reading B[2][1] from memory into the L1 cache, along with B[2][2], B[2][3] and B[2][4], which are not needed for the current computation. Under normal circumstances, B[1][2], B[1][3], B[1][4], B[2][2], B[2][3] and B[2][4] are typically discarded under the dynamic eviction rules of the L1 cache. In subsequent matrix computations, when B[1][2], B[1][3], B[1][4], B[2][2], B[2][3] or B[2][4] is needed, the thread reads the corresponding data from memory into the L1 cache again. As can be seen, such repeated reads greatly waste the time spent transferring data from memory to the L1 cache.
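The cost described above can be counted with a small model. The model is hypothetical: a row-major 4×4 matrix B, cache lines of 4 elements, and a worst-case L1 that evicts every line before it is reused.

```python
# Sketch: count cache-line fetches for column-wise access to a
# row-major matrix when fetched lines are not retained.
LINE = 4  # hypothetical cache-line size, in elements

def line_of(row, col):
    """Index of the cache line holding B[row][col] under row-major layout."""
    return (row * LINE + col) // LINE

# Reading one column touches a different cache line per element ...
column_fetches = len({line_of(r, 0) for r in range(4)})
# ... and with worst-case eviction, reading all 4 columns refetches
# every line once per column: 16 fetches for only 4 distinct lines.
worst_case_fetches = sum(
    len({line_of(r, c) for r in range(4)}) for c in range(4)
)
```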
In some embodiments of the present disclosure, for the case where the matrix elements of, for example, the second factor matrix B are stored row by row, a transpose indication is further provided in the tensor multiplication instruction. In one embodiment, the first tensor multiplication instruction further includes a transpose indication. A further illustrative example of the first tensor multiplication instruction is @p1,mm8.KA1.T1.sa.sb R0,ur4:rf290:0x00,R256, where T1 indicates that the second factor matrix B needs to be transposed. In other embodiments, when the tensor multiplication instruction does not include a transpose indication, it may be assumed by default that the second factor matrix B does not need to be transposed. In still other embodiments, T0 may be used in the tensor multiplication instruction to indicate that the second factor matrix B does not need to be transposed.
The first thread in the second thread set can therefore, based on the transpose indication and the first factor register representation, perform a dot product operation on the first factor set in the first row of the first tensor and the second factor set in the second tensor to generate the first dot product set in the first row of the third tensor. Specifically, the first thread set loads the factors of multiple rows of the second tensor into the cache based on the transpose indication and the memory logical address. For example, the first thread set may load B[1][1]-B[1][4], B[2][1]-B[2][4], B[3][1]-B[3][4] and B[4][1]-B[4][4] all into the L1 cache. The first thread set then selects factors from the multiple rows of factors column by column, for example selecting B[1][1], B[2][1], B[3][1] and B[4][1], to form the second factor set, and broadcasts it to the second thread set. The second thread set then performs a dot product operation on the first factor set in the first row and the second factor set based on the first factor register representation to generate the first dot product set in the first row of the third tensor. Note that because the transpose indication T1 is present, B[1][2]-B[1][4], B[2][2]-B[2][4], B[3][2]-B[3][4] and B[4][2]-B[4][4] are retained in the cache rather than being dynamically evicted. In this way, the first thread in the second thread set does not need to read B[1][2]-B[1][4], B[2][2]-B[2][4], B[3][2]-B[3][4] and B[4][2]-B[4][4] from memory again when performing subsequent matrix computations, which can save considerable time.
Although B[1][1]-B[1][4], B[2][1]-B[2][4], B[3][1]-B[3][4] and B[4][1]-B[4][4] are used here as an example to illustrate the transpose indication, it will be appreciated that this is merely illustrative. The range of the second factor matrix B that can be used for transposition may vary. For example, when the second factor matrix B has another number of rows, such as 256 rows, the cache lines of all rows may be loaded into the cache and released from the cache only after all the data in the cache lines has been used for matrix multiplication computations. In this way, the time required to repeatedly read data from memory into the L1 cache can be greatly reduced.
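Retention changes the count from the earlier worst case. The sketch below uses the same hypothetical model (row-major 4×4 matrix B, 4-element cache lines), but with the fetched lines retained until their data has been consumed, so each line is fetched from memory only once regardless of how many columns are read. Real eviction policies, line sizes and timings are hardware-specific.

```python
# Sketch: with the transpose indication, cache lines are retained, so
# consuming all 4 columns of a row-major 4x4 matrix fetches each of
# the 4 lines from memory exactly once.
LINE = 4
fetched_lines = set()
fetch_count = 0
for col in range(4):                     # columns consumed one after another
    for row in range(4):
        line = (row * LINE + col) // LINE
        if line not in fetched_lines:    # retained lines are never refetched
            fetched_lines.add(line)
            fetch_count += 1
```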
In the above, the principles and examples of matrix multiplication according to embodiments of the present disclosure are described mainly in the form of two-dimensional tensors. It will be appreciated, however, that the present disclosure is not limited to matrix multiplication in the form of two-dimensional tensors, but may include multiplication or convolution of one-dimensional or higher-dimensional tensors. A one-dimensional tensor is equivalent to a two-dimensional tensor in which one dimension is 1, so it is not described further here.
For matrix computations in three or more dimensions, the dimensions other than the k dimension in the first factor matrix A and the second factor matrix B can be decomposed to obtain equivalent two-dimensional matrices. The k dimension is usually not decomposed because, for matrix multiplication, the number of k columns in the first factor matrix A and the number of k rows in the second factor matrix B must be equal.
In one embodiment, assume the first factor tensor A is a three-dimensional tensor of m×x×k and the second factor tensor B is a four-dimensional tensor of k×n×y×z, where k, m, n, x, y and z are all positive integers. The first factor tensor A can be converted into a two-dimensional tensor of the form (m×x, k). That is, A is cut along the x dimension, and the resulting x two-dimensional tensors of size m×k are concatenated row by row to obtain the two-dimensional equivalent matrix A'. In this case, m×x threads can be used to compute in parallel. Similarly, the second factor tensor can be cut into y×z two-dimensional matrices of size k×n and concatenated column by column in sequence to obtain the two-dimensional equivalent matrix B'. It will be appreciated that although the multiplication (convolution) of a three-dimensional tensor and a four-dimensional tensor is used here as an example to illustrate matrix dimensionality reduction, this is merely illustrative and does not limit the scope of the present disclosure. Dimensionality reduction of other multi-dimensional mm operations can be handled similarly and is not described further here. For the mm after dimensionality reduction, refer to the detailed description of mm given above with respect to FIG. 8, which is likewise not repeated here.
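The cutting and concatenation above can be sketched with small hypothetical sizes. The element values and the ordering of the y×z slices during column-wise concatenation are illustrative assumptions, not prescribed by the text.

```python
# Sketch: A (m*x*k) is cut along x and stacked row-wise into A' of
# shape (m*x, k); B (k*n*y*z) is cut into y*z slices of shape k*n and
# concatenated column-wise into B' of shape (k, n*y*z).
m, x, k, n, y, z = 2, 3, 4, 2, 2, 2

# A[i][j][l]: nested lists standing in for an m*x*k tensor
A = [[[float(i + j + l) for l in range(k)] for j in range(x)] for i in range(m)]
# Row-wise stacking over the x dimension -> (m*x, k)
A2 = [A[i][j] for j in range(x) for i in range(m)]

# B[l][c][u][v]: nested lists standing in for a k*n*y*z tensor
B = [[[[float(l * c + u + v) for v in range(z)] for u in range(y)]
      for c in range(n)] for l in range(k)]
# Column-wise concatenation of the y*z slices -> (k, n*y*z)
B2 = [[B[l][c][u][v] for u in range(y) for v in range(z) for c in range(n)]
      for l in range(k)]

rows, cols = len(A2), len(B2[0])
assert rows == m * x and cols == n * y * z and len(B2) == k
```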
FIG. 9 shows a schematic block diagram of an electronic device 900 according to one embodiment of the present disclosure. The electronic device 900 can be used to perform the method 800 shown in FIG. 8, so the aspects described with respect to FIG. 8 may be selectively applicable to the electronic device 900. The electronic device 900 includes a receiving unit 902, a broadcast unit 903, a generation unit 904 and a storage unit 906.
The receiving unit 902 is configured to receive a first tensor multiplication instruction for a first thread set of the accelerator, the first tensor multiplication instruction including a first thread indication for the first thread set, a first factor register representation for a first tensor, a memory logical address for a second tensor, and a first product register representation for a third tensor. The broadcast unit 903 is configured to broadcast, by the first thread set, a second factor set in the second tensor to a second thread set based on the memory logical address for the second tensor, the second thread set being different from the first thread set. The generation unit 904 is configured to perform, by a first thread in the second thread set, a dot product operation on a first factor set in a first row of the first tensor and the second factor set based on the first factor register representation, to generate a first dot product set in a first row of the third tensor. The storage unit 906 is configured to accumulate, by the first thread, the first dot product set into a first group of product registers corresponding to the first product register representation. By decomposing the matrix and assigning threads by row, multiple threads can process multiple rows of the matrix tensor in parallel, thereby improving the processing efficiency of matrix multiplication. In addition, because programmers know the row-column structure of the matrix tensors and the thread status in the accelerator when programming, they can flexibly use threads to process matrix multiplication in parallel, thereby improving programming flexibility.
In one embodiment, each thread includes a first group of registers and a second group of registers, where the first group of registers is used to store at least a portion of the data in one row of the first factor matrix, and the second group of registers is used to store the data of one row of the product matrix. The data in a column of the second factor matrix may come from on-chip memory, the L1 cache, or off-chip memory. In this way, during the execution of the matrix multiplication, the execution unit of the first thread can read the data of one row of the first factor matrix from the first group of registers only once, and reuse it in subsequent dot product operations. In addition, the data in a column of the second factor matrix can be broadcast in parallel to the execution units in multiple threads (for example, the same number as, or half the number of, the rows of the first factor matrix) and reused. In this way, data transfers between different storage devices can be reduced, thereby reducing the time spent on data transfer during matrix multiplication computations.
In one embodiment, the generation unit 904 is further configured such that, in response to receiving the second factor set, a second thread in the second thread set performs a dot product operation on a third factor set in a second row of the first tensor and the second factor set based on the first factor register representation, to generate a second dot product set in a second row of the third tensor. The storage unit 906 is further configured to accumulate, by the second thread, the second dot product set into a second group of product registers corresponding to the first product register representation.
In one embodiment, the first tensor multiplication instruction further includes a first merge calculation mode indication. The generation unit 904 is further configured to: based on the first merge calculation mode indication and the first factor register representation, perform, by the first thread, a dot product operation on the first factor set in the first row and the second factor set, to generate the first dot product set in the first row of the third tensor.
In one embodiment, the generation unit 904 is further configured to perform, by a third thread in the first thread set, a dot product operation on the first factor set and a fourth factor set of the second tensor based on the first merge calculation mode indication and the first factor register representation, to generate a third dot product set in the first row of the third tensor, the fourth factor set being different from the second factor set and the third dot product set being different from the first dot product set. The storage unit 906 is further configured to accumulate, by the third thread, the third dot product set into a third group of product registers corresponding to the first product register representation.
In one embodiment, the first tensor multiplication instruction further includes a second merge calculation mode indication. The generation unit 904 is further configured to perform, by the first thread, a dot product operation on the first factor set in the first row and the second factor set in the second tensor based on the second merge calculation mode indication and the first factor register representation, to generate the first dot product set in the first row of the third tensor.
In one embodiment, the generation unit 904 is further configured to perform, by a fourth thread in the second thread set, a dot product operation on a fifth factor set and a sixth factor set of the second tensor based on the second merge calculation mode indication and the first factor register representation, to generate a fourth dot product set in the first row of the third tensor, the fifth factor set being different from the first factor set, the sixth factor set being different from the second factor set, and the fourth dot product set being different from the first dot product set. The storage unit 906 is further configured to accumulate, by the fourth thread, the fourth dot product set into the first group of product registers corresponding to the first product register representation.
In one embodiment, the first tensor multiplication instruction further includes a transpose indication. The generation unit 904 is further configured to: perform, by the first thread, a dot product operation on the first factor set in the first row and the second factor set in the second tensor based on the transpose indication and the first factor register representation, to generate the first dot product set in the first row of the third tensor.
In one embodiment, the generation unit 904 is further configured to: load factors of a plurality of rows in the second tensor into the cache based on the transpose indication and the memory logical address; select factors column by column from the factors of the plurality of rows to form the second factor set; and perform, by the first thread, a dot product operation on the first factor set in the first row and the second factor set based on the first factor register representation, to generate the first dot product set in the first row of the third tensor. In one embodiment, the unselected factors in the plurality of rows are retained in the L1 cache until those unselected factors are selected for the matrix multiplication computation.
In one embodiment, the first thread set provides the second factor set corresponding to the memory logical address in parallel, in the form of a broadcast, to all threads in the second thread set.
In one embodiment, the memory logical address includes segment base data and offset data, the segment base data representing a starting address of the second tensor and the offset data representing offsets of the second tensor in each of a plurality of dimensions.
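A segment-base-plus-offsets address of this kind can be sketched as follows. The row-major stride layout, the function name and the example numbers are all illustrative assumptions; the actual encoding of the memory logical address is device-specific.

```python
# Sketch: a logical address formed from segment base data plus
# per-dimension offsets, flattened against row-major strides
# (hypothetical layout).
def logical_address(segment_base, dims, offsets):
    """Combine per-dimension offsets with row-major strides."""
    assert len(dims) == len(offsets)
    addr = segment_base
    stride = 1
    for size, off in zip(reversed(dims), reversed(offsets)):
        addr += off * stride
        stride *= size
    return addr

# Hypothetical 4x4 tensor starting at 0x1000, element at row 2, col 1.
addr = logical_address(0x1000, dims=[4, 4], offsets=[2, 1])
```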
Furthermore, although the operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, in order to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are included in the discussion above, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations, separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims.

Claims (30)

  1. A method performed by an accelerator, comprising:
    receiving a first tensor multiplication instruction for a first thread set of the accelerator, the first tensor multiplication instruction comprising a first thread indication for the first thread set, a first factor register representation for a first tensor, a memory logical address for a second tensor, and a first product register representation for a third tensor;
    broadcasting, by the first thread set, a second factor set in the second tensor to a second thread set based on the memory logical address for the second tensor, the second thread set being different from the first thread set;
    performing, by a first thread in the second thread set, a dot product operation on a first factor set in a first row of the first tensor and the second factor set based on the first factor register representation, to generate a first dot product set in a first row of the third tensor; and
    accumulating, by the first thread, the first dot product set into a first group of product registers corresponding to the first product register representation.
  2. The method of claim 1, further comprising:
    in response to receiving the second factor set, performing, by a second thread in the second thread set, a dot product operation on a third factor set in a second row of the first tensor and the second factor set based on the first factor register representation, to generate a second dot product set in a second row of the third tensor; and
    accumulating, by the second thread, the second dot product set into a second group of product registers corresponding to the first product register representation.
    providing an execution condition for each thread in the first thread set, wherein for a thread that does not satisfy the execution condition, its memory access operation is treated as exceeding the tensor address range and is ignored.
  3. The method of claim 1, wherein the first tensor multiplication instruction further comprises a first merge calculation mode indication; and
    generating the first dot product set in the first row of the third tensor comprises:
    performing, by the first thread, a dot product operation on the first factor set in the first row and the second factor set based on the first merge calculation mode indication and the first factor register representation, to generate the first dot product set in the first row of the third tensor.
  4. The method of claim 3, further comprising:
    performing, by a third thread in the first thread set, a dot product operation on the first factor set and a fourth factor set of the second tensor based on the first merge calculation mode indication and the first factor register representation, to generate a third dot product set in the first row of the third tensor, the fourth factor set being different from the second factor set and the third dot product set being different from the first dot product set; and
    accumulating, by the third thread, the third dot product set into a third group of product registers corresponding to the first product register representation.
  5. The method of claim 1, wherein the first tensor multiplication instruction further comprises a transpose indication; and
    generating the first dot product set in the first row of the third tensor comprises:
    performing, by the first thread, a dot product operation on the first factor set in the first row and the second factor set in the second tensor based on the transpose indication and the first factor register representation, to generate the first dot product set in the first row of the third tensor.
  6. The method of claim 5, wherein performing, by the first thread, the dot product operation on the first factor set in the first row and the second factor set in the second tensor based on the transpose indication and the first factor register representation to generate the first dot product set in the first row of the third tensor comprises:
    loading factors of a plurality of rows in the second tensor into a cache based on the transpose indication and the memory logical address;
    selecting factors column by column from the factors of the plurality of rows to form the second factor set; and
    performing, by the first thread, a dot product operation on the first factor set in the first row and the second factor set based on the first factor register representation, to generate the first dot product set in the first row of the third tensor.
  7. The method of any one of claims 1-6, wherein the first thread set provides the second factor set corresponding to the memory logical address in parallel, in the form of a broadcast, to computation units in all threads of the second thread set, without providing it to registers in those threads.
  8. The method of claim 7, wherein the memory logical address comprises segment base data and offset data, the segment base data representing a starting address of the second tensor and the offset data representing offsets of the second tensor in each of a plurality of dimensions.
  9. The method of claim 1 or 3, wherein the first tensor multiplication instruction further comprises a second merge calculation mode indication; and
    generating the first dot product set in the first row of the third tensor comprises:
    performing, by the first thread, a dot product operation on the first factor set in the first row and the second factor set in the second tensor based on the second merge calculation mode indication and the first factor register representation, to generate the first dot product set in the first row of the third tensor.
  10. The method of claim 9, further comprising:
    performing, by a fourth thread in the second thread set, a dot product operation on a fifth factor set and a sixth factor set of the second tensor based on the second merge calculation mode indication and the first factor register representation, to generate a fourth dot product set in the first row of the third tensor, the fifth factor set being different from the first factor set, the sixth factor set being different from the second factor set, and the fourth dot product set being different from the first dot product set; and
    accumulating, by the fourth thread, the fourth dot product set into the first group of product registers corresponding to the first product register representation.
  11. The method of claim 1, wherein
    the first product register representation corresponds to one or more product registers, the number of the one or more product registers being related to the merge calculation mode and to the number of columns of the second tensor, the product registers of different threads constituting a result tensor, and the product registers of each thread comprising part or all of a row of the result tensor; and
    the number of rows of the result tensor is the same as the number of rows of the first tensor, and the number of columns of the result tensor is the same as the number of columns of the second tensor.
  12. The method of claim 11, wherein
    the number of product registers within the threads of the second thread set is variable, the number of product registers depending on an execution condition of the first tensor multiplication instruction, the execution condition determining access to columns in the second tensor; and
    if a first column in the second tensor is not accessed, the first column in the second tensor does not participate in the matrix multiplication computation.
  13. The method of claim 1, wherein
    the first tensor multiplication instruction is issued multiple times during one complete execution, the first tensor multiplication instruction being issued the first time as a memory instruction in order to fetch the column data or row data of the second tensor; and
    in response to the column data or row data of the second tensor having been fetched and the data of the first tensor having been stored in the first factor register, the first tensor multiplication instruction is issued a second or further times as a mathematical computation instruction in order to compute the results of each column within a row of the third tensor.
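The two-phase issue of claim 13 can be modeled as a loop in which each iteration first acts as the memory issue (fetching one column of B) and then as the math re-issue (the dot product against the factor registers). The data layout is an illustrative assumption:

```python
# Sketch of two-phase issue: fetch a column of B as a memory instruction, then
# re-issue as a math instruction against the A row held in the factor registers.
def execute_tensor_mul(a_row, B_memory, n_cols):
    product_regs = [0] * n_cols
    for c in range(n_cols):
        # First issue: as a memory instruction -- fetch column c of B.
        b_col = [row[c] for row in B_memory]
        # Re-issue: as a math instruction -- dot product with factor registers.
        product_regs[c] += sum(a * b for a, b in zip(a_row, b_col))
    return product_regs

B = [[1, 2], [3, 4]]                 # columns fetched one re-issue at a time
regs = execute_tensor_mul([5, 6], B, 2)
assert regs == [5 * 1 + 6 * 3, 5 * 2 + 6 * 4]   # [23, 34]
```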
  14. The method of claim 13, wherein
    before the second or further issues, the corresponding token status of the first factor register is checked; and
    if the token status indicates that the data of the first tensor has been stored in the first factor register, the instruction is issued as a mathematical computation instruction; otherwise the issue queue is blocked until the data of the first tensor has been stored in the first factor register.
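The token gating of claim 14 amounts to a stall condition on the issue queue. A minimal sketch, with the queue and token modeled as plain Python values (an assumption for illustration):

```python
# Sketch: the math re-issue proceeds only once a token marks the factor
# registers as filled; otherwise the issue queue stalls.
from collections import deque

def run_issue_queue(queue, factor_reg):
    issued, stalled = [], False
    while queue:
        op = queue[0]
        if op == "math" and not factor_reg["token"]:
            stalled = True           # block until the A data lands
            break
        issued.append(queue.popleft())
        if op == "load_a":
            factor_reg["token"] = True   # load completes, token is set
    return issued, stalled

reg = {"token": False}
issued, stalled = run_issue_queue(deque(["math"]), reg)
assert stalled and issued == []      # math blocked: token not yet set

reg2 = {"token": False}
issued2, stalled2 = run_issue_queue(deque(["load_a", "math", "math"]), reg2)
assert not stalled2 and issued2 == ["load_a", "math", "math"]
```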
  15. The method of claim 11, further comprising:
    determining, based on the first product register representation, whether the product register usage range for the third tensor exceeds the range of the register file within a single thread; and
    if it is determined that the product register usage range for the third tensor exceeds the range of the register file within a single thread, ignoring the computation operations or memory-access operations that exceed the register file range and reporting an error.
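Claim 15's out-of-bounds check reduces to a range comparison against the per-thread register file. The register file size and function name below are illustrative assumptions:

```python
# Sketch: operations that would touch product registers past the per-thread
# register file are dropped and an error is reported.
REGFILE_SIZE = 8                     # illustrative per-thread register file size

def check_product_range(first_reg, n_regs, regfile_size=REGFILE_SIZE):
    if first_reg + n_regs > regfile_size:
        return False, "error: product registers exceed thread register file"
    return True, None

ok, err = check_product_range(first_reg=2, n_regs=4)
assert ok and err is None            # fits: registers 2..5 of 8

ok2, err2 = check_product_range(first_reg=6, n_regs=4)
assert not ok2 and "exceed" in err2  # registers 6..9 overrun the file
```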
  16. An electronic device, comprising:
    a stream processor;
    a page table apparatus coupled to the stream processor;
    a memory; and
    a processing engine unit coupled to the stream processor, the memory, and the page table apparatus, and configured to perform the method of any one of claims 1-15.
  17. An accelerator, comprising:
    a receiving unit configured to receive a first tensor multiplication instruction for a first thread set of the accelerator, the first tensor multiplication instruction comprising a first thread indication for the first thread set, a first factor register representation for a first tensor, a memory logical address for a second tensor, and a first product register representation for a third tensor;
    a broadcast unit configured to broadcast, by the first thread set and based on the memory logical address for the second tensor, a second factor set in the second tensor to a second thread set, the second thread set being different from the first thread set;
    a generation unit configured to perform, by a first thread in the second thread set and based on the first factor register representation, a dot-product operation on a first factor set in a first row of the first tensor and the second factor set to generate a first dot-product set in a first row of the third tensor; and
    a storage unit configured to accumulate, by the first thread, the first dot-product set into a first group of product registers corresponding to the first product register representation.
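The four units of claim 17 (receiving, broadcast, generation, storage) can be sketched end to end as plain functions over Python lists. Thread counts, the memory layout, and all names here are illustrative assumptions, not the claimed hardware:

```python
# End-to-end sketch: receive the instruction operands, broadcast one factor set
# of B per step, generate per-thread partial products, accumulate into each
# thread's product registers.
def tensor_mul(A, B_memory, b_addr):
    # Receiving unit: the "instruction" carries A's registers, B's logical
    # address, and the product registers to accumulate into.
    B = B_memory[b_addr]
    n_rows, n_cols = len(A), len(B[0])
    product_regs = [[0] * n_cols for _ in range(n_rows)]
    for k in range(len(B)):
        # Broadcast unit: row k of B goes to every thread of the second set.
        b_factors = B[k]
        for r in range(n_rows):      # each thread of the second thread set
            # Generation unit: products of the thread's A factor with B's row.
            partial = [A[r][k] * b for b in b_factors]
            # Storage unit: accumulate into this thread's product registers.
            for c in range(n_cols):
                product_regs[r][c] += partial[c]
    return product_regs

mem = {0x100: [[1, 0], [0, 1]]}      # B = 2x2 identity at logical address 0x100
C = tensor_mul([[2, 3], [4, 5]], mem, 0x100)
assert C == [[2, 3], [4, 5]]         # A times identity is A
```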
  18. The accelerator of claim 17, wherein
    the generation unit is further configured to: in response to receiving the second factor set, perform, by a second thread in the second thread set and based on the first factor register representation, a dot-product operation on a third factor set in a second row of the first tensor and the second factor set to generate a second dot-product set in a second row of the third tensor; and
    the storage unit is further configured to accumulate, by the second thread, the second dot-product set into a second group of product registers corresponding to the first product register representation.
  19. The accelerator of claim 18, wherein the first tensor multiplication instruction further comprises a first merged-computation mode indication; and
    the generation unit is further configured to:
    perform, by the first thread and based on the first merged-computation mode indication and the first factor register representation, a dot-product operation on the first factor set in the first row and the second factor set to generate the first dot-product set in the first row of the third tensor.
  20. The accelerator of claim 19, wherein
    the generation unit is further configured to: perform, by a third thread in the first thread set and based on the first merged-computation mode indication and the first factor register representation, a dot-product operation on the first factor set and a fourth factor set of the second tensor to generate a third dot-product set in the first row of the third tensor, the fourth factor set being different from the second factor set, and the third dot-product set being different from the first dot-product set; and
    the storage unit is further configured to accumulate, by the third thread, the third dot-product set into a third group of product registers corresponding to the first product register representation.
  21. The accelerator of claim 17, wherein the first tensor multiplication instruction further comprises a transpose indication; and
    the generation unit is further configured to:
    perform, by the first thread and based on the transpose indication and the first factor register representation, a dot-product operation on the first factor set in the first row and the second factor set in the second tensor to generate the first dot-product set in the first row of the third tensor.
  22. The accelerator of claim 21, wherein the generation unit is further configured to:
    load factors of a plurality of rows of the second tensor into a cache based on the transpose indication and the memory logical address;
    select factors column-wise from the factors of the plurality of rows to form the second factor set; and
    perform, by the first thread and based on the first factor register representation, a dot-product operation on the first factor set in the first row and the second factor set to generate the first dot-product set in the first row of the third tensor.
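The transpose path of claim 22 stages rows of B in a cache and then picks factors column-wise, which makes the subsequent dot products run against B transposed. A minimal sketch (the cache is modeled as a plain list, an assumption for illustration):

```python
# Sketch: load rows of B into a cache, then select factors column-by-column
# so each broadcast factor set is a column of B (a row of B transposed).
def transposed_factor_sets(B_rows):
    cache = [list(row) for row in B_rows]      # stage the rows in the cache
    n_cols = len(cache[0])
    return [[cache[r][c] for r in range(len(cache))] for c in range(n_cols)]

B = [[1, 2, 3],
     [4, 5, 6]]
sets = transposed_factor_sets(B)
assert sets == [[1, 4], [2, 5], [3, 6]]        # columns of B

a_row = [10, 20]                                # first thread's factor set
dots = [sum(a * b for a, b in zip(a_row, s)) for s in sets]
assert dots == [90, 120, 150]                   # one row of A times B
```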
  23. The accelerator of claim 17 or 19, wherein the first tensor multiplication instruction further comprises a second merged-computation mode indication; and
    the generation unit is further configured to:
    perform, by the first thread and based on the second merged-computation mode indication and the first factor register representation, a dot-product operation on the first factor set in the first row and the second factor set in the second tensor to generate the first dot-product set in the first row of the third tensor.
  24. The accelerator of claim 23, wherein the generation unit is further configured to:
    perform, by a fourth thread in the second thread set and based on the second merged-computation mode indication and the first factor register representation, a dot-product operation on a fifth factor set and a sixth factor set of the second tensor to generate a fourth dot-product set in the first row of the third tensor, the fifth factor set being different from the first factor set, the sixth factor set being different from the second factor set, and the fourth dot-product set being different from the first dot-product set; and
    accumulate, by the fourth thread, the fourth dot-product set into the first group of product registers corresponding to the first product register representation.
  25. The accelerator of claim 17, wherein
    the first product register representation corresponds to one or more product registers, the number of the one or more product registers being related to the merged-computation mode and to the number of columns of the second tensor; the product registers of different threads constitute a result tensor, and the product registers of each thread comprise part or all of each row of the result tensor; and
    the number of rows of the result tensor is the same as the number of rows of the first tensor, and the number of columns of the result tensor is the same as the number of columns of the second tensor.
  26. The accelerator of claim 25, wherein
    the number of product registers within the threads of the second thread set is variable, the number of product registers depending on an execution condition of the first tensor multiplication instruction, the execution condition determining access to the columns of the second tensor; and
    if a first column of the second tensor is not accessed, the first column of the second tensor does not participate in the matrix multiplication computation.
  27. The accelerator of claim 17, wherein
    the first tensor multiplication instruction is issued multiple times during one complete execution, the first tensor multiplication instruction being issued the first time as a memory instruction in order to fetch the column data or row data of the second tensor; and
    in response to the column data or row data of the second tensor having been fetched and the data of the first tensor having been stored in the first factor register, the first tensor multiplication instruction is issued a second or further times as a mathematical computation instruction in order to compute the results of each column within a row of the third tensor.
  28. The accelerator of claim 27, further comprising a checking unit configured to check, before the second or further issues, the corresponding token status of the first factor register; wherein
    if the token status indicates that the data of the first tensor has been stored in the first factor register, the instruction is issued as a mathematical computation instruction; otherwise the issue queue is blocked until the data of the first tensor has been stored in the first factor register.
  29. The accelerator of claim 25, further comprising an out-of-bounds checking unit configured to determine, based on the first product register representation, whether the product register usage range for the third tensor exceeds the range of the register file within a single thread; wherein
    if it is determined that the product register usage range for the third tensor exceeds the range of the register file within a single thread, the computation operations or memory-access operations that exceed the register file range are ignored and an error is reported.
  30. The accelerator of any one of claims 17-22, wherein the first thread set provides the second factor set corresponding to the memory logical address, in broadcast form and in parallel, to the compute units in all threads of the second thread set, without providing it to the registers in those threads.
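The operand forwarding of claim 30 (broadcast data feeds the compute units directly and is never written into per-thread registers) can be sketched as follows. The `Thread` class and its register layout are illustrative assumptions:

```python
# Sketch: the broadcast B factors are consumed by each thread's compute unit
# as a forwarded operand; only A factors and the accumulator live in registers.
class Thread:
    def __init__(self, a_factors):
        self.regs = {"a": list(a_factors), "acc": 0}   # no register for B
    def fma(self, b_factors):
        # Compute unit consumes the broadcast operand directly.
        self.regs["acc"] += sum(a * b for a, b in zip(self.regs["a"], b_factors))

threads = [Thread([1, 2]), Thread([3, 4])]
broadcast = [10, 100]                # forwarded operand, shared by all threads
for t in threads:                    # parallel broadcast, modeled as a loop
    t.fma(broadcast)

assert [t.regs["acc"] for t in threads] == [210, 430]
assert all("b" not in t.regs for t in threads)   # B never stored in registers
```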
PCT/CN2022/107061 2022-03-14 2022-07-21 Method executed by accelerator, and electronic device WO2023173639A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210247720.2A CN114579929B (en) 2022-03-14 2022-03-14 Accelerator execution method and electronic equipment
CN202210247720.2 2022-03-14

Publications (1)

Publication Number Publication Date
WO2023173639A1 true WO2023173639A1 (en) 2023-09-21

Family

ID=81780810

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/107061 WO2023173639A1 (en) 2022-03-14 2022-07-21 Method executed by accelerator, and electronic device

Country Status (2)

Country Link
CN (1) CN114579929B (en)
WO (1) WO2023173639A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114579929B (en) * 2022-03-14 2023-08-08 海飞科(南京)信息技术有限公司 Accelerator execution method and electronic equipment
WO2024065860A1 (en) * 2022-10-01 2024-04-04 Intel Corporation Hardware support for n-dimensional matrix load and store instructions
TWI814618B (en) * 2022-10-20 2023-09-01 創鑫智慧股份有限公司 Matrix computing device and operation method thereof
CN116109468B (en) * 2023-04-04 2023-07-21 南京砺算科技有限公司 Graphics processing unit, instruction compiling method, storage medium, and terminal device
CN118520210A (en) * 2024-07-23 2024-08-20 北京壁仞科技开发有限公司 Data processing method, processor, electronic device, and storage medium

Citations (6)

Publication number Priority date Publication date Assignee Title
US20070271325A1 (en) * 2006-05-08 2007-11-22 Nvidia Corporation Matrix multiply with reduced bandwidth requirements
CN111353126A (en) * 2018-12-20 2020-06-30 卡雷公司 Block matrix multiplication system
CN111381939A (en) * 2018-12-31 2020-07-07 图核有限公司 Register file in a multithreaded processor
CN111814983A (en) * 2020-03-04 2020-10-23 深圳芯英科技有限公司 Data processing method, device, chip and computer readable storage medium
CN113836049A (en) * 2021-09-17 2021-12-24 海飞科(南京)信息技术有限公司 Memory access method and electronic device
CN114579929A (en) * 2022-03-14 2022-06-03 海飞科(南京)信息技术有限公司 Accelerator execution method and electronic device

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
US11086968B1 (en) * 2017-06-05 2021-08-10 Reservoir Labs, Inc. Systems and methods for memory efficient parallel tensor decompositions
JP2019148969A (en) * 2018-02-27 2019-09-05 富士通株式会社 Matrix arithmetic device, matrix arithmetic method, and matrix arithmetic program
US10776110B2 (en) * 2018-09-29 2020-09-15 Intel Corporation Apparatus and method for adaptable and efficient lane-wise tensor processing
CN112559163B (en) * 2019-09-10 2023-05-23 华为技术有限公司 Method and device for optimizing tensor calculation performance
CN114090956B (en) * 2021-11-18 2024-05-10 深圳市比昂芯科技有限公司 Matrix data processing method, device, equipment and storage medium

Patent Citations (6)

Publication number Priority date Publication date Assignee Title
US20070271325A1 (en) * 2006-05-08 2007-11-22 Nvidia Corporation Matrix multiply with reduced bandwidth requirements
CN111353126A (en) * 2018-12-20 2020-06-30 卡雷公司 Block matrix multiplication system
CN111381939A (en) * 2018-12-31 2020-07-07 图核有限公司 Register file in a multithreaded processor
CN111814983A (en) * 2020-03-04 2020-10-23 深圳芯英科技有限公司 Data processing method, device, chip and computer readable storage medium
CN113836049A (en) * 2021-09-17 2021-12-24 海飞科(南京)信息技术有限公司 Memory access method and electronic device
CN114579929A (en) * 2022-03-14 2022-06-03 海飞科(南京)信息技术有限公司 Accelerator execution method and electronic device

Also Published As

Publication number Publication date
CN114579929A (en) 2022-06-03
CN114579929B (en) 2023-08-08

Similar Documents

Publication Publication Date Title
WO2023173639A1 (en) Method executed by accelerator, and electronic device
JP7374236B2 (en) accelerated math engine
US8707320B2 (en) Dynamic partitioning of data by occasionally doubling data chunk size for data-parallel applications
US20130243329A1 (en) Parallel object detection method for heterogeneous multithreaded microarchitectures
CN112162854A (en) Method, system and medium for scheduling calculation tasks between CPU-GPU
CN112711478A (en) Task processing method, device, server and storage medium based on neural network
US20220043770A1 (en) Neural network processor, chip and electronic device
CN113836049A (en) Memory access method and electronic device
CN110991619A (en) Neural network processor, chip and electronic equipment
CN111091181B (en) Convolution processing unit, neural network processor, electronic device and convolution operation method
WO2023142403A1 (en) Method for determining out-of-bounds state of tensor element, and electronic apparatus
CN114610394B (en) Instruction scheduling method, processing circuit and electronic equipment
US6785743B1 (en) Template data transfer coprocessor
CN111047035B (en) Neural network processor, chip and electronic equipment
CN114218152B (en) Stream processing method, processing circuit and electronic equipment
CN113961506B (en) Accelerator and electronic device
CN117271136A (en) Data processing method, device, equipment and storage medium
CN113010173A (en) Method for matrix data broadcasting in parallel processing
US11609785B2 (en) Matrix data broadcast architecture
US20240220314A1 (en) Data dependency-aware scheduling
US20240220315A1 (en) Dynamic control of work scheduling
US20240248753A1 (en) Locating data in storage
US20220413858A1 (en) Processing device and method of using a register cache
US20240095541A1 (en) Compiling of tasks for streaming operations at neural processor
US20230195651A1 (en) Host device performing near data processing function and accelerator system including the same

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22931661

Country of ref document: EP

Kind code of ref document: A1