Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures. Meanwhile, relational terms such as "first," "second," and the like may be used solely in the description herein to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
Further, the term "and/or" in the present application merely describes an association relationship between associated objects and means that three kinds of relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone.
Typical compute-intensive applications suffer from heavy memory access and therefore high power consumption. In matrix multiplication, for example, a source operand A, a source operand B, and a source operand C must be repeatedly read from the vector general-purpose register (VGPR), A*B+C is calculated, and the result is written back into the VGPR. For example, when instruction0 (instruction 0) is executed, source operand A, source operand B, and source operand C (0 in this case) are read from the VGPR, and when the calculation completes, the result must be written into the VGPR; when instruction1 is executed, source operand A, source operand B, and source operand C (now the result of the previous calculation, C1 = A0*B0+C0) are read from the VGPR, and when the calculation completes, the result must again be written into the VGPR. These steps repeat until the matrix calculation is complete. The embodiments of the present application therefore provide a new instruction that converts VGPR accesses into data pass-through: the destination data (calculation result) of the ith instruction is used as source data of the (i+j)th instruction, so that the hardware skips both the VGPR write of the destination data of the ith instruction and the VGPR read of the corresponding source operand of the (i+j)th instruction. Here i and j are both positive integers, and j does not exceed the maximum number of data pass-through stages supported by the hardware; for example, if the hardware supports at most 2 stages of data pass-through, the maximum value of j is 2.
The steps of a method for generating an instruction according to an embodiment of the present application are described below with reference to fig. 1.
Step S101: it is determined that the instruction execution unit supports data pass-through.
In the embodiment of the present application, before generating instructions, a compiler (program software) detects whether the instruction execution unit (Instruction Execution) supports data pass-through, and performs step S102 when it determines that the instruction execution unit (hardware) does. Most current hardware supports 1 or 2 levels of implicit data pass-through: when the hardware executes instruction1 or instruction2 and detects that the source data is available on the pass-through path (forwarding path), it may skip the VGPR read and take the source data directly from that path. However, this approach still incurs a large number of write operations for instruction0 (the earlier instruction), and forwarding can also fail; that is, even if the hardware supports implicit data pass-through, the source data cannot always be obtained from the pass-through path.
Step S102: when an instruction is generated, if the destination data of an ith instruction is required to be used as a source operand of an (i + j) th instruction, a first identifier used for indicating that the destination data of the ith instruction is written into a through path is set in the ith instruction, and a second identifier used for indicating that the source operand is obtained from the through path is set in the (i + j) th instruction.
That is, once it is determined that the hardware supports data pass-through, if the destination data of the ith instruction needs to be used as a source operand of the (i+j)th instruction, then when the instructions are generated, a first identifier indicating that the destination data of the ith instruction is to be written into the pass-through path is set in the ith instruction, and a second identifier indicating that the source operand is to be obtained from the pass-through path is set in the (i+j)th instruction.
Here i and j are both positive integers, and j does not exceed the maximum number of data pass-through stages supported by the hardware; for example, if the hardware supports at most 2 stages of data pass-through, the maximum value of j is 2. When j = 1, that is, when the destination data of an instruction is to be used as a source operand of the immediately following instruction, the first identifier is set in the ith instruction when it is generated, and the second identifier is set in the (i+1)th instruction when it is generated. When j = 2, that is, when the destination data of an instruction is to be used as a source operand of the instruction after the next one, the first identifier is set in the ith instruction when it is generated, and the second identifier is set in the (i+2)th instruction when it is generated.
For ease of understanding, the present application is described using the generation of a VOP3 (Vector Operation with 3 operands) instruction as an example, as shown in fig. 2. The type field is set to "110010"; that is, 110010 indicates that the instruction is a VOP3 instruction. The meaning of each field in the VOP3 instruction is shown in table 1.
TABLE 1
Note that the number of bits (bit width) of each field in table 1 is relatively fixed, but its position may change. For example, Operand0_ID0 need not occupy bits [40:32]; it may instead occupy another range of the same width, such as [8:0]. The same applies to the other fields.
The Operand0_ID0 field, the Operand1_ID1 field, and the Operand2_ID2 field indicate the source of each operand, that is, where the source operand is to be obtained. For operand Operand0, if VF = 125 (the second identifier) in this field, Operand0 comes from the pass-through path; otherwise Operand0 is obtained from the position pointed to by Operand0_ID0. For operand Operand1, if VF = 125 (the second identifier) in this field, Operand1 comes from the pass-through path; otherwise Operand1 is obtained from the position pointed to by Operand1_ID1. For operand Operand2, if VF = 125 (the second identifier) in this field, Operand2 comes from the pass-through path; otherwise Operand2 is obtained from the position pointed to by Operand2_ID2. The Result_ID field and the DF field indicate the write path of the destination data: if DF = 1 (the first identifier), the destination data is written directly to the pass-through path; otherwise it is written to the location pointed to by Result_ID. Note that the VF value indicating that an operand comes from the pass-through path is not limited to 125, and similarly the value indicating that the destination data is written into the pass-through path is not limited to 1.
As described in table 1, if the destination data of the ith instruction needs to be used as a source operand of the (i+j)th instruction, the value of the destination pass-through (DF) field of the ith instruction is set to 1, and the vector pass-through (VF) value in the address field of the (i+j)th instruction that indicates the source of that operand is set to 125. That is, setting the VF value in the Operand0_ID0 field to 125 indicates that Operand0 comes from the pass-through path; setting the VF value in the Operand1_ID1 field to 125 indicates that Operand1 comes from the pass-through path; and setting the VF value in the Operand2_ID2 field to 125 indicates that Operand2 comes from the pass-through path. In the embodiment of the present application, by setting in the instructions a first identifier (DF = 1) indicating that destination data is written into the pass-through path and a second identifier (VF = 125) indicating that a source operand comes from the pass-through path, software forces the hardware to perform explicit data pass-through, so that forwarding is guaranteed to take place.
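The marking performed in step S102 can be illustrated with a minimal sketch, written here in Python purely for readability. The `Instr` class, the `mark_forwarding` function, and the constant `MAX_FORWARD_STAGES` are hypothetical names introduced for this illustration and are not part of the application; the values 125 and 1 are the example values used in the text. The sketch also assumes that each forwarded result is needed only by the one nearby consumer that receives it.

```python
from dataclasses import dataclass, field
from typing import List

VF_FORWARD = 125        # example value of the second identifier (VF) from the text
DF_FORWARD = 1          # example value of the first identifier (DF) from the text
MAX_FORWARD_STAGES = 2  # assumed maximum number of pass-through stages (max j)


@dataclass
class Instr:
    """Hypothetical, simplified instruction record used only in this sketch."""
    dst: str                    # name of the destination value
    srcs: List[str]             # names of the source operands
    df: int = 0                 # 1 -> result goes to the pass-through path
    operand_ids: List[object] = field(default_factory=list)


def mark_forwarding(program: List[Instr], supports_forwarding: bool = True) -> List[Instr]:
    """Set DF on producers and VF on consumers that are at most j instructions apart."""
    for ins in program:
        ins.df = 0
        ins.operand_ids = list(ins.srcs)   # default: read every operand from the VGPR
    if not supports_forwarding:            # step S101 failed: leave the program unmarked
        return program
    for i, producer in enumerate(program):
        for j in range(1, MAX_FORWARD_STAGES + 1):
            if i + j >= len(program):
                break
            consumer = program[i + j]
            if producer.dst in consumer.srcs:
                producer.df = DF_FORWARD                   # first identifier
                for k, src in enumerate(consumer.srcs):
                    if src == producer.dst:
                        consumer.operand_ids[k] = VF_FORWARD   # second identifier
                break
            if consumer.dst == producer.dst:
                break   # value is overwritten before it is used; nothing to forward
    return program


# Usage: a three-instruction multiply-accumulate chain on one accumulator.
chain = mark_forwarding([
    Instr("acc", ["a0", "b0"]),
    Instr("acc", ["a1", "b1", "acc"]),
    Instr("acc", ["a2", "b2", "acc"]),
])
```

In this chain the first two instructions receive DF = 1, the accumulator operand of the second and third instructions is replaced by VF = 125, and the last instruction keeps DF = 0 so that the final result still reaches the VGPR.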
Step S103: sending the ith instruction and the (i + j) th instruction to the instruction execution unit, so that the instruction execution unit writes the result of the ith instruction into a through path according to the first identifier when executing the ith instruction, and acquires a required source operand from the through path according to the second identifier when executing the (i + j) th instruction.
After the ith instruction and the (i+j)th instruction are generated, they are sent to the instruction execution unit for execution. When executing the ith instruction, the instruction execution unit writes the result of the ith instruction into the pass-through path according to the first identifier; when executing the (i+j)th instruction, it obtains the required source operand from the pass-through path according to the second identifier. For example, if DF in the ith instruction is equal to 1, the result of the ith instruction is written directly into the pass-through path, and if VF in the Operand0_ID0 field of the (i+j)th instruction is equal to 125, the source operand Operand0 is obtained directly from the pass-through path.
An instruction that implements another function must not be inserted partway through a sequence of instructions that together implement one function. Therefore, to ensure that the pass-through scheme provided by the present application is carried out correctly when instructions implementing different functions coexist, optionally, when the ith instruction and the (i+j)th instruction are sent to the instruction execution unit, they are spliced in generation order to obtain an instruction block (that is, an instruction group or instruction set), and the instruction block is sent to a decoder (hardware). The decoder sequentially obtains first key information in the ith instruction from the instruction block and sends it to the instruction execution unit, so that the instruction execution unit writes the execution result of the ith instruction into the pass-through path according to the first identifier in the first key information; the decoder then obtains second key information in the (i+j)th instruction and sends it to the instruction execution unit, so that the instruction execution unit obtains the required source operand from the pass-through path according to the second identifier in the second key information. The first key information includes the first identifier, and the second key information includes the second identifier. As a result, execution can switch to another instruction block only after all instructions in the current instruction block have been executed. The instruction group includes a group header and a group body. The group header defines how many resources the instruction group uses and how many instructions it contains; the group body contains all the instructions of the instruction group.
To prevent switching to another instruction block while the current instruction block is executing, a lock can be added that locks the arbitration logic (Arbitration) while the hardware runs an instruction block. In each cycle, the decoder reads one instruction from the instruction block for execution. When an instruction with "BS = 1" (indicating the start of an instruction block) is encountered, the lock logic is enabled and the arbitration logic stays on the current "wave_id"; that is, the arbitration logic can select instructions only from the current instruction block. When an instruction with "BE = 1" (indicating the end of an instruction block) is encountered, the lock logic is disabled, the arbitration logic is unlocked, and normal mode resumes. In other words, the hardware must finish the entire instruction block before it can switch to another instruction block. The logic diagram is shown in fig. 3.
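The lock behaviour described above can be sketched as a small software model. The round-robin policy used when the arbiter is unlocked, and the names `Instr` and `arbitrate`, are assumptions made for this illustration; the text only specifies that BS = 1 enables the lock and BE = 1 disables it.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Instr:
    """Hypothetical instruction record carrying only the block markers."""
    name: str
    bs: int = 0   # 1 -> first instruction of an instruction block
    be: int = 0   # 1 -> last instruction of an instruction block


def arbitrate(waves: Dict[int, List[Instr]]) -> List[str]:
    """Issue one instruction per cycle; stay on one wave while its block is open."""
    queues = {wave_id: list(q) for wave_id, q in waves.items()}
    order = sorted(queues)
    issued: List[str] = []
    locked: Optional[int] = None   # wave_id held by the lock, if any
    rr = 0                         # round-robin pointer (assumed normal-mode policy)
    while any(queues.values()):
        if locked is not None and queues[locked]:
            wave = locked          # lock enabled: only the current wave may issue
        else:
            locked = None
            while not queues[order[rr % len(order)]]:
                rr += 1            # skip waves that have nothing left to issue
            wave = order[rr % len(order)]
            rr += 1
        ins = queues[wave].pop(0)
        issued.append(f"w{wave}:{ins.name}")
        if ins.bs == 1:
            locked = wave          # "BS = 1": enable the lock
        if ins.be == 1:
            locked = None          # "BE = 1": disable the lock, back to normal mode
    return issued


# Usage: two waves, each holding one instruction block.
blocks = arbitrate({
    0: [Instr("a0", bs=1), Instr("a1"), Instr("a2", be=1)],
    1: [Instr("b0", bs=1), Instr("b1", be=1)],
})
# Without block markers the same arbiter interleaves the waves.
free = arbitrate({0: [Instr("x0"), Instr("x1")], 1: [Instr("y0"), Instr("y1")]})
```

With the markers present, all instructions of wave 0 issue before any instruction of wave 1, which is what keeps the value on the pass-through path from being disturbed by an unrelated instruction.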
In order to support a new instruction, an embodiment of the present application further provides an instruction execution method, as shown in fig. 4, and the steps included therein will be described below with reference to fig. 4.
Step S201: Acquire an instruction to be executed.
Step S202: Obtain key information in the instruction to be executed, wherein the key information comprises source operand address information and destination address information.
After the instruction to be executed is acquired, the key information in it is obtained; the key information comprises source operand address information and destination address information. For example, the source operand address information corresponding to the Operand0_ID0 field, the Operand1_ID1 field, and the Operand2_ID2 field of the instruction shown in fig. 2 is obtained, and the destination address information corresponding to the Result_ID field and the DF field of the instruction shown in fig. 2 is obtained. The source operand address information indicates the source of the source operand, and the destination address information indicates the write path of the destination data.
Step S203: Judge whether the source operand indicated by the source operand address information comes from the pass-through path.
After the source operand address information is acquired, it is judged whether the source operand indicated by the source operand address information comes from the pass-through path. If so, step S204 is executed; if not, the source operand is obtained from the address pointed to by the source operand address information.
The judgment may be made as follows: whether the source operand indicated by the source operand address information comes from the pass-through path is determined by checking whether the source operand address information contains the second identifier. When the source operand address information contains the second identifier (VF = 125), the source operand indicated by the source operand address information comes from the pass-through path. It should be noted that the VF value indicating that the operand comes from the pass-through path is not limited to 125.
Step S204: obtaining a required source operand from the pass-through path.
When the source operand indicated by the source operand address information is derived from the pass-through path, the required source operand is obtained from the pass-through path.
Step S205: Judge whether the write path of the destination data indicated by the destination address information is the pass-through path.
After the destination address information is acquired, it is judged whether the write path of the destination data indicated by the destination address information is the pass-through path. If so, step S206 is executed; if not, the destination data is written to the address indicated by the destination address information.
Optionally, the judgment may be made as follows: whether the write path of the destination data indicated by the destination address information is the pass-through path is determined by checking whether the destination address information contains the first identifier (DF = 1). When the destination address information contains the first identifier, the write path of the destination data indicated by the destination address information is the pass-through path. Note that the value indicating that the destination data is written to the pass-through path is not limited to 1.
Step S206: Write the result of executing the instruction to be executed into the pass-through path.
When the write path of the destination data indicated by the destination address information is the pass-through path, the result of executing the instruction to be executed is written into the pass-through path.
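Steps S201 to S206 can be summarised in a short behavioural model. This is a sketch under stated assumptions, not the hardware implementation: the `Decoded` and `ExecUnit` names are hypothetical, the pass-through path is modelled as a single latch, the operation is fixed to a*b+c, and the operand ID value 125 is assumed to be reserved so that it never names a real VGPR entry.

```python
from dataclasses import dataclass
from typing import Dict, List

VF_FORWARD = 125   # example second identifier; assumed reserved, never a real VGPR index
DF_FORWARD = 1     # example first identifier


@dataclass
class Decoded:
    """Hypothetical key information extracted from one instruction (step S202)."""
    operand_ids: List[int]   # per operand: a VGPR index, or VF_FORWARD
    result_id: int           # VGPR index used when DF is not set
    df: int                  # DF field


class ExecUnit:
    def __init__(self, vgpr: Dict[int, int]):
        self.vgpr = dict(vgpr)
        self.forward = None      # pass-through path, modelled as a single latch
        self.vgpr_reads = 0
        self.vgpr_writes = 0

    def read_operand(self, operand_id: int) -> int:
        if operand_id == VF_FORWARD:     # S203: does the operand come from the path?
            return self.forward          # S204: take it from the pass-through path
        self.vgpr_reads += 1
        return self.vgpr[operand_id]     # otherwise read the VGPR

    def execute(self, ins: Decoded) -> int:
        a, b, c = (self.read_operand(x) for x in ins.operand_ids)
        result = a * b + c
        if ins.df == DF_FORWARD:         # S205: is the write path the pass-through path?
            self.forward = result        # S206: write the result to the path
        else:
            self.vgpr_writes += 1
            self.vgpr[ins.result_id] = result
        return result


# Usage: instruction0 forwards its result, instruction1 consumes it.
unit = ExecUnit({0: 2, 1: 3, 2: 0, 3: 4, 4: 5})
unit.execute(Decoded(operand_ids=[0, 1, 2], result_id=10, df=1))           # 2*3+0
unit.execute(Decoded(operand_ids=[3, 4, VF_FORWARD], result_id=10, df=0))  # 4*5+6
```

In this run the intermediate value never touches the VGPR: the model records five VGPR reads and a single VGPR write, compared with six reads and two writes if both instructions used the VGPR.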
In the embodiment of the application, when a compiler (a software program) detects that hardware can use pass-through data as source data, the compiler explicitly passes through destination data of an instruction0 to a source of an instruction1 or an instruction2 to realize data pass-through, so that the hardware skips VGPR writing of the instruction0 and VGPR reading of the instruction1 or the instruction2, and a large amount of power consumption is saved. For ease of understanding, an example of applying the method provided herein in matrix multiplication will be described below with reference to fig. 5.
It should be noted that the 3 A temporary registers (Temp Register For A) in fig. 5 are physically the same temporary register; 3 are drawn only because the value is delayed by 3 unit times. Similarly, the 2 B temporary registers (Temp Register For B) in the figure are physically the same temporary register; 2 are drawn because the value is delayed by 2 unit times. Since the VGPR has only one read port, the three input operands (A, B, C) must be read at staggered times, delayed by the temporary registers, and finally aligned at the input of the arithmetic unit (ALU). That is, operand A obtained from the VGPR is placed in the temporary register for A at the first time, operand B obtained from the VGPR is placed in the temporary register for B at the second time, and operand C obtained from the VGPR is placed in the temporary register for C at the third time, so that the three input operands (A, B, C) can be fed into the ALU simultaneously for calculation. The flow shown by the dotted lines in the figure is the execution logic of existing instructions: when instruction0 (instruction 0) is executed, source operand A, source operand B, and source operand C (0 in this case) are obtained from the VGPR, and when the calculation completes, the result must be written into the VGPR; when instruction1 is executed, source operand A, source operand B, and source operand C (now the result of the previous calculation, C1 = A0*B0+C0) are obtained from the VGPR, and when the calculation completes, the result must be written into the VGPR. These steps repeat until the matrix calculation is complete. The execution logic with the new instruction provided by the present application is shown by the solid lines in the figure, namely the solid lines marked ① and ②.
With the new instruction provided by the present application, the output of the ALU is bypassed directly to the input of the ALU. In the figure, the solid line marked ① represents data pass-through in the case where the operands are the same: the output of the ALU is used directly as the input for all three operands of the ALU, that is, the case A = B = C. The solid line marked ② represents the output of the ALU being used directly as the input for one of the three operands of the ALU. The three result temporary registers (Temp Register For Result) in the figure represent three paths; that is, the output of the ALU serves as the input for operand A, operand B, or operand C.
As is clear from fig. 5, after the method provided by the embodiment of the present application is applied, only the VGPR reads of the first instruction and the VGPR write of the last instruction remain, and the VGPR reads and writes of a large number of intermediate instructions are omitted, which reduces power consumption considerably. Next, a specific example of multiplying a matrix A by a matrix B is described, in which the result of instruction0 is used as a source operand of instruction1 and the result of instruction1 is used as a source operand of instruction2. In normal mode, instruction0 writes its result to the VGPR, instruction1 reads its source operand from the VGPR, instruction1 writes its result to the VGPR, and instruction2 reads its source operand from the VGPR. The following uses C_64x64 = A_64x64 * B_64x64 as an example; the 64x64 matrix size is only an example and is not limiting. Assume that there are 64 arithmetic logic units (ALUs), each with a VGPR space of 200x64 bits.
The calculation process is roughly as follows:
1) Matrix A is loaded in linear mode into the LDS (Local Data Share, local share unit):
A(0,0) → LDS(Address0); // A(0,0) is stored at Address0 of the LDS;
A(0,1) → LDS(Address1); // A(0,1) is stored at Address1 of the LDS;
A(0,2) → LDS(Address2); // A(0,2) is stored at Address2 of the LDS;
……
2) matrix B is loaded into the VGPR space as shown in Table 2.
TABLE 2
ALU0 | ALU1 | ALU2 | …… | ALU62 | ALU63
B0,0 | B0,1 | B0,2 | …… | B0,62 | B0,63
B1,0 | B1,1 | B1,2 | …… | B1,62 | B1,63
…… | …… | …… | …… | …… | ……
B63,0 | B63,1 | B63,2 | …… | B63,62 | B63,63
During calculation, the elements of matrix A are loaded one at a time into all 64 ALUs in parallel, and each ALU multiplies the element by the corresponding element of its own column of matrix B stored in its vector general-purpose registers. The 64 ALUs, in parallel, accumulate one by one the products of the elements of one row of matrix A and the corresponding elements of matrix B to obtain all the elements of the same row of matrix C, thereby completing the multiplication of matrix A and matrix B.
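The data flow just described can be checked with a small functional model. The function name `matmul_by_rows` is hypothetical, and the innermost loop stands in for the ALUs that operate in parallel in hardware; a 2x2 case is used so that the result is easy to verify by hand.

```python
from typing import List


def matmul_by_rows(A: List[List[int]], B: List[List[int]]) -> List[List[int]]:
    """Compute C = A * B one row at a time, with one accumulator per ALU (column)."""
    n = len(A)
    C = []
    for row in range(n):
        acc = [0] * n                    # accumulator of each ALU for this row of C
        for k in range(n):
            a = A[row][k]                # element of A broadcast to every ALU
            for alu in range(n):         # sequential stand-in for the parallel ALUs
                acc[alu] = a * B[k][alu] + acc[alu]   # ALU `alu` holds column `alu` of B
        C.append(acc)
    return C


product = matmul_by_rows([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Each pass of the `k` loop corresponds to one instruction of the instruction block for a row, and `acc[alu]` is the value that the explicit pass-through technique keeps on the pass-through path instead of in the VGPR.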
3) Calculating a matrix C:
the instructions for calculating matrix C in the normal mode are as follows:
M0_register=start_address; // Initial address of the M0 register. The M0 register stores the address of the element of matrix A to be read; after the 64 ALUs read the corresponding element of matrix A from the LDS in parallel according to the current address in the M0 register, it is automatically updated to the address of the next element.
//-----------------------------------------
// Calculate the first row of Matrix C:
// C(0,0) is calculated on ALU_Index0: ALU_Index=0.
// C(0,1) is calculated on ALU_Index1: ALU_Index=1.
//......
The execution instruction for each ALU to calculate a corresponding element in the first row of the matrix C is as follows:
Block_Start::C(0,ALU_Index)=LDS_Direct(M0_register)*B(0,ALU_Index);
C(0,ALU_Index)=LDS_Direct(M0_register)*B(1,ALU_Index)+C(0,ALU_Index);
C(0,ALU_Index)=LDS_Direct(M0_register)*B(2,ALU_Index)+C(0,ALU_Index);
C(0,ALU_Index)=LDS_Direct(M0_register)*B(3,ALU_Index)+C(0,ALU_Index);
C(0,ALU_Index)=LDS_Direct(M0_register)*B(4,ALU_Index)+C(0,ALU_Index);
......
Block_End::C(0,ALU_Index)=LDS_Direct(M0_register)*B(63,ALU_Index)+
C(0,ALU_Index);
//-----------------------------------------
// Calculate the second row of Matrix C:
// C(1,0) is calculated on ALU_Index0.
// C(1,1) is calculated on ALU_Index1.
//......
The execution instruction for each ALU to compute a corresponding element in the second row of the matrix C is as follows:
Block_Start::C(1,ALU_Index)=LDS_Direct(M0_register)*B(0,ALU_Index);
C(1,ALU_Index)=LDS_Direct(M0_register)*B(1,ALU_Index)+C(1,ALU_Index);
C(1,ALU_Index)=LDS_Direct(M0_register)*B(2,ALU_Index)+C(1,ALU_Index);
C(1,ALU_Index)=LDS_Direct(M0_register)*B(3,ALU_Index)+C(1,ALU_Index);
C(1,ALU_Index)=LDS_Direct(M0_register)*B(4,ALU_Index)+C(1,ALU_Index);
......
Block_End::C(1,ALU_Index)=LDS_Direct(M0_register)*B(63,ALU_Index)+
C(1,ALU_Index);
......
//-----------------------------------------
// Calculate the last row of Matrix C:
// C(63,0) is calculated on ALU_Index0.
// C(63,1) is calculated on ALU_Index1.
//......
The execution instruction for each ALU to compute a corresponding element in the last row of the matrix C is as follows:
Block_Start::C(63,ALU_Index)=LDS_Direct(M0_register)*B(0,ALU_Index);
C(63,ALU_Index)=LDS_Direct(M0_register)*B(1,ALU_Index)+C(63,ALU_Index);
C(63,ALU_Index)=LDS_Direct(M0_register)*B(2,ALU_Index)+C(63,ALU_Index);
C(63,ALU_Index)=LDS_Direct(M0_register)*B(3,ALU_Index)+C(63,ALU_Index);
......
Block_End::C(63,ALU_Index)=LDS_Direct(M0_register)*B(63,ALU_Index)+
C(63,ALU_Index);
Referring to the instruction listing above, calculating each row of matrix C requires 64 instructions, so the total number of instructions is 64x64 = 4096. Each instruction is executed once on each thread, so the total number of executions is 64x64x64. The first instruction of the instruction block for each row of matrix C, for example:
C(63,ALU_Index)=LDS_Direct(M0_register)*B(0,ALU_Index);
has one VGPR read and one VGPR write. Such an instruction occurs 64 times in total, so it accounts for 64x64 reads and 64x64 writes. The other instructions of the instruction block, for example:
C(63,ALU_Index)=LDS_Direct(M0_register)*B(1,ALU_Index)+C(63,ALU_Index);
have two VGPR reads and one VGPR write. Such instructions occur 63x64 times in total, so they account for 2x63x64x64 reads and 63x64x64 writes. A summary of the VGPR reads and writes is shown in Table 3.
TABLE 3
By using the explicit vector pass-through technique provided by the embodiment of the present application, the instruction for calculating the matrix C is as follows:
M0_register=start_address;
//-----------------------------------------
// Calculate the first row of Matrix C:
//C(0,0)is calculated on ALU_Index0:ALU_Index=0.
//C(0,1)is calculated on ALU_Index1:ALU_Index=1.
//......
//-----------------------------------------
Block_Start::Forwarding=LDS_Direct(M0_register)*B(0,ALU_Index);
Forwarding=LDS_Direct(M0_register)*B(1,ALU_Index)+Forwarding;
Forwarding=LDS_Direct(M0_register)*B(2,ALU_Index)+Forwarding;
Forwarding=LDS_Direct(M0_register)*B(3,ALU_Index)+Forwarding;
Forwarding=LDS_Direct(M0_register)*B(4,ALU_Index)+Forwarding;
......
Block_End::C(0,ALU_Index)=LDS_Direct(M0_register)*B(63,ALU_Index)+Forwarding;
//-----------------------------------------
// Calculate the second row of Matrix C:
//C(1,0)is calculated on ALU_Index0.
//C(1,1)is calculated on ALU_Index1.
//...........
//-----------------------------------------
Block_Start::Forwarding=LDS_Direct(M0_register)*B(0,ALU_Index);
Forwarding=LDS_Direct(M0_register)*B(1,ALU_Index)+Forwarding;
Forwarding=LDS_Direct(M0_register)*B(2,ALU_Index)+Forwarding;
Forwarding=LDS_Direct(M0_register)*B(3,ALU_Index)+Forwarding;
Forwarding=LDS_Direct(M0_register)*B(4,ALU_Index)+Forwarding;
......
Block_End::C(1,ALU_Index)=LDS_Direct(M0_register)*B(63,ALU_Index)+Forwarding;
......
//-----------------------------------------
// Calculate the last row of Matrix C:
//C(63,0)is calculated on ALU_Index0.
//C(63,1)is calculated on ALU_Index1.
//...........
//-----------------------------------------
Block_Start::Forwarding=LDS_Direct(M0_register)*B(0,ALU_Index);
Forwarding=LDS_Direct(M0_register)*B(1,ALU_Index)+Forwarding;
Forwarding=LDS_Direct(M0_register)*B(2,ALU_Index)+Forwarding;
Forwarding=LDS_Direct(M0_register)*B(3,ALU_Index)+Forwarding;
Forwarding=LDS_Direct(M0_register)*B(4,ALU_Index)+Forwarding;
......
Block_End::C(63,ALU_Index)=LDS_Direct(M0_register)*B(63,ALU_Index)+Forwarding;
Referring to the instruction listing above, calculating each row of matrix C requires 64 instructions, so the total number of instructions is 64x64 = 4096. Each instruction is executed once on each thread, so the total number of executions is 64x64x64. The last instruction of the instruction block for each row of matrix C, for example:
C(63,ALU_Index)=LDS_Direct(M0_register)*B(63,ALU_Index)+Forwarding;
has one VGPR read and one VGPR write. Such an instruction occurs 64 times in total, so it accounts for 64x64 reads and 64x64 writes. The other instructions of the instruction block, for example:
Forwarding=LDS_Direct(M0_register)*B(4,ALU_Index)+Forwarding;
have one VGPR read and no VGPR write. Such instructions occur 63x64 times in total, so they account for 63x64x64 reads. A summary of the VGPR reads and writes is shown in Table 4.
TABLE 4
In summary, in the typical matrix multiplication example described above, the explicit vector pass-through technique of the present application reduces the number of VGPR reads and writes from about 3x2^18 to about 2^18, that is, to about 1/3, so that a great deal of energy can be saved.
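The totals behind this estimate can be reproduced from the per-instruction counts given in the text. The function name `vgpr_traffic` is hypothetical, and the function only restates the arithmetic for an n x n matrix with n ALUs.

```python
def vgpr_traffic(n: int = 64):
    """Return VGPR (reads, writes) for normal mode and for explicit pass-through."""
    per_kind = n * n            # executions of one instruction position: n rows x n ALUs
    others = (n - 1) * per_kind  # executions of the remaining n-1 positions

    # Normal mode: first instruction 1 read + 1 write, every other instruction
    # 2 reads + 1 write.
    normal_reads = per_kind + 2 * others
    normal_writes = per_kind + others

    # Explicit pass-through: last instruction 1 read + 1 write, every other
    # instruction 1 read and no write.
    forward_reads = per_kind + others
    forward_writes = per_kind

    return (normal_reads, normal_writes), (forward_reads, forward_writes)


normal, forward = vgpr_traffic(64)
ratio = sum(forward) / sum(normal)
```

For n = 64 this gives 520192 reads and 262144 writes in normal mode (782336 accesses, close to 3x2^18 = 786432) and 262144 reads and 4096 writes with explicit pass-through (266240 accesses, close to 2^18 = 262144), a ratio of roughly 0.34.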
As shown in fig. 6, an embodiment of the present application further provides an instruction generating apparatus 100, including: a determination module 110, a generation module 120, and a transmission module 130.
A determining module 110, configured to determine that the instruction execution unit supports data pass-through.
A generating module 120, configured to, when an instruction is generated, if the destination data of an ith instruction needs to be used as a source operand of an (i+j)th instruction, set, in the ith instruction, a first identifier indicating that the destination data of the ith instruction is to be written into the pass-through path, and set, in the (i+j)th instruction, a second identifier indicating that the source operand is to be obtained from the pass-through path; i and j are positive integers.
A sending module 130, configured to send the ith instruction and the (i + j) th instruction to the instruction execution unit, so that the instruction execution unit writes a result of the ith instruction into a through path according to the first identifier when executing the ith instruction, and obtains a required source operand from the through path according to the second identifier when executing the (i + j) th instruction.
Optionally, the sending module 130 is configured to splice the ith instruction and the (i + j) th instruction according to their generation order to obtain an instruction block, and to send the instruction block to a decoder, so that the decoder sequentially acquires first key information in the ith instruction from the instruction block and sends the first key information to the instruction execution unit, and acquires second key information in the (i + j) th instruction and sends the second key information to the instruction execution unit, wherein the first key information comprises the first identifier, and the second key information comprises the second identifier.
The instruction generating apparatus 100 provided in the embodiment of the present application has the same implementation principle and the same technical effect as the foregoing method embodiments; for brevity, for parts not mentioned in the apparatus embodiments, reference may be made to the corresponding contents in the foregoing method embodiments.
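The behavior of the determining and generating modules can be sketched as follows. This is a minimal illustration under stated assumptions, not the apparatus itself: the `Instruction` class, its two boolean fields standing in for the first and second identifiers, the `mark_pass_through` helper, and the default limit of 2 pass-through stages (taken from the example earlier in the description) are all hypothetical. The list indices are 0-based, whereas the text counts instructions from 1.

```python
from dataclasses import dataclass


@dataclass
class Instruction:
    opcode: str
    write_to_through_path: bool = False   # stands in for the "first identifier"
    read_from_through_path: bool = False  # stands in for the "second identifier"


def mark_pass_through(instructions, i, j, supports_pass_through=True, max_stages=2):
    """Flag instruction i as producer and instruction i + j as consumer."""
    if not supports_pass_through:
        return instructions  # execution unit cannot forward data; leave unchanged
    if not 1 <= j <= max_stages:
        raise ValueError("j exceeds the pass-through stages supported by hardware")
    instructions[i].write_to_through_path = True
    instructions[i + j].read_from_through_path = True
    return instructions


# Four multiply-add instructions; the result of the first feeds the second.
block = mark_pass_through([Instruction("mad") for _ in range(4)], i=0, j=1)
```

Keeping the identifiers as per-instruction flags mirrors the description, in which the producer and the consumer are marked independently and the hardware only needs to inspect the instruction currently being executed.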
FIG. 7 is a block diagram illustrating a structure of a processor 200 according to an embodiment of the present disclosure. The processor 200 includes: a processor core 210 (kernel), a decoder 220, and an instruction execution unit 230. The processor core 210, the decoder 220, and the instruction execution unit 230 are connected by a bus interconnect.
The processor core 210 stores program code which, when executed, generates instructions. Accordingly, the processor core 210 is configured to: determine that the instruction execution unit 230 supports data pass-through; when an instruction is generated, if the destination data of the ith instruction needs to be used as the source operand of the (i + j) th instruction, set, in the ith instruction, a first identifier for indicating that the destination data of the ith instruction is to be written into a through path, and set, in the (i + j) th instruction, a second identifier for indicating that the source operand is to be obtained from the through path; and send the ith instruction and the (i + j) th instruction to the instruction execution unit 230; wherein i and j are positive integers.
The instruction execution unit 230 is configured to, when the ith instruction is executed, write a result of the ith instruction into the through path according to the first identifier, and, when the (i + j) th instruction is executed, obtain a required source operand from the through path according to the second identifier.
Optionally, the processor core 210 is further configured to splice the ith instruction and the (i + j) th instruction according to their generation order to obtain an instruction block, and to send the instruction block to the decoder 220. Correspondingly, the decoder 220 is configured to sequentially obtain first key information in the ith instruction from the instruction block and send the first key information to the instruction execution unit 230, and to obtain second key information in the (i + j) th instruction and send the second key information to the instruction execution unit 230. The first key information includes the first identifier, and the second key information includes the second identifier.
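The decoder's role in this flow can be illustrated with a short sketch. It is only an assumption-laden illustration: the dictionary layout of an instruction, the field names `first_identifier` and `second_identifier`, and the `decode_block` function are hypothetical and are not taken from the embodiments.

```python
def decode_block(instruction_block):
    """Yield the key information of each instruction, in generation order."""
    for instr in instruction_block:
        yield {
            "opcode": instr["opcode"],
            # Missing flags default to False: the instruction uses the VGPR as usual.
            "first_identifier": instr.get("first_identifier", False),
            "second_identifier": instr.get("second_identifier", False),
        }


# Instruction block spliced in generation order (here j = 1).
instruction_block = [
    {"opcode": "mad", "first_identifier": True},   # ith instruction: producer
    {"opcode": "mad", "second_identifier": True},  # (i + j)th instruction: consumer
]

# In hardware each item would be sent to the instruction execution unit in turn.
key_info = list(decode_block(instruction_block))
```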
In addition, the decoder 220 is further configured to obtain an instruction to be executed and to obtain key information in the instruction to be executed, where the key information includes source operand address information and destination address information, the source operand address information being used for indicating the source of a source operand, and the destination address information being used for indicating a write path of destination data; the decoder 220 is also configured to send the key information to the instruction execution unit 230. Accordingly, the instruction execution unit 230 is configured to: determine whether the source operand indicated by the source operand address information originates from a through path; when the source operand indicated by the source operand address information originates from the through path, obtain the required source operand from the through path; determine whether the write path of the destination data indicated by the destination address information is the through path; and when the write path of the destination data indicated by the destination address information is the through path, write the result of executing the instruction to be executed into the through path.
Optionally, the instruction execution unit 230 is configured to determine whether the source operand indicated by the source operand address information originates from the through path by determining whether the source operand address information contains a second identifier; when the source operand address information contains the second identifier, this indicates that the source operand indicated by the source operand address information originates from the through path. Optionally, the instruction execution unit 230 is configured to determine whether a write path of destination data indicated by the destination address information is the through path by determining whether the destination address information contains a first identifier; when the destination address information contains the first identifier, this indicates that the write path of the destination data indicated by the destination address information is the through path.
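The checks performed by the instruction execution unit can be sketched as below. This is a simplified software model under stated assumptions, not the hardware: a single marker value `THROUGH` stands in for both the first and the second identifier, the through path holds one value, the operation is assumed to be a multiply-add of the form A*B+C, and the `ExecutionUnit` class and the register names are hypothetical.

```python
THROUGH = "FWD"  # hypothetical marker standing in for the first/second identifier


class ExecutionUnit:
    def __init__(self, vgpr):
        self.vgpr = dict(vgpr)   # register name -> value
        self.through_path = 0    # value currently held on the through path
        self.vgpr_reads = 0
        self.vgpr_writes = 0

    def _read(self, src):
        if src == THROUGH:       # source address carries the second identifier
            return self.through_path
        self.vgpr_reads += 1     # otherwise a real VGPR read takes place
        return self.vgpr[src]

    def execute_mad(self, src_a, src_b, src_c, dst):
        result = self._read(src_a) * self._read(src_b) + self._read(src_c)
        if dst == THROUGH:       # destination address carries the first identifier
            self.through_path = result
        else:
            self.vgpr_writes += 1
            self.vgpr[dst] = result
        return result


eu = ExecutionUnit({"v0": 2, "v1": 3, "v2": 0, "v3": 4, "v4": 5})
eu.execute_mad("v0", "v1", "v2", THROUGH)  # result goes to the through path, no VGPR write
eu.execute_mad("v3", "v4", THROUGH, "v5")  # accumulator comes from the through path
```

In this model the first instruction skips its VGPR write and the second skips one VGPR read, which is the saving the counters `vgpr_reads` and `vgpr_writes` make visible.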
The processor 200 may be an integrated circuit chip having signal processing capability. The processor 200 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor 200 may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor 200 may be any conventional processor or the like.
An embodiment of the present application also provides an electronic device including the processor 200 described above; the electronic device may be, for example, a computer or a server.
The present embodiment also provides a non-volatile computer-readable storage medium (hereinafter referred to as a storage medium), where the storage medium stores a computer program, and when the computer program is executed by the processor 200, the computer program executes the steps included in the instruction generating method and the instruction executing method in the above embodiments.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, or the part thereof that contributes over the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a notebook computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.