1. Introduction
In recent years, the rapid development of the Internet-of-Things (IoT) and artificial intelligence (AI) has led to a rise in the demand for advanced computational processing for all device types; for example, communication devices, such as mobile terminals and wearable devices; mobile vehicles, such as self–driving cars and public transportation systems; data centers for learning AI; all large and small central processing units (CPUs), graphics processing units, and microcomputers. It is important to reduce the power consumption of these devices [
1,
2,
3]. A reduction in the power consumption of mobile terminals and IoT devices improves their battery life. Furthermore, the power consumption of data centers has considerably increased owing to the increased use of AI; this has become a bottleneck in AI development [
4,
5,
6,
7]. Fast computation with low power consumption is required in all fields of computation [
8,
9,
10,
11,
12].
We previously proposed a fast computation algorithm for embedded systems, which approximates nonlinear elementary functions, such as trigonometric and inverse trigonometric, with high accuracy using multiply–add operations [
13,
14]. This fast computation is important for the attitude computation of wearable IoT sensors and the attitude control of drones, which require fast processing and low-power consumption [
15]. Division is crucial for the embedded implementation of algorithms. Although division is one of the four basic arithmetic operations, its computational cost is substantially higher than that of a multiply–add operation. Therefore, various methods have been proposed to achieve efficient division. Imani et al. propose a configurable approximate divider CADE that performs floating-point division with controllable precision at runtime [
16]. Liu et al. [
17] and Saadat et al. [
18] proposed a method based on logarithmic divisors and applied it to image processing and image compression. Vahdat et al. [
19] and Zendegani et al. [
20] also proposed a method to convert division into the product of the reciprocal of the divisor and applied it to image processing on DSPs and elsewhere.
One method for accelerating the calculations in embedded systems is to use integers (or fixed-point numbers) instead of floating-point numbers, but this method is incompatible with conventional division and normalization algorithms. Therefore, embedded engineers encounter problems in instances in which they want to use integer arithmetic for implementation, but real number arithmetic is required to compute vectors and other higher-dimensional algebra. As a solution to this problem, this study proposes a fixed-point vector normalization method using only multiply–add and bit-shift operations for high-speed and power-saving processing in embedded systems. The proposed method can be easily implemented in hardware, because it operates based only on simple sum-of-products operations.
In this study, we confirm that the proposed method can be implemented in FPGA, can only be composed of combinational circuits, and can be executed using hardware with an electrical delay of at most one clock operation. As will be shown later, there is a trade-off between the accuracy of the conventional method over the entire definition region and the efficiency of the proposed method. When the domain is known, the efficiency of our proposed method is very effective in reducing power consumption.
A limitation of the proposed method is that instead of being very fast and efficient, it limits the domain over which it can be computed accurately. The proposed method is referred to as pseudo-normalization, because it is not available for all definition domains. A domain can be controlled by the number of iterations of the algorithm. As the range of vectors used in embedded systems is typically known, the algorithm can be used in a broad range of fields by appropriately designing systems. The proposed algorithm consists of the fast approximate computation of the inverse function and square root function , both of which can be used independently.
In addition, for the approximate calculations of trigonometric and inverse trigonometric functions designed for floating-point numbers, integer transformations are performed using the fast inverse square root (FISR) algorithm to convert them to fixed-point numbers. The acceleration effect of the floating-point computation is limited because of the optimization of the instruction set and compiler of the CPU and microcontroller. However, the implementation of fixed-point computations is approximately three times faster than that of “math.h” on a CPU with an FPU. The proposed algorithm can be used in various vector normalization applications. An example of accelerating integer transformation using inverse trigonometric functions is described herein.
2. Method
First, we overview the conventional and proposed normalization algorithms. Subsequently, we design an approximate computation of one of the components, the reciprocal, using the residual correction method. We then design an approximate computation method for the square root computation, which is another component of the normalization algorithm. Finally, we combine them to propose pseudo-normalization as a vector normalization for integer arithmetic (proposed method).
2.1. Conventional and Proposed Methods
It is important to optimize algorithms in IoT systems wherein power efficiency is required and the amount of computation should be reduced (even by one clock operation). The low-power Coretex-M0 omits the DIV instruction [
21], and in low-spec microcontrollers, 45 divisions are implemented in software rather than provided in the instruction set [
22]. Based on this, the FISR algorithm has been developed for IEEE754 floating-point numbers [
23,
24,
25,
26]. This algorithm uses the bit structure of IEEE754 to approximate rapidly
. Additionally, various extension algorithms have been proposed in recent years [
27,
28,
29]. The FISR algorithm can be used for floating-point arithmetic in the IEEE754 format and has been implemented in various programming languages. It takes advantage of the fact that bit shifting the bit structure of a floating-point number results in a value close to the inverse square root of the original number. It calculates a practical value by converging the approximate value using Newton’s method.
We have used the FISR algorithm to develop fast computation algorithms for embedded systems; however, it is desirable to develop algorithms with fixed-point numbers rather than floating-point numbers to increase computational speed [
30]. Therefore, we propose a new vector normalization algorithm for fixed-point numbers that does not require floating-point number inputs.
In the proposed method, we normalize vectors without using division. If the norm of the vector
is
, normalization is performed to find
. Herein, we can consider normalization if we can compute the function
, i.e., the inverse of the norm
within a domain of a function by applying a multiply–add operation on
r. As division is required to create
in any domain, we consider pseudo-normalization as an approximation restricted to the domain of the function. The next section describes the design of the algorithm. First, we develop a low-cost approximation algorithm for each real number using the proposed residual correction method [
31]. This method can be used to obtain a highly accurate approximate function within a defined region using only multiply–add operations. We then modify the algorithm for use to fixed-point numbers.
2.2. Optimal Reciprocal Approximation
The first formulation of the normalization problem is to search for such that in a specific domain of the function, . For approximate computations, should be an approximate function such that the error is globally smaller within the domain of definition rather than being strictly accurate. We propose the residual correction method to search for such approximate functions and realize highly accurate approximate trigonometric and arctangent functions. In a previously proposed residual correction method, a simple approximation was considered and its residuals were evaluated to create a residual correction function. A highly accurate approximate function was then obtained by removing the residuals. However, owing to the strong nonlinearity of the normalization targeted here, multiplicative residual correction is preferred over additive residual correction.
We consider a Taylor expansion at approximately
, which is suitable as a simple approximate function.
has a wider domain compared with
r without normalization.
We also consider a function that corrects this residual. As additive residual correction results in a constant equation, we use multiplicative residual correction to search for
that minimizes the residual of
. For function shaping, a downward convex quadratic function centered at
is a candidate for
, and it must satisfy
. The following are possible candidates for this function,
Therefore,
has the following form,
The result corrected by this function is shown in
Figure 1 as the second expansion of the domain of definition by the residual correction method. Without normalization (
), it is evident that only
satisfies
. However, with the first expansion of the domain of definition, the domain that satisfies
expands to approximately
. After the second expansion, the domain that satisfies
expands to approximately
. This second approximation can have a broader definition domain compared with the approximate formula obtained using a Taylor expansion of the same order.
Further expansion of the domain using the residual correction method increases the computational cost owing to the strong nonlinearity of the original function; however, the error reduction effect is negligible. Therefore, we consider expanding the domain of definition by iteratively applying this approximate function. The application of
once and its modification to
is a more ideal state from the norm’s perspective with a reduced error. As shown in
Figure 1b, this improvement can be obtained at all points except the endpoints of the
region. Therefore, we consider the repeated application of
as follows:
The number of primes indicates the number of iterations. Although the formula for the expansion looks complicated, it is simply an iterative process associated with the application of
, which can ultimately be considered as a function of
r. The effect of this reiteration is shown in
Figure 2. Two iterations of
expand the domain of definition to
, and three iterations of
expand the domain to
, thus covering almost the entire area. With a good approximate function and convergence by iteration, we obtain a highly accurate estimation formula for
. The increase in computational cost owing to iterative calculations is described in
Section 4.
2.3. Square Root Approximation
To find the norm of a vector, the sum of squares of each element can be directly calculated using a multiply–add operation, but the square root is required for normalization. Therefore, we consider the approximate function of the square root, which is .
To determine the approximate function using the residual correction method, we first consider a simple candidate approximate function. The function design uses
multiples for all but the constant terms, based on an embedded implementation. This can be obtained as the following linear expression using the Taylor expansion around
.
We obtain the blue dots and solid green line in
Figure 3a by evaluating the residuals of the simple approximate function
. The search for residual correction functions leads to the following candidate functions,
Thus, the second approximation is as follows:
This is a more reduced residual in the region compared with the fourth-order Taylor expansion. Although higher-order Taylor expansions are required to reduce errors over the entire region, the residuals are minimized by restricting the region. The coefficients are adjusted to powers of 2 for the subsequent calculations.
2.4. Pseudo-Normalization for Integer
In the previous sections, we showed that the inverse and square roots can be approximated using only a finite number of multiply–add operations. We convert these to fixed-point calculations, and consider fast normalization using an integer. We convert floating-point
r to fixed-point
z. For the fixed-point representation,
is defined as
, where
and
is used as the base. First, the inverse is computed as follows:
where the operation ≫ denotes a right arithmetic shift. The reciprocal is likely to be small. In particular, integer expressions are not possible in this domain. Therefore, the value is expressed in the upper bits of a fixed-point number; hence, it is defined as
(=
M/
r). For normalization, the bit width is adjusted by bit shifting after multiplying the number to be normalized by the reciprocal estimate. For example, if
is an original vector, it can be normalized to
. Here, this equation and Equation (
11) adjust the position of the bit shift. Although the computation of real numbers does not cause a drop-of-digit problem, integer multiplication and bit shifting change the bit width. Thus, the product is adjusted such that it is placed before bit shifting.
The square root can be computed using a multiply–add operation and bit-shift operator, as shown by the following equation:
These are used to construct a pseudo-normalization algorithm. For any input vector of integers, the square of its norm is computed (
Figure 4) using a simple multiply–add operation. The norm is then obtained by the approximate function
of the square root. Finally, the reciprocal approximate function
provides the final normalized coefficients.
3. Results and Discussion
First, we show how the proposed algorithm works on an actual FPGA device, based on simulations and experiments. Considerations regarding computation time and efficiency are then discussed. Finally, as an example of the proposed method, an angle measurement by the IMU is demonstrated. The IMU angle measurement is one of the applications suitable for the proposed method, because it uses acceleration vectors and geomagnetic vectors, and the approximate length of the vectors is known and nonzero.
3.1. Results
The behavior of the proposed algorithm was examined based on simulations. First, the effect of the number of normalization iterations
n was investigated experimentally; the results are shown in
Figure 5. For practical purposes, one or two iterations are sufficient for expanding the domain of definition. However, if the range of the input norm is broad, more iterations can be used to extend the definition domain.
Subsequently, we describe the proposed normalization algorithm in Verilog, which is a hardware description language, and examine its computability for embedded systems. The basic circuit of the proposed algorithm in Verilog is shown in
Figure 6. In addition, the inverse and root functions are shown in
Figure 7 and
Figure 8 as the basic modules. These bit-widths and constants are slightly adjusted to simplify the Verilog’s diagram, but do not affect the calculated results. The simulated operation is shown in
Figure 9. In the simulation,
is generated by assuming
, and the magnitude of the inputs is changed from 0 to 200. The simulation results show that the output vector
is almost constant in the range of the input magnitude
. Normalization is performed to cancel the variation in
.
Finally, the proposed algorithm was implemented on an actual FPGA device to evaluate its computational feasibility and speed. The FPGA used was the DE0-nano (manufactured by Terasic Inc., Hsinchu, Taiwan), an evaluation board for INTEL’s Cyclone IV. The Verilog HDL source was the same as that used in the aforementioned simulation.
Figure 10a shows the results. The experimental results show that the expression
can be calculated.
Figure 10b also shows the timing of the instant when a specific bit on the input becomes “1” and the instant at which a specific bit on the output becomes “1”. The FPGA used in this study is 50 MHz (sampling time 20 ns); this means that our algorithm can compute with only the electrical delay is less than one cycle. The proposed algorithm can be implemented using only combinational circuits, and does not require pipeline processing like sequential circuits, so it can be processed within one cycle in embedded circuits (such as FPGAs). Efficient computation with combinational circuits has been the subject of active research in recent years, including work on binary tree-based dividers, such as those proposed in [
33]. A comparison with other similar vector normalization algorithms is shown in
Table 1. Note that other similar algorithms are implemented as sequential circuits rather than combinational circuits; therefore, the evaluation methods are different. As the conventional algorithms require approximately 20 cycles, implementation in the same FPGA requires a computation time which is approximately equal to 400 ns. Our algorithm, which can be implemented using only combinational circuits for embedded circuits such as FPGAs, is very fast.
Based on these results, accuracy and computation speed of the proposed algorithm and the proposed algorithm application will be discussed in subsequent sections.
3.2. Calculation Speed and Accuracy
As shown in the design phase, the approximation algorithm for the root computation is accurate for real numbers close to . However, the error increases as the numbers deviate from . This is because the highly nonlinear root function is approximated by a linear polynomial with residual correction, and further residual correction to reduce the error increases the computational cost. Therefore, it is desirable to utilize the system in fields where a limit of application can be ensured.
The accuracy of the inverse approximate function
increases with the number of iterations. Although shown in the design phase of product representation, this can be rewritten as follows: the repeated application of
is a recursive structure, and a recurrence relation can be computed.
This recurrence relation is equivalent to Newton’s method for
and is ensured to converge. In addition, compared with the Taylor expansion of
, the following equation holds.
In other words, (n + 1) iterations in the approximate function coincide with the Taylor expansion up to
odd orders. This implies that the iterative computation of the proposed method is analytically equal to the original function in
. Additionally, the proposed multiplicative expansion can be computed with
, whereas
is required to achieve the same accuracy in the Taylor expansion, which significantly reduces the computational cost. Although the proposed method cannot yield Taylor expansion approximations up to even orders because
is an odd function, the form of the proposed method that does not deal with even orders is the preferred expansion basis. The amount of computations and the corresponding error order are summarized in
Table 2.
Using higher-order Taylor expansions as candidates for the residual correction function is one approach, but this approach will likely violate the concept of the residual correction method, which is the search of a simple correction function for efficient computation. However, the fact that the search results agree with the Taylor expansion results is a basis for asserting the validity of the correction function. Currently, the search for the correction function is conducted both empirically and heuristically, but the analytical agreement implies a degree of similarity with the original function, which could be one of the indicators of the search.
3.3. Further Extension of Domain of Definition
Certain applications require broad definition domains. The domain of definition can be expanded by increasing the number of iterations; however, a large number of iterations reduces computational efficiency. Therefore, we expand the domain using another method. First, accuracy decreases when the norm is small. Therefore, when the norm is smaller than a certain value
, the domain expansion as an analogy to the domain expansion based on the double-angle formulas of trigonometric functions [
13] is as follows:
For
,
is experimentally confirmed to be appropriate in terms of continuity and domain expansion. This effect is shown in
Figure 11a. Although
is used here, the domain of definition can be further expanded close to
by repeating
, as shown in
Figure 11b. Furthermore, we note that this is also true for large norms. This is because the norm is estimated to be smaller in the first iteration, even when it is large; as a result, the iterative computation is effective when the norm is large or small. This process extends the available input range to approximately
.
3.4. Application: Computing atan2 Function Using Integer Arithmetic
We have proposed the low-cost and fast approximate computation of trigonometric and inverse trigonometric functions for small embedded devices for the motion measurement of robots and humans. The algorithm is designed using a multiply–add operation. However, the prerequisite is that the input vector (sensor output) must be normalized using the FISR algorithm. The FISR algorithm is used for floating-point numbers, which limits its ability to perform processing in integer calculation cases that are ideal for embedded devices. The proposed method applies the FISR algorithm to integers to solve this problem. For this purpose, integer transformations are performed on previously proposed trigonometric and inverse trigonometric functions (atan2). As shown in the results, a limitation of the proposed method is that the error becomes larger at the edge of the definition region. However, in the IMU sensor signal processing application considered here, such as the gravity acceleration and geomagnetic vectors, the norm as the basis of the frame matrix is not zero, but has a certain magnitude, and the objective is to normalize it. Therefore, it is a suitable target as an application example for the proposed method. The structure of the equation remains the same, and an integer conversion with
yields the following results:
This encodes the angle as an intuitively understandable integer from
to 180 degrees. The process flow of this calculation is shown in
Figure 12. The details of the algorithm are the same as those provided in [
13], except that the input and output are integers. Even if all approximations are made using integers (
Figure 13a), the estimation error is within the quantization error, because it is less than or equal to 1°, although the resolution is increased for evaluation. The results with floating-point numbers, which have been reported in a previous publication [
13], are shown in
Figure 13b. The maximum error in the floating-point calculation was about
rad (=0.05°), while the integer conversion resulted in an error approximately equal to 1 degree. This calculation is approximately three times faster than that of “math.h”, even when it is performed on a CPU with an FPU. Therefore, fast calculations can be expected even if a microcontroller without an FPU or compiler with low optimization capability is used.
We applied integer pseudo-normalization as a preprocessor for a previously developed algorithm. However, it can be applied to various preprocessors that require fast normalization. For example, the proposed algorithm may be effective when performing a large number of calculations on GPUs, such as computing normal vectors for computer graphics and normalization for machine learning. Especially in the field of machine learning, implementation techniques on FPGAs are important for efficient learning and edge AI [
37]. Normalization, division, and square roots are basic elements used to calibrate various calculations. We are currently discussing their application to the field of machine learning. In future work, we will work on the improvement of the efficiency of machine learning implementation by applying the proposed method.
4. Conclusions
We propose pseudo-normalization as a vector normalization method for fixed-point numbers that used only multiply–add operations and bit shifts. Pseudo-normalization consists of the approximate fast computation of and , which can also be used independently.
The approximate fast computation of can be expanded by n iterations, and it is analytically equivalent to the original function at . The proposed multiplicative approximation method converges more efficiently than the series approximation method. Thus, only one or more iterations are required to create a sufficiently broad domain of definition for embedded systems. To verify the feasibility of implementing the algorithm in a FPGA, the algorithm is verified using Verilog and simulations. As an example of applying the proposed algorithm, we confirm that the previously developed fast computation of inverse trigonometric functions can be combined with the proposed algorithm to perform fast and accurate approximate computation by integer transformation.
Future studies will include the construction of a high-speed sensor signal processing system using an actual FPGA device and its application to high-precision human motion measurement with high-temporal resolution. In addition, we will develop a fast computation method for nonlinear elementary functions, such as exponential and logarithmic functions, using the residual correction method. By developing an environment in which these basic operations can be efficiently processed in hardware, we will work to realize large-scale advanced technologies such as machine learning on edge devices.